Towards Copy-Aware Assembly of the Sugarcane Genome

Margarido, Gabriel R.A.

The sugarcane genome contains multiple copies of most genomic regions. Correctly separating these copies is important for assembly; for example, it can facilitate the discovery of differentially expressed promoters for a given gene. Nonetheless, current diploid-oriented assemblers collapse multiple copies into a single contig. We have developed an assembly approach that avoids this collapse. First, a layout algorithm threads reads through existing putative contigs, distinguishing true polymorphisms (both SNPs and rearrangements) from sequencing errors. Next, a clustering algorithm based on a mixture model that incorporates the error model of the sequencing technology refines these initial read assignments and places reads that were rejected from the initial layout. We will describe results of this approach on synthetic data generated from a hypothetical polyploidization, with rearrangements, of the genome of sorghum, the closest diploid relative of sugarcane.
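To make the clustering step concrete, the following is a minimal sketch of a mixture-model refinement of read-to-copy assignments. It is not the authors' implementation. It assumes that each read has already been reduced to its alleles at a fixed set of polymorphic sites (with "-" for uncovered sites), that sequencing errors are uniform substitutions at rate `eps`, and that reads rejected by the layout are marked `None`. All names (`refine_assignments`, `consensus`, `log_lik`) are hypothetical.

```python
import math

ALPHABET = "ACGT"

def consensus(reads, resp, k, n_sites):
    """Weighted majority allele per copy and site (M-step for the copies)."""
    haps = []
    for c in range(k):
        hap = []
        for s in range(n_sites):
            weight = {a: 0.0 for a in ALPHABET}
            for read, r in zip(reads, resp):
                if read[s] in weight:  # skip uncovered sites ("-")
                    weight[read[s]] += r[c]
            hap.append(max(ALPHABET, key=lambda a: weight[a]))
        haps.append("".join(hap))
    return haps

def log_lik(read, hap, eps):
    """Log-likelihood of a read given a copy, under a uniform substitution error model."""
    total = 0.0
    for a, b in zip(read, hap):
        if a == "-":
            continue
        total += math.log(1.0 - eps) if a == b else math.log(eps / 3.0)
    return total

def refine_assignments(reads, init, k, eps=0.01, n_iter=20):
    """EM refinement of initial read-to-copy assignments.

    init[i] is the copy index from the initial layout, or None for a
    read that the layout rejected (assumed convention).
    """
    n_sites = len(reads[0])
    # Initial responsibilities: one-hot for laid-out reads, uniform for rejected reads.
    resp = [
        [1.0 / k] * k if a is None else [1.0 if c == a else 0.0 for c in range(k)]
        for a in init
    ]
    for _ in range(n_iter):
        # M-step: copy consensus sequences and mixing proportions.
        haps = consensus(reads, resp, k, n_sites)
        pi = [max(sum(r[c] for r in resp) / len(reads), 1e-9) for c in range(k)]
        # E-step: posterior probability of each copy for each read.
        new_resp = []
        for read in reads:
            logs = [math.log(pi[c]) + log_lik(read, haps[c], eps) for c in range(k)]
            top = max(logs)
            w = [math.exp(x - top) for x in logs]
            z = sum(w)
            new_resp.append([x / z for x in w])
        resp = new_resp
    haps = consensus(reads, resp, k, n_sites)
    assign = [max(range(k), key=lambda c: r[c]) for r in resp]
    return assign, haps

# Toy example: two copies differing at sites 1, 3 and 5.
reads = ["ACGTAC", "ACGTAC", "ACG---", "ATGCAT", "ATGCAT", "--GCAT",
         "ATGTAT",   # copy-1 read with one error, misplaced on copy 0 by the layout
         "ACGTAC"]   # read rejected by the layout
init = [0, 0, 0, 1, 1, 1, 0, None]
assign, haps = refine_assignments(reads, init, k=2)
print(assign, haps)
```

In this toy input, the misplaced read has two mismatches against copy 0 but only one against copy 1, so the error model moves it to copy 1, while the rejected read is placed on copy 0. A real data set would also need a quality-aware error model and handling of rearrangements, which this sketch omits.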

W693 Towards Copy-Aware Assembly of the Sugarcane Genome