W693 Towards Copy-Aware Assembly of the Sugarcane Genome

Date: Sunday, January 15, 2012
Time: 3:10 PM
Room: Royal Palm Salon 1-2
Gabriel R.A. Margarido, Microsoft Research, Los Angeles, CA
Cristina Pop, Microsoft Research, Los Angeles, CA
Bob Davidson, Microsoft Research, Redmond, WA
Glaucia Souza, University of Sao Paulo, Sao Paulo, Brazil
David Heckerman, Microsoft, Los Angeles, CA
The sugarcane genome contains multiple copies of most genomic regions. Correctly separating these copies is important for assembly because it can, for example, facilitate the discovery of differentially expressed promoters for a given gene. Current diploid-oriented assemblers, however, collapse multiple copies into a single contig. We have developed an assembly approach that avoids this collapse. First, a layout algorithm threads reads through existing putative contigs, identifying true polymorphisms (both SNPs and rearrangements) and sequencing errors. Next, a clustering algorithm, based on a mixture model that incorporates the error model of the sequencing technology, refines these initial read assignments and also handles reads rejected from the initial layout. We will describe results of this approach on synthetic data generated from a hypothetical polyploidization, with rearrangements, of the sorghum genome; sorghum is the closest diploid species to sugarcane.
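To make the clustering step concrete, the following is a minimal sketch of mixture-model read clustering, not the authors' implementation. It assumes biallelic polymorphic sites coded 0/1 (with None where a read does not cover a site), a single uniform per-base error rate, and a known number of copies; the function names (`em_cluster`, `init_haplotypes`, and so on) and the farthest-first initialisation are hypothetical choices made for illustration only.

```python
import math


def read_log_likelihood(read, haplotype, error_rate):
    """Log P(read | haplotype) under a uniform per-base error model."""
    ll = 0.0
    for obs, allele in zip(read, haplotype):
        if obs is None:  # site not covered by this read
            continue
        ll += math.log(1.0 - error_rate) if obs == allele else math.log(error_rate)
    return ll


def hamming(a, b):
    """Number of mismatches over sites observed in both vectors."""
    return sum(1 for x, y in zip(a, b)
               if x is not None and y is not None and x != y)


def init_haplotypes(reads, n_copies):
    """Farthest-first seeding (an assumption, chosen to be deterministic)."""
    centres = [reads[0]]
    while len(centres) < n_copies:
        far = max(reads, key=lambda r: min(hamming(r, c) for c in centres))
        centres.append(far)
    return [[0 if a is None else a for a in c] for c in centres]


def em_cluster(reads, n_copies, error_rate=0.01, n_iter=20):
    """EM for a mixture of n_copies haplotypes; returns (haps, weights, labels)."""
    n_sites = len(reads[0])
    haps = init_haplotypes(reads, n_copies)
    weights = [1.0 / n_copies] * n_copies
    resp = []
    for _ in range(n_iter):
        # E-step: posterior probability that each read came from each copy.
        resp = []
        for r in reads:
            logp = [math.log(w) + read_log_likelihood(r, h, error_rate)
                    for w, h in zip(weights, haps)]
            m = max(logp)
            p = [math.exp(x - m) for x in logp]
            s = sum(p)
            resp.append([x / s for x in p])
        # M-step: update mixing weights, then set each haplotype allele to the
        # responsibility-weighted majority (the maximiser when error_rate < 0.5).
        for k in range(n_copies):
            total = sum(rp[k] for rp in resp)
            weights[k] = max(total / len(reads), 1e-9)
            for j in range(n_sites):
                vote1 = sum(rp[k] for r, rp in zip(reads, resp) if r[j] == 1)
                vote0 = sum(rp[k] for r, rp in zip(reads, resp) if r[j] == 0)
                haps[k][j] = 1 if vote1 > vote0 else 0
    labels = [max(range(n_copies), key=lambda k: rp[k]) for rp in resp]
    return haps, weights, labels
```

Calling `em_cluster(reads, n_copies)` on a list of equal-length allele vectors returns the inferred copy haplotypes, their mixing weights, and a hard copy label per read. Because the error rate enters the read likelihood directly, a read with an isolated mismatch is still drawn toward its closest copy rather than forming a spurious one. A technology-specific error model, as the abstract describes, would replace the single `error_rate` with per-base or per-position probabilities.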