W512 Assembly of Huge Plant Genomes From WGS Data

Date: Saturday, January 14, 2012
Time: 9:30 AM
Room: Esquire
Aleksey Zimin , University of Maryland, College Park, MD
Guillaume Marcais , University of Maryland, College Park, MD
Michael Roberts , Univeristy of Maryland, College Park, MD
Steven L Salzberg , Johh Hopkins University School of Medicine, Baltimore, MD
James A Yorke , University of Maryland, Colluge Park, MD
Until recently, most genome assembly used reads as the basic building blocks, computing all overlaps between reads as one of the primary steps in the algorithm. This approach can require very large memory and excessive computing resources when the number of reads is extremely high, as it is for next-generation sequencing (NGS) projects with high depth of coverage, in which several billion reads might be generated. To overcome this problem, newer assemblers for NGS data use algorithms based on the de Bruijn graph, whose nodes comprise all the k-mers (k consecutive bases) in the reads. The assembly algorithm then constructs contigs by finding unique Eulerian tours in the graph, which can be done in linear time. Our approach is a hybrid algorithm, the first to use both the de Bruijn graph and the overlap strategy. After error-correcting the reads, we use a de Bruijn graph to extend each read in both directions, creating a “super-read”. Usually many reads will extend into the same super-read, yielding very considerable data reduction. Since all reads are contained in super-reads, no information is lost. We assemble the super-reads using an overlap-based algorithm, an updated version of the Celera Assembler. We call this combination the Maryland Super-Reads Celera Assembler (MSR-CA). It is available as open-source code at http://ftp.genome.umd.edu/pub/MSR-CA/. MSR-CA is also able to assemble mixtures of Illumina, 454, and Sanger data, making it one of the only assemblers capable of handling such mixtures.