P0990 Extensive Analysis of Parameter Tuning Effects in a de novo Assembly Pipeline Based on Short Reads

Hachiya Tsuyoshi, Keio University, Yokohama, Japan
Sakakibara Yasubumi, Keio University, Yokohama, Japan
Recent studies have shown that short read sequences generated by massively parallel sequencing technologies (e.g., Illumina Genome Analyzer / HiSeq2000) can be assembled into sufficiently long contigs and scaffolds. Since the publication of the panda genome paper in January 2010, dozens of plant and animal genomes have been assembled de novo based mainly on short read sequences. De novo assembly based on short reads is now becoming a common strategy for unraveling the biology of non-model organisms, and a routine bioinformatics analysis that can be performed in mid-sized laboratories as well as in large institutes. An important issue in de novo assembly of large plant and animal genomes from short reads is that the assembly pipeline has a number of parameters that can significantly affect the assembled results. Because the empirical and theoretical effects of these parameters on assembly accuracy have not been fully examined, extensive analyses of parameter effects are needed before efficient parameter tuning strategies can be discussed. Here, we constructed a de novo genome assembly pipeline for large plant and animal genomes using the SOAPdenovo assembler, and extensively examined the effects of the parameters in our pipeline using two publicly available datasets (potato as an example of a large plant genome, and naked mole rat as an example of a large animal genome). The parameters examined in our empirical analysis include read trimming/clipping parameters, k-mer length, bubble merge level, scaffolding parameters, and gap closing parameters. For each parameter setting, we evaluated assembly accuracy using the Feature-Response Curve and N50/N90 statistics.
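As an illustration of the N50/N90 statistics mentioned above, the following is a minimal Python sketch using the common definition: Nx is the length of the shortest sequence in the smallest set of longest contigs (or scaffolds) that together cover at least x% of the total assembled length. The function name `nx_statistic` and the example contig lengths are hypothetical and are not taken from the pipeline or datasets described here.

```python
from typing import Iterable


def nx_statistic(lengths: Iterable[int], percent: int) -> int:
    """Return the Nx value (e.g. percent=50 gives N50) for a set of
    contig or scaffold lengths.

    Sequences are sorted from longest to shortest and accumulated until
    they cover at least `percent` of the total length; the length of the
    last sequence added is the Nx value.
    """
    if not 0 < percent <= 100:
        raise ValueError("percent must be in the range 1..100")
    ordered = sorted(lengths, reverse=True)
    total = sum(ordered)
    if total <= 0:
        raise ValueError("at least one sequence with positive length is required")

    running = 0
    for length in ordered:
        running += length
        # Integer comparison avoids floating-point rounding at the threshold.
        if running * 100 >= total * percent:
            return length
    return ordered[-1]  # not reached for valid input; kept for completeness


# Hypothetical contig lengths (in bp) for illustration only.
example_lengths = [80, 70, 50, 40, 30, 20, 10]
n50 = nx_statistic(example_lengths, 50)
n90 = nx_statistic(example_lengths, 90)
print(f"N50 = {n50}, N90 = {n90}")
```

Computing the same statistic for each parameter setting (for example, each k-mer length) gives one simple way to compare assemblies by contiguity. N50/N90 alone do not reflect misassemblies, which is why a complementary measure such as the Feature-Response Curve is used alongside them.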