P0981 A Perl script for targeted local genome assembly

Charles Crane, Purdue University, West Lafayette, IN
Subhashree Subramanyam, Purdue University, West Lafayette, IN
Jill A. Nemacheck, USDA-ARS Crop Production and Pest Control Research Unit, West Lafayette, IN
Christie Williams, Purdue University, West Lafayette, IN
Whenever a finished genome is unavailable, the characterization of gene families, promoters, and enhancers would benefit from a program for de novo assembly around a user-supplied initial sequence. The iterative script described here uses BLAST and phrap for this purpose. At each cycle, the script identifies matching reads with BLAST, retrieves the low-copy reads that hit the contigs retained from the previous cycle, and assembles those reads with phrap. Cycles continue until the number of new reads to be added falls below a user-specified fraction of the total reads retrieved, or until phrap fails to assemble the reads. The initial sequence can be protein or nucleotide, but subsequent searches use blastn against the contigs from the previous cycle that best match the initial sequence. Thus, contigs “grow” until they encounter repetitive sequence or insufficient depth of coverage in the reads database. The script was tested with four DNA sequences encoding a putative dirigent-like protein (HfrDir) from wheat, using pyrosequencing reads from cerealsdb (www.cerealsdb.uk.net). It assembled 30 contigs that matched at least one of the initial sequences at an e-value < 1e-12 and ranged from 598 to 7395 bases in length. The contigs obtained did not precisely match the number or length of dirigent-positive contigs in the cerealsdb draft assembly, and thus offer a different view of the dirigent-like gene family. From five of the contigs, 14 contig-specific primer pairs were used for PCR on Chinese Spring wheat; all produced single amplicons, and all but two primers yielded Sanger sequence. The Sanger sequences differed from the assembled contigs by occasional SNPs and one 46-base deletion, and these differences are under further investigation. Nevertheless, the assembled sequences appear to be sufficiently accurate to direct further investigation of a gene and its adjacent environment from any collection of reads having sufficient length and coverage.
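
The cycle structure and the two stopping rules described above can be illustrated with a short sketch. It is written in Python for illustration and is not the authors' Perl script. The function name, the parameters (min_new_fraction, max_cycles), and the two callables find_reads and assemble are hypothetical stand-ins for the read search and the assembly step; no BLAST or phrap command line is shown or implied. The low-copy filter and the selection of contigs that best match the initial sequence are assumed to live inside those callables, and the point in the cycle at which the new-read fraction is checked is also an assumption.

```python
from typing import Callable, Iterable, List, Set


def iterative_local_assembly(
    seed_contigs: List[str],
    find_reads: Callable[[List[str]], Iterable[str]],
    assemble: Callable[[Set[str]], List[str]],
    min_new_fraction: float = 0.05,  # hypothetical name for the user-specified fraction
    max_cycles: int = 50,            # hypothetical safety cap, not from the abstract
) -> List[str]:
    """Grow contigs around seed sequences by repeated search and assembly."""
    contigs = list(seed_contigs)
    recruited: Set[str] = set()  # every read retrieved so far

    for _ in range(max_cycles):
        # Search step: reads that hit the contigs kept from the previous cycle.
        hits = set(find_reads(contigs))
        new_reads = hits - recruited
        if not new_reads:
            break  # nothing left to add
        recruited |= new_reads

        # Stop rule 1: new reads are too small a share of all reads retrieved.
        if len(new_reads) / len(recruited) < min_new_fraction:
            break

        # Assembly step; an empty result is treated as an assembler failure.
        new_contigs = assemble(recruited)
        if not new_contigs:
            break  # Stop rule 2: assembly failed, keep the previous contigs
        contigs = new_contigs

    return contigs
```

In this sketch the contigs from the last successful assembly are returned, so a failed or skipped final cycle does not discard earlier progress. Passing the search and assembly steps as callables keeps the loop testable without external programs.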