P0239 A pipeline for transcriptome assembly and SNP identification in highly heterozygous crops

Tom Ruttink, Institute for Agricultural and Fisheries Research (ILVO), Melle, Belgium
Lieven Sterck, VIB / Ghent University, Bioinformatics and Systems Biology, Gent, Belgium
Isabel Roldan-Ruiz, ILVO, Melle, Belgium
The genomes of obligate cross-pollinating crops are characterised by a high degree of heterozygosity and heterogeneity within populations and cultivars. This diversity can be exploited using next-generation sequencing (NGS) to develop molecular markers for linkage map construction, association genetics or genomic selection. We have developed a bioinformatics pipeline for transcriptome assembly and annotation in crop species for which no genome sequence is available. As proof of principle, we reconstructed a Lolium perenne reference transcriptome from Illumina RNA-seq data, using the predicted protein set of Brachypodium as guidance for clustering and annotation. The procedure has four steps: (1) de novo transcript assembly in 14 L. perenne genotypes; (2) grouping of the contigs from the 14 genotypes based on BLAST searches against all Brachypodium proteins; (3) clustering of each group of contigs using CAP3; and (4) filtering of the CAP3 contigs using orthology criteria for each Brachypodium protein, which also annotates the L. perenne transcriptome. This procedure yields an annotated L. perenne reference transcriptome that represents homologs of 72% of the 26,552 predicted Brachypodium proteins. Of these, about 8,000 contigs represent full-length transcripts, showing that the procedure resolves the high level of contig fragmentation that typically occurs during de novo assembly of highly polymorphic species. The L. perenne transcriptome database is 'synteny anchored' to the Brachypodium reference genome and facilitates candidate gene discovery. Read mapping and SNP analysis revealed a total of 400,000 SNPs in about 18,000 genes, with an average SNP density of 1 SNP per 50 bp.
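The grouping step (contigs from all genotypes binned by their Brachypodium protein hit) can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes BLAST results in the standard 12-column tabular format, assigns each contig to the protein of its highest-bitscore hit, and applies an E-value cutoff of 1e-5; the best-hit rule, the cutoff, the function name and all sequence identifiers in the demo rows are assumptions made for illustration only.

```python
from collections import defaultdict


def group_contigs_by_best_hit(blast_lines, max_evalue=1e-5):
    """Bin contigs by the reference protein of their best BLAST hit.

    Assumes 12-column tabular BLAST output: query id in column 1,
    subject id in column 2, E-value in column 11, bitscore in column 12.
    The best-hit rule and the E-value cutoff are illustrative assumptions.
    """
    best = {}  # contig id -> (bitscore, protein id)
    for line in blast_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        contig, protein = fields[0], fields[1]
        evalue, bitscore = float(fields[10]), float(fields[11])
        if evalue > max_evalue:
            continue  # discard weak hits
        if contig not in best or bitscore > best[contig][0]:
            best[contig] = (bitscore, protein)

    groups = defaultdict(list)
    for contig, (_, protein) in best.items():
        groups[protein].append(contig)
    # Each group would then be passed to a clustering step such as CAP3.
    return {protein: sorted(contigs) for protein, contigs in groups.items()}


# Hypothetical hits: contig ids carry a genotype prefix (g01, g02, ...).
demo_rows = [
    "g01_c1\tBradi1g00200.1\t85.0\t200\t30\t0\t1\t600\t1\t200\t1e-50\t300",
    "g02_c7\tBradi1g00200.1\t82.0\t180\t32\t0\t1\t540\t5\t184\t1e-40\t250",
    "g01_c1\tBradi2g11111.1\t40.0\t90\t54\t0\t1\t270\t1\t90\t1e-10\t80",
    "g03_c2\tBradi2g11111.1\t30.0\t50\t35\t0\t1\t150\t1\t50\t0.5\t20",
]
groups = group_contigs_by_best_hit(demo_rows)
```

In this sketch, contigs from different genotypes that hit the same protein end up in one group, which is what allows fragmented, polymorphic contigs of the same gene to be clustered together in the next step.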