W422 The TriAnnot Automated Annotation Pipeline: Making Sense of the Output Files and Information - a Case Study

Date: Tuesday, January 17, 2012
Time: 3:45 PM
Room: Golden Ballroom
Nicolas Guilhot, INRA GDEC, Clermont-Ferrand, France
Philippe Leroy, INRA GDEC, Clermont-Ferrand, France
Sébastien Theil, INRA GDEC, Clermont-Ferrand, France
Frederic Choulet, INRA GDEC, Clermont-Ferrand, France
Sébastien Reboux, INRA - URGI, Research Unit Genomic-Info, Versailles, France
Michael Alaux, INRA - URGI, Research Unit Genomic-Info, Versailles, France
Matthieu Reichstadt, UMR1019 Unité de Recherche en Nutrition Humaine, Institut National de la Recherche Agronomique, Saint-Genès-Champanelle, France
Hadi Quesneville, INRA - URGI, Research Unit Genomic-Info, Versailles, France
Catherine Feuillet, INRA GDEC, Clermont-Ferrand, France
Genome annotation is one of the most difficult tasks in genome sequencing projects, but it is essential for connecting genome sequence to biology. With the advent of next-generation sequencing technologies, new genomes are being sequenced faster than they can be fully and correctly annotated. To manage the large amount of data generated by sequencing projects on genomes larger than 1 Gb, sequence annotation needs to be automated. To achieve a systematic and comprehensive annotation of the bread wheat genome sequence, a parallelized automated pipeline called TriAnnot (http://www.clermont.inra.fr/triannot) has been developed under the umbrella of the IWGSC (http://www.wheatgenome.org) and installed on a cluster of 712 cores (60 TB, 8.5 Tflops). The goals of TriAnnot are to provide the international scientific community with a user-friendly online interface for the analysis of single BACs or BAC contigs, and to facilitate large-scale analyses such as the annotation of the ~1 Gb wheat chromosome 3B sequence. The modular architecture of the TriAnnot pipeline supports the annotation of repeats and transposable elements (TEs), the structural and functional annotation of protein-coding genes, and the identification of RNA-coding genes and other biological features. The pipeline uses 73 databanks and 21 bioinformatics programs. EMBL/GFF output files can be displayed with GBrowse, Artemis and GenomeView to support further manual expert curation. The pipeline can be adapted to annotate other plant species. We will explain how to use TriAnnot for small-scale (web interface) or large-scale (Unix command line) analyses, and describe the different output files and their use through wheat case studies.