Eukaryotic genome annotation pipeline

Thibaud-Nissen, Francoise

The NCBI Eukaryotic annotation pipeline provides content for various NCBI resources, including sequence and BLAST databases, Gene, and the MapViewer genome browser. In recent years, the pipeline has been modernized to run efficiently with minimal human involvement. In the first 10 months of 2011 alone, the genomes of 22 organisms were annotated. The pipeline uses a modular framework to execute all annotation tasks, from fetching raw and curated data from public repositories (sequence and Assembly databases), through sequence alignment and gene prediction, to submitting the accessioned annotation products to public databases. Core components of the pipeline are the alignment programs Splign and ProSplign and the HMM-based gene prediction program Gnomon, developed at NCBI. Important features of the pipeline include its flexibility and speed, the tracking of gene loci from one annotation to the next, the ability to annotate multiple assemblies of the same organism in coordination, the different weights given to curated and non-curated evidence, and the production of models that compensate for assembly issues. We will describe the annotation pipeline dataflow and inputs, including how we use 454 RNA sequence data and our ongoing efforts to use shorter RNA-seq data, as well as quality assessment and annotation products. We will present the NCBI priorities and interests regarding annotation and describe how the integration of annotation and RefSeq curation provides a current, maintained, high-quality annotation product.

W456 Eukaryotic genome annotation pipeline