W196 Proteogenomics: Discovering Novel Genes Using Mass Spectrometry

Date: Tuesday, January 17, 2012
Time: 4:50 PM
Room: Royal Palm Salon 1,2,3
Natalie Castellana, University of California San Diego, La Jolla, CA
The identification of all protein-coding elements in a genome is a fundamental goal of gene annotation. Computational gene predictions remain inaccurate, even with extrinsic information from cDNA and EST libraries or related genomes. Tandem mass spectrometry (MS/MS) has become the dominant method for identifying and quantifying proteins. In this work, we use MS/MS to identify novel protein-coding genes and to improve the genome annotation of Zea mays. In a recent study, we collected 21 million MS/MS spectra from Arabidopsis thaliana, drawn from multiple organs and multiple fractionation techniques. We developed a pipeline that integrates the MS/MS data with other evidence, and used it to correct 339 gene models in the TAIR8 gene annotation. The core of our method relies on two expanded databases of putative proteins: the linear six-frame translation of the genome and a compact graph representation of all putative spliced-exon pairs generated by AUGUSTUS. Searching spectra against these databases enables us to discover novel coding sequences and novel splice sites. Our pipeline is fully automated and has been extended to perform gene annotation on any organism, including the recently sequenced Zea mays, whose genome is comparable in size to the human genome.
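
To illustrate the six-frame translation mentioned in the abstract, here is a minimal Python sketch. It is not the authors' pipeline: the function names (reverse_complement, translate, six_frame_translation, peptide_hits) are hypothetical, the standard genetic code is assumed, and a plain substring match stands in for a real MS/MS database search, which scores spectra against candidate peptides.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(seq):
    """Translate complete codons only; unknown codons (e.g. with N) become 'X'."""
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


def six_frame_translation(dna):
    """Translate three reading frames on each strand (hypothetical helper)."""
    dna = dna.upper()
    frames = {}
    for strand, s in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(s[offset:])
    return frames


def peptide_hits(peptide, frames):
    """Names of frames whose translation contains the peptide.

    Assumption: exact substring matching replaces real spectrum scoring.
    """
    return [name for name, protein in frames.items() if peptide in protein]


if __name__ == "__main__":
    frames = six_frame_translation("ATGGCCAAATAG")
    for name, protein in frames.items():
        print(name, protein)
    print(peptide_hits("MAK", frames))
```

A peptide that matches one of these frames but no annotated protein is the kind of evidence that can point to a novel coding sequence. Peptides spanning an intron would not appear in any linear frame, which is why the abstract pairs the six-frame translation with a graph of putative spliced-exon pairs.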