Genome-scale Protein Function Prediction using Phylogenomics, Data Integration and Lexical Scoring, applied on the genomes of tomato (Solanum lycopersicum) and the leguminous plant Medicago truncatula

Hallab, Asis

Protein function is often transferred from characterized proteins to novel proteins on the basis of sequence similarity, e.g. by using the best BLAST hit. Building on the phylogenomic tool SIFTER (1), we use a statistical inference algorithm to propagate annotations such as Gene Ontology (GO) terms within a phylogenetic tree, taking into account branch lengths, the evidence codes of GO annotations and whether a node represents a speciation or a duplication. Including additional information such as InterPro domains improves the predictions. The approach allows us to integrate multiple data types in a consistent framework. To generate accurate phylogenetic trees that contain a maximum of functional information at reasonable computational cost, we implemented a reusable workflow that, for a given input protein, searches for candidate orthologs with known functions, adds paralogs so that duplications can be detected reliably, and builds a phylogenetic tree from a filtered multiple alignment. This tree is then passed to the inference algorithm, which outputs, for each protein in the tree, the probability of each GO term occurring in the tree. We call this new phylogenomic workflow for protein function prediction PhyloFun. We integrated it into AFAWE, a tool for the automatic and manual annotation of genes, and used it in the Medicago truncatula and tomato genome projects. AFAWE provides a web interface for functional information, displaying PhyloFun results alongside InterPro and BLAST results (see bioinfo.mpipz.mpg.de). These results are integrated, which allows, e.g., highlighting BLAST hits that contain the same InterPro domains as the query or that have experimentally verified GO annotations.
To assign human-readable descriptions to predicted proteins, we developed a new program called Automatic assignment of Human Readable Descriptions (AHRD). We aim to select descriptions that are concise and informative, precise with regard to function, and that use standard nomenclature. AHRD scores BLAST hits from searches against different databases on the basis of the trust placed in each database and the local alignment quality. The BLAST hit descriptions are tokenized into informative words, and a lexical analysis scores these tokens according to their frequency and the quality of the BLAST hits in which they occur. Tokens shared with Gene Ontology annotations increase a description's score, so that standard nomenclature is used where possible. Finally, the best-scoring description is assigned and displayed in AFAWE.
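
The lexical scoring idea can be illustrated with a minimal Python sketch. It is not the AHRD implementation: the word blacklist, the database weights, the use of the bit score as the alignment quality, the GO bonus and the way the terms are combined are all assumptions chosen for illustration, and the names Hit, tokenize, hit_quality and best_description are hypothetical.

```python
import re
from dataclasses import dataclass

# Hypothetical blacklist of uninformative words (an assumption, not AHRD's list).
UNINFORMATIVE = {"protein", "putative", "predicted", "hypothetical", "like", "family"}


@dataclass
class Hit:
    description: str   # description line of the BLAST hit
    db_weight: float   # assumed trust placed in the source database (0..1)
    bit_score: float   # stands in for the local alignment quality


def tokenize(description):
    """Split a description into lower-case informative words."""
    words = re.findall(r"[a-z0-9]+", description.lower())
    return {w for w in words if w not in UNINFORMATIVE}


def hit_quality(hit, max_bit_score):
    """Database trust times the bit score relative to the best hit."""
    return hit.db_weight * hit.bit_score / max_bit_score


def best_description(hits, go_tokens=frozenset(), go_bonus=0.5):
    """Return the description with the highest combined score, or None."""
    if not hits:
        return None
    max_bits = max(h.bit_score for h in hits) or 1.0

    # A token gains weight each time it occurs, scaled by the quality of
    # the hit it occurs in (frequency and hit quality combined).
    token_scores = {}
    for h in hits:
        quality = hit_quality(h, max_bits)
        for token in tokenize(h.description):
            token_scores[token] = token_scores.get(token, 0.0) + quality

    def score(h):
        tokens = tokenize(h.description)
        if not tokens:
            return 0.0
        lexical = sum(token_scores[t] for t in tokens) / len(tokens)
        # Fraction of tokens shared with GO annotation terms (assumed bonus).
        shared = len(tokens & go_tokens) / len(tokens)
        return lexical + go_bonus * shared + hit_quality(h, max_bits)

    return max(hits, key=score).description


hits = [
    Hit("Putative uncharacterized protein", db_weight=0.3, bit_score=250.0),
    Hit("Sucrose synthase 2", db_weight=1.0, bit_score=240.0),
    Hit("Sucrose synthase", db_weight=0.6, bit_score=230.0),
]
print(best_description(hits, go_tokens={"sucrose", "synthase"}))
```

In this sketch, a description from a highly trusted database whose words recur in other good hits outranks an uninformative description, even when the latter belongs to the hit with the highest bit score.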

(1) B. E. Engelhardt et al. 2005. PLoS Computational Biology 1(5):e45

P1001