From Hosts to Symbionts: Integrated Gene Prediction for Prokaryotic Genomes using EuGene

Schiex, Thomas

With the advent of next-generation sequencing (NGS), the annotation of new prokaryotic (symbiont) genomic sequences increasingly occurs in a data-rich context that includes a variety of libraries of short transcriptomic reads. This creates new opportunities in annotation. In this talk, I will introduce the new prokaryotic variant of EuGene, the integrative gene prediction software used for plant annotation. By leveraging RNA-Seq data, EuGene becomes capable of predicting new functional structures, including transcribed but untranslated regions such as non-protein-coding RNA genes.
Following the initial development of gene prediction tools for prokaryotic genomes, the complexity of eukaryotic gene prediction led to the development of highly integrative gene prediction tools. Very few, if any, prokaryotic gene prediction tools have evolved along the same line, mostly because prokaryotic protein gene structures are simple and defined by open reading frames. Through RNA-Seq data, NGS gives unprecedented access to the actual transcriptome. The technology can produce oriented reads, for which the strand of transcription is known. Such data enable the automatic prediction of a variety of transcribed elements, including protein genes but also (possibly antisense) ncRNA genes.
EuGene is a eukaryotic gene finder that can be described as a Conditional Random Field predictor. Its default gene model includes intergenic regions, coding exons, introns, 5’/3’ untranslated terminal regions (UTRs) and introns within UTRs. To predict new functional elements in prokaryotes, the gene model underlying EuGene has been modified to capture untranslated regions and overlapping genes, and to integrate information from prokaryotic gene finders, such as RBS/ribosomal RNA hybridization energy. Oriented RNA-Seq data can be integrated into EuGene either directly or after a segmentation based on the level of transcription.
In the simplest variant, partial transcripts defined by oriented short reads are mapped to the genome. Their abundance at a given position indicates that the region is transcribed on the corresponding strand; sudden changes in expression level in one or several conditions also indicate possible transcription start sites. By integrating translation/transcription start and stop prediction, statistical models of different regions (especially coding regions) and RNA-Seq data inside a single tool, EuGene becomes capable of discriminating protein genes (which are transcribed and follow a coding region statistical model) from ncRNA genes (which are transcribed but do not follow a coding model). The resulting gene finder has been used to annotate the symbiont Sinorhizobium meliloti using 48 Gb of oriented RNA-Seq data obtained by sequencing seven libraries. The results we have obtained closely match the existing expert genome annotation but also contain ribosomal and transfer RNA genes and many potentially new RNA genes.
References
1. T. Schiex, A. Moisan, P. Rouzé. Eugène: an eukaryotic gene finder that combines several sources of evidence. In Selected papers from JOBIM’2000, volume 2066 of LNCS, pages 118–133. Springer Verlag, 2001.
2. S. Foissac et al. Genome annotation in plants and fungi: EuGene as a model platform. Current Bioinformatics, 3(2):87–97, 2008.
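The coverage-based signals described in the abstract (strand-specific read abundance, segmentation by transcription level, and sudden changes in expression) can be illustrated with a minimal Python sketch. This is not EuGene's implementation: the function names, the read representation as `(start, end, strand)` tuples and the `min_depth` and `min_jump` thresholds are all assumptions made for illustration.

```python
from typing import List, Tuple

# Hypothetical read representation: (start, end, strand),
# 0-based coordinates, end exclusive, strand is "+" or "-".
Read = Tuple[int, int, str]


def strand_coverage(reads: List[Read], genome_length: int, strand: str) -> List[int]:
    """Per-base read depth on one strand, computed with a difference array."""
    diff = [0] * (genome_length + 1)
    for start, end, read_strand in reads:
        if read_strand == strand:
            diff[start] += 1
            diff[end] -= 1
    coverage, depth = [], 0
    for delta in diff[:genome_length]:
        depth += delta
        coverage.append(depth)
    return coverage


def transcribed_segments(coverage: List[int], min_depth: int) -> List[Tuple[int, int]]:
    """Maximal runs [start, end) whose depth is at least min_depth."""
    segments, start = [], None
    for pos, depth in enumerate(coverage):
        if depth >= min_depth and start is None:
            start = pos
        elif depth < min_depth and start is not None:
            segments.append((start, pos))
            start = None
    if start is not None:
        segments.append((start, len(coverage)))
    return segments


def coverage_jumps(coverage: List[int], min_jump: int) -> List[int]:
    """Positions where depth rises by at least min_jump over the previous base.

    On the "+" strand such rises are candidate transcription start sites.
    On the "-" strand the equivalent signal is a drop when scanning left
    to right, which this simplified sketch does not handle.
    """
    jumps, previous = [], 0
    for pos, depth in enumerate(coverage):
        if depth - previous >= min_jump:
            jumps.append(pos)
        previous = depth
    return jumps


# Toy data: three "+" reads and one "-" read on a 10-base genome.
reads = [(2, 6, "+"), (2, 7, "+"), (3, 8, "+"), (4, 9, "-")]
plus = strand_coverage(reads, 10, "+")
minus = strand_coverage(reads, 10, "-")
plus_segments = transcribed_segments(plus, min_depth=2)
plus_jumps = coverage_jumps(plus, min_jump=2)
```

In a real setting the reads would come from alignments of several libraries, and the thresholds would have to be calibrated per condition. The sketch only shows why oriented reads matter: each strand gets its own coverage profile, so antisense transcription is not masked by the opposite strand.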

W193 From Hosts to Symbionts: Integrated Gene Prediction for Prokaryotic Genomes using EuGene