Ab initio Gene Finding in Compositionally Heterogeneous Eukaryotic Genomes

Lomsadze, Alexandre

Ab initio gene finding in compositionally heterogeneous genomes, such as those of rice, Brachypodium, and honey bee, presents significant difficulties. Still, the advent of next-generation sequencing technology, which has led to a sharp increase in the number of sequenced eukaryotic genomes, makes ab initio algorithms, and especially ab initio algorithms with self-training, very valuable tools for genome annotation. Therefore, the development of algorithms that are robust to different types of genome inhomogeneity is an important goal. We present a new algorithm for ab initio gene identification in compositionally heterogeneous genomes. The earlier developed GeneMark-ES algorithm employs unsupervised training for a generalized hidden Markov model (HMM) with 55 states, a model of a compositionally homogeneous eukaryotic genome. This HMM was extended to include groups of hidden states for compositionally different exon-intron structures and intergenic sequences. Species-specific parameters of the algorithm can be estimated from a sufficiently long set of unannotated sequences by an unsupervised training algorithm that uses the same extended HMM and is implemented in the software program GeneMark-ES 3.0. Notably, if curated data sets (sequences with externally validated annotation) are available, the model parameters can also be derived from these training sets. We demonstrate improved gene recognition accuracy when the new algorithm is applied to several novel eukaryotic genomes.
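
The idea of giving an HMM separate hidden states for compositionally different regions can be illustrated with a toy sketch. The Python code below is not GeneMark-ES and does not reproduce the 55-state generalized HMM or its extension; it is a minimal two-state model, assuming hypothetical state names (AT_rich, GC_rich) and made-up emission and transition probabilities, decoded with the standard Viterbi algorithm to segment a sequence by composition.

```python
import math

# Toy illustration only: two hypothetical compositional classes.
STATES = ("AT_rich", "GC_rich")

# Assumed (made-up) per-nucleotide emission probabilities for each class.
EMIT = {
    "AT_rich": {"A": 0.35, "T": 0.35, "G": 0.15, "C": 0.15},
    "GC_rich": {"A": 0.15, "T": 0.15, "G": 0.35, "C": 0.35},
}

STAY = 0.9                           # assumed probability of staying in a class
START = {s: 0.5 for s in STATES}     # assumed uniform start distribution


def trans(a, b):
    """Transition probability from state a to state b."""
    return STAY if a == b else 1.0 - STAY


def viterbi(seq):
    """Return the most probable state path for seq (log-space Viterbi)."""
    if not seq:
        return []
    # Log-probability of the best path ending in each state at position 0.
    score = {s: math.log(START[s]) + math.log(EMIT[s][seq[0]]) for s in STATES}
    back = []  # back[i][s] = best predecessor of state s at position i + 1
    for ch in seq[1:]:
        new_score, ptr = {}, {}
        for s in STATES:
            prev = max(STATES, key=lambda p: score[p] + math.log(trans(p, s)))
            new_score[s] = (score[prev] + math.log(trans(prev, s))
                            + math.log(EMIT[s][ch]))
            ptr[s] = prev
        score = new_score
        back.append(ptr)
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: score[s])
    path = [state]
    for ptr in reversed(back):
        state = ptr[state]
        path.append(state)
    path.reverse()
    return path


if __name__ == "__main__":
    labels = viterbi("ATATATATATAT" + "GCGCGCGCGCGC")
    print(labels)
```

In a gene finder, each compositional class would carry its own full set of exon, intron, and intergenic states rather than a single state, and the parameters would be estimated by training rather than fixed by hand as they are here.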

P1012 Ab initio Gene Finding in Compositionally Heterogeneous Eukaryotic Genomes