Machine learning approaches to HMM parameter estimation have produced highly successful ab initio gene prediction tools for small and medium-sized eukaryotic genomes [1,2]. In recent years, falling sequencing costs have made it possible to generate large quantities of transcript sequences (ESTs and RNA-seq reads) alongside the genomic sequence in new genome sequencing projects. We present a new algorithm and computational pipeline for “hybrid” gene prediction, which combines a machine learning technique, iterative unsupervised HMM parameter estimation, with restrictions imposed by external information in the form of EST and RNA-seq sequences mapped to the genome. A novel blending mechanism combines transcript-derived models with models derived by unsupervised training. The blending parameters are chosen to maximize a performance measure over a development set, which amounts to a semi-discriminative training approach. We assessed gene prediction accuracy on test sets for several novel eukaryotic genomes and found that the new pipeline outperforms GeneMark-ES.
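As a rough illustration of the blending idea described above, the sketch below interpolates two probability tables and picks the interpolation weight by grid search on a development set. The abstract does not give the blending formula, so the linear interpolation, the dictionary representation of a model, and the names `blend`, `choose_weight` and `score_on_dev` are all assumptions made for this example and are not the pipeline's actual implementation.

```python
from typing import Callable, Dict, Sequence

# Hypothetical representation: a model maps a symbol (e.g. a codon) to a probability.
Model = Dict[str, float]


def blend(transcript_model: Model, unsupervised_model: Model, weight: float) -> Model:
    """Linearly interpolate two probability tables (assumed form, not from the source)."""
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must be in [0, 1]")
    keys = set(transcript_model) | set(unsupervised_model)
    return {
        k: weight * transcript_model.get(k, 0.0)
        + (1.0 - weight) * unsupervised_model.get(k, 0.0)
        for k in keys
    }


def choose_weight(
    transcript_model: Model,
    unsupervised_model: Model,
    score_on_dev: Callable[[Model], float],
    candidates: Sequence[float],
) -> float:
    """Grid search: return the candidate weight whose blended model scores
    highest under the caller-supplied development-set measure."""
    return max(
        candidates,
        key=lambda w: score_on_dev(blend(transcript_model, unsupervised_model, w)),
    )
```

Here `score_on_dev` stands in for whatever performance measure is evaluated on the development set, for example an accuracy figure computed by running the gene finder with the blended model; a weight of 1.0 would use only the transcript-derived model and 0.0 only the unsupervised one.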
1) Lomsadze A., Ter-Hovhannisyan V., Chernoff Y. and Borodovsky M. (2005). "Gene identification in novel eukaryotic genomes by self-training algorithm." Nucleic Acids Research 33: 6494-6506.
2) Ter-Hovhannisyan V., Lomsadze A., Chernoff Y. and Borodovsky M. (2008). "Gene prediction in novel fungal genomes using an ab initio algorithm with unsupervised training." Genome Research 18(12): 1979-1990.