Gene Prediction by Hybrid ab initio and Transcript Sequence Mapping Approaches

Burns, Paul D.

Machine learning approaches to HMM parameter estimation have produced highly successful ab initio gene prediction tools for small and medium-sized eukaryotic genomes [1,2] (Lomsadze et al., 2005; Ter-Hovhannisyan et al., 2008). In recent years, falling sequencing costs have allowed new genome projects to produce large quantities of transcript sequences (EST and RNA-seq) alongside the genomic sequence itself. We present a new algorithm and computational pipeline for “hybrid” gene prediction, which combines a machine learning technique (iterative unsupervised HMM parameter estimation) with restrictions imposed by external information in the form of EST and RNA-seq sequences mapped to the genome. A novel blending mechanism combines transcript-derived models with models derived by unsupervised training. The blending parameters are chosen to maximize a performance measure over a development set, following a semi-discriminative training approach. We assessed gene prediction accuracy against test sets for several novel eukaryotic genomes and found that the new pipeline outperforms GeneMark-ES.
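
The abstract does not specify the form of the blending mechanism. As a minimal sketch only, assuming a single weight w that linearly interpolates two probability tables and is selected by grid search on a development set, the idea could look like the Python below. The names blend, choose_weight, and score are hypothetical and are not taken from the pipeline.

```python
from typing import Callable, Dict, Sequence

# Hypothetical model representation: a table of probabilities keyed by symbol.
Model = Dict[str, float]


def blend(transcript: Model, unsupervised: Model, w: float) -> Model:
    """Linear interpolation (an assumption): w * transcript + (1 - w) * unsupervised."""
    keys = set(transcript) | set(unsupervised)
    return {
        k: w * transcript.get(k, 0.0) + (1.0 - w) * unsupervised.get(k, 0.0)
        for k in keys
    }


def choose_weight(
    transcript: Model,
    unsupervised: Model,
    score: Callable[[Model], float],
    candidates: Sequence[float],
) -> float:
    """Return the candidate weight whose blended model scores highest.

    `score` stands in for a performance measure computed on a development set.
    """
    return max(candidates, key=lambda w: score(blend(transcript, unsupervised, w)))


# Toy usage: the "development set score" here is an artificial stand-in.
transcript_model = {"A": 0.8, "C": 0.2}
unsupervised_model = {"A": 0.4, "C": 0.6}
best_w = choose_weight(
    transcript_model,
    unsupervised_model,
    score=lambda m: -abs(m["A"] - 0.5),
    candidates=[0.0, 0.25, 0.5, 0.75, 1.0],
)
```

In practice the score would be a gene prediction accuracy measure evaluated on annotated development sequences, and there may be more than one blending parameter.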

References

1) Lomsadze A., Ter-Hovhannisyan V., Chernoff Y. and Borodovsky M. (2005). "Gene identification in novel eukaryotic genomes by self-training algorithm." Nucleic Acids Research 33: 6494-6506.

2) Ter-Hovhannisyan V., Lomsadze A., Chernoff Y. and Borodovsky M. (2008). "Gene prediction in novel fungal genomes using an ab initio algorithm with unsupervised training." Genome Research 18(12): 1979-1990.

P1014 Gene Prediction by Hybrid ab initio and Transcript Sequence Mapping Approaches