Unsupervised Training for ab initio Gene Finders

Lomsadze, Alexandre

Rapid, NGS-fueled sequencing of large plant and animal genomes has put unprecedented pressure on genome annotation teams, which struggle to annotate consistently and accurately the biological "Big Data" streaming in at both the DNA and RNA (RNA-seq) levels. Notably, sequences can vary in size from a few dozen nucleotides to megabases. Still, all cases share one question: what kinds of proteins, if any, do the sequences encode? Conventional ab initio algorithms find genes using statistical models with species-specific parameters derived from training sets. Gene finding methods that need no user supervision provide important shortcuts in pipelines of sequence analysis and interpretation; they do not require accumulating experimental or expert knowledge about the novel genomic data (such as the painstaking construction of sets of reliably annotated sequences needed to estimate algorithm parameters). The availability of gene finding methods with unsupervised training has become critically important for timely analysis and annotation of novel genomes. We started developing self-training methods about fifteen years ago in connection with gene finding in complete microbial and viral genomes; this work was later extended to gene finding in eukaryotes as well as in metagenomes. Here we present several algorithms and software tools for finding protein-coding genes in genomic data generated with the help of NGS technologies. The key unifying feature of all the methods is the elimination of the time-consuming step of curated preparation of species-specific training sets.

P1013 Unsupervised Training for ab initio Gene Finders