A New Statistical Method for Gene Discovery From Large-Scale Gene Expression Data with Next-Generation Sequencing Technology

Igarashi, Kaori

High-throughput RNA sequencing with next-generation sequencers (NGS) facilitates the identification of novel genes and their biological functions. The large-scale gene expression data produced by NGS allow genes to be classified in more detail according to their expression profiles. However, current methods for large-scale datasets, such as hierarchical clustering, require long calculation times and large-scale computer systems. To overcome this problem, we have developed a statistical method based on correspondence analysis (CA). Even for large-scale data, it permits quick classification of genes on a personal computer. We used Arabidopsis Illumina sequences from the NCBI SRA to show the theoretical advantage of this method in classifying genes according to their expression profiles. The sequence datasets were generated from samples under high-light, heat, cold, salt, and drought stress and under a non-stress condition (203,735,588 reads of 35 bp in total). To obtain digital gene expression profiles, the number of reads mapped to each gene (TAIR10 cDNAs) was used as its expression level. Classification with the new method showed that around 300 genes were significantly up-regulated under both salt and drought stress. The method thus allows efficient detection of genes related to phenotypes or biological conditions of interest. The calculation for the large-scale expression data matrix presented here takes less than 1 minute on a Windows PC [4 GB memory, Core(TM)2 Duo]. We have also developed a GUI software package, “CA Plot Viewer”, which users can easily apply to their own expression data.
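The abstract does not give the details of the CA-based procedure, so the following is only a minimal sketch of textbook correspondence analysis applied to a genes-by-conditions read-count matrix, not the authors' implementation. The function name, the toy counts, and the requirement that every gene and condition has at least one read are assumptions made for illustration.

```python
import numpy as np


def correspondence_analysis(counts):
    """Textbook CA of a (genes x conditions) count matrix via SVD.

    Returns principal coordinates for rows (genes) and columns
    (conditions), plus the principal inertia of each axis.
    """
    n = np.asarray(counts, dtype=float)
    if (n.sum(axis=1) == 0).any() or (n.sum(axis=0) == 0).any():
        # Assumption: genes or conditions with no reads are filtered beforehand.
        raise ValueError("every row and column needs at least one read")

    p = n / n.sum()            # correspondence matrix (relative frequencies)
    r = p.sum(axis=1)          # row masses (one per gene)
    c = p.sum(axis=0)          # column masses (one per condition)

    # Standardized residuals from the independence model r_i * c_j.
    expected = np.outer(r, c)
    s = (p - expected) / np.sqrt(expected)

    # A thin SVD is enough: the number of axes is bounded by the number of conditions.
    u, sv, vt = np.linalg.svd(s, full_matrices=False)

    row_coords = (u * sv) / np.sqrt(r)[:, None]
    col_coords = (vt.T * sv) / np.sqrt(c)[:, None]
    return row_coords, col_coords, sv ** 2


# Hypothetical toy data: 4 genes x 3 conditions (control, salt, drought).
toy_counts = np.array([
    [100, 110, 95],   # roughly constant expression
    [10, 80, 90],     # higher under salt and drought
    [50, 5, 8],       # lower under salt and drought
    [30, 28, 33],     # roughly constant expression
])

rows, cols, inertia = correspondence_analysis(toy_counts)
print("share of inertia per axis:", inertia / inertia.sum())
print("gene coordinates on the first two axes:")
print(rows[:, :2])
```

Because the decomposition operates on a matrix with only as many columns as there are conditions, its cost grows roughly linearly with the number of genes, which is consistent with the short run time on a desktop PC reported above. Genes that lie far from the origin in the direction of a condition's coordinates are over-represented in that condition relative to the independence model.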

P0985 A New Statistical Method for Gene Discovery From Large-Scale Gene Expression Data with Next-Generation Sequencing Technology