P0936 DDBJ Sequence Read Archive and a cloud-computing based annotation tool for new-generation sequencing data

Hideki Nagasaki, Center for Information Biology and DDBJ, National Institute of Genetics, ROIS, Mishima, Shizuoka, Japan
Takako Mochizuki, Center for Information Biology and DDBJ, National Institute of Genetics, ROIS, Mishima, Shizuoka, Japan
Eli Kaminuma, Center for Information Biology and DDBJ, National Institute of Genetics, ROIS, Mishima, Shizuoka, Japan
Yuichi Kodama, Center for Information Biology and DDBJ, National Institute of Genetics, ROIS, Mishima, Shizuoka, Japan
Satoshi Saruhashi, Center for Information Biology and DDBJ, National Institute of Genetics, ROIS, Mishima, Shizuoka, Japan
Asami Nozaki, Center for Information Biology and DDBJ, National Institute of Genetics, ROIS, Mishima, Shizuoka, Japan
Toshihisa Takagi, Center for Information Biology and DDBJ, National Institute of Genetics, ROIS, Mishima, Shizuoka, Japan
Kousaku Okubo, Center for Information Biology and DDBJ, National Institute of Genetics, ROIS, Mishima, Shizuoka, Japan
Yasukazu Nakamura, Center for Information Biology and DDBJ, National Institute of Genetics, ROIS, Mishima, Shizuoka, Japan
New Generation Sequencing (NGS) is an increasingly important technology in genome and molecular biology research. NGS supports a range of applications, including re-sequencing, de novo genome assembly, and transcriptome analysis. The output of a single NGS run, often described as a "data tsunami", now approaches the terabase scale. This scale contributes to lower running costs, but the sheer volume of data causes various problems. Under the partnership of the International Nucleotide Sequence Database Collaboration (INSDC), which captures, preserves and presents globally comprehensive public-domain nucleotide sequences, the DNA Data Bank of Japan (DDBJ) has released the DDBJ Sequence Read Archive (DRA), an archive database for NGS data. DRA addresses several NGS data problems: it saves local storage by holding the data in a public database, and it simplifies metadata submission through a Flash-based tool named 'MetaDefine'. The growth in data scale also makes data analysis difficult for researchers. To resolve this problem, we developed the DDBJ Read Annotation Pipeline, a cloud computing-based analytical tool for massive sequencing reads that uses DDBJ supercomputer resources. The pipeline consists of two processes: basic analysis, covering genome mapping and de novo assembly, and high-level analysis, covering structural and functional annotations such as single nucleotide polymorphism (SNP) detection and expression tag counts. For the basic analysis functions, we installed popular mapping and assembly tools, including bowtie, SOAPdenovo and others. The high-level analysis is provided through the graphical user interface of Galaxy, a web application for constructing workflows. DRA and the DDBJ pipeline are available at http://trace.ddbj.nig.ac.jp/ and http://p.ddbj.nig.ac.jp/.
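The two-stage structure described above (basic analysis followed by high-level analysis) can be illustrated with a toy sketch. This is not the DDBJ pipeline's implementation: the function names `map_reads` and `count_tags`, the exact-substring matching used in place of a real mapper such as bowtie, and the example sequences are all assumptions made for illustration only.

```python
from collections import Counter


def map_reads(reads, references):
    """Toy stand-in for the basic analysis step (genome mapping).

    Assigns each read to the first reference that contains it as an
    exact substring. Real mappers tolerate mismatches and use indexes;
    this simplification is an assumption of the sketch.
    """
    mapping = {}
    for read_id, seq in reads.items():
        for ref_name, ref_seq in references.items():
            pos = ref_seq.find(seq)
            if pos != -1:
                mapping[read_id] = (ref_name, pos)
                break  # keep the first hit only
    return mapping


def count_tags(mapping):
    """Toy stand-in for a high-level analysis step (expression tag counts).

    Counts how many mapped reads landed on each reference.
    """
    return Counter(ref_name for ref_name, _ in mapping.values())


# Hypothetical example data, not taken from DRA.
references = {"geneA": "ACGTACGTTAGC", "geneB": "TTGACCAGGTCA"}
reads = {"r1": "ACGTAC", "r2": "GTTAGC", "r3": "CCAGGT", "r4": "GGGGGG"}

mapping = map_reads(reads, references)   # basic analysis
tag_counts = count_tags(mapping)         # high-level analysis

for ref_name, count in sorted(tag_counts.items()):
    print(ref_name, count)
```

Keeping the mapping result as an intermediate product mirrors the separation in the abstract: the same mapped reads could feed other high-level steps, such as SNP detection, without repeating the mapping.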