P0930 NCBI SRA Toolkit Technology for Next Generation Sequence Data

Steve Sherry, NIH/NLM/NCBI, Bethesda, MD
Chunlin Xiao, NIH/NLM/NCBI, Bethesda, MD
Eugene Yaschenko, NIH/NLM/NCBI, Bethesda, MD
Kenneth Durbrow, NIH/NLM/NCBI, Bethesda, MD
Michael Kimelman, NIH/NLM/NCBI, Bethesda, MD
Kurt Rodarmer, NIH/NLM/NCBI, Bethesda, MD
Martin Shumway, NIH/NLM/NCBI, Bethesda, MD
James Ostell, NIH/NLM/NCBI, Bethesda, MD

As biology continues down the path of supporting discovery with systematic sequencing efforts based on next generation technologies, NIH has directed NCBI’s Sequence Read Archive (SRA) to continue to serve as the central repository for sequence results emanating from these studies of model organisms, agricultural genomes, pathogens, microbiota and human disease. Since its inception in 2007 the SRA has accumulated over 300 terabases of raw sequencing data from human and non-human sources, and is currently growing by 20 terabases per month. The archive currently occupies 0.6 petabytes of disk storage, and reducing the footprint of both the existing archive and new deposits is of critical importance to NCBI and its users. Using a fully indexed columnar database design, the SRA toolkit has reduced the lossless storage cost of the SRA from 32 bits per base in 2008 to 15 bits per base today, and achieves under 4 bits per base with data from the latest sequencing platforms. Most importantly, SRA now also manages alignment properties of reads such as those delivered with BAM files. The spectrum of compression strategies needs to match the spectrum of popular classes of NGS applications, which include whole genome sequencing, exome resequencing, RNA-Seq, and epigenetics. The SRA team is currently defining these class-matched strategies in terms of compression by reference, policies for the identification and retention of essential data types, and the evaluation of the minimal information content of quality scores. Since the compressibility of data both deposited and retrieved for various NGS application classes can vary by orders of magnitude, new features and a flexible data model are being introduced into the SRA toolkit to balance uniform information requirements for depositors against maximal utility for users.
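The storage figures quoted above can be cross-checked with a short calculation. The sketch below is illustrative only: it assumes decimal units (1 petabyte = 10^15 bytes, 1 terabase = 10^12 bases), and the function names are hypothetical, not part of the SRA toolkit.

```python
# Back-of-the-envelope check of the storage figures in the abstract.
# Assumption: decimal units (1 PB = 1e15 bytes, 1 terabase = 1e12 bases).
PETABYTE = 1e15
TERABYTE = 1e12
TERABASE = 1e12


def bits_per_base(storage_bytes: float, bases: float) -> float:
    """Average number of stored bits per sequenced base."""
    return storage_bytes * 8 / bases


def storage_bytes(bases: float, bits_per_base_rate: float) -> float:
    """Disk space in bytes needed for `bases` at a given bits-per-base rate."""
    return bases * bits_per_base_rate / 8


# 0.6 PB holding 300 terabases.
archive_rate = bits_per_base(0.6 * PETABYTE, 300 * TERABASE)

# Monthly growth of 20 terabases at the two rates quoted in the abstract.
monthly_tb_at_15 = storage_bytes(20 * TERABASE, 15) / TERABYTE
monthly_tb_at_4 = storage_bytes(20 * TERABASE, 4) / TERABYTE

print(f"archive average: {archive_rate:.1f} bits per base")
print(f"monthly growth at 15 bits/base: {monthly_tb_at_15:.1f} TB")
print(f"monthly growth at 4 bits/base: {monthly_tb_at_4:.1f} TB")
```

Under these assumptions the archive averages roughly 16 bits per base, in line with the quoted figure of 15, and 20 terabases of new data per month would need about 37.5 TB at 15 bits per base versus about 10 TB at 4 bits per base.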
The SRA representation and toolkit technology unify these strategies into a single user experience that can support efficient compression, slices of data on read, indexed retrieval, serialization of data for pipeline processing, retention of optional BAM tags, lossy compression, and support for non-standard reference sequences. The SRA toolkit is freely distributed for multiple platforms and in unrestricted source code form (see http://www.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?view=software). The toolkit technology is engineered for indexing, slicing, streaming, and real-time compression and encryption of datasets for optimal retrieval by SRA users. The toolkit can also be deployed at local sites to realize the same services for internal NGS data.
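The general idea behind compression by reference can be illustrated with a toy example. This is a minimal sketch, not the SRA toolkit's actual storage format: it assumes reads align to the reference with substitutions only (no insertions, deletions, or clipping), and the function names `encode_read` and `decode_read` are hypothetical.

```python
# Toy illustration of reference-based read compression.
# Assumptions: substitutions only; this is NOT the SRA on-disk format.
from typing import List, Tuple

Record = Tuple[int, int, List[Tuple[int, str]]]  # (position, length, mismatches)


def encode_read(reference: str, read: str, pos: int) -> Record:
    """Store a read as its alignment start, its length, and only the
    bases that differ from the reference."""
    window = reference[pos:pos + len(read)]
    if len(window) != len(read):
        raise ValueError("read extends past the end of the reference")
    mismatches = [(i, base)
                  for i, (ref_base, base) in enumerate(zip(window, read))
                  if ref_base != base]
    return pos, len(read), mismatches


def decode_read(reference: str, record: Record) -> str:
    """Rebuild the original read from the reference and the stored record."""
    pos, length, mismatches = record
    bases = list(reference[pos:pos + length])
    for offset, base in mismatches:
        bases[offset] = base
    return "".join(bases)


reference = "ACGTACGTACGT"
read = "GTACCTAC"  # aligns at position 2 with one substitution
record = encode_read(reference, read, 2)
restored = decode_read(reference, record)
print(record)
print(restored == read)
```

A read that matches the reference closely is reduced to a position, a length, and a short list of differences, which is why data with high-quality alignments can compress far better than unaligned reads.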