Sequence Read Archive (SRA) Data Working Group

Background

Sustainability of next-generation sequencing data represents a longstanding challenge for the NIH. The Sequence Read Archive (SRA) is the National Center for Biotechnology Information (NCBI) database that stores sequence data obtained from next-generation sequencing (NGS) technology. Released in 2009, the SRA contains 9 million records and 12 petabytes of data. The sequences capture all available information, including metagenomic and environmental sequences, and are contained in prohibitively large datasets.

Through this database, researchers can search the metadata for those sequences to locate the sequence reads for further analyses. The SRA is accessed by 1,500 organizations daily, and 20 percent of its use comes from cloud users accessing it through Google and Amazon Web Services. Both recent and historical datasets are retrieved consistently, although certain datasets are accessed more frequently.

Specifically, the SRA:

  • Archives raw oversampling NGS data for various organisms from several platforms
  • Shares submitted public-access NGS data with the other members of the International Nucleotide Sequence Database Collaboration (INSDC): the European Molecular Biology Laboratory and the DNA Data Bank of Japan
  • Serves as a starting point for “secondary analyses”
  • Provides access to data from human clinical samples to authorized users who agree to the dataset’s privacy and usage mandates

Currently, the SRA supports dual use through a hybrid storage model. Two versions of the data exist: the original (raw) submission, and a normalized (extract, transform, load [ETL]) version. The raw file is necessary for reproducibility, and the ETL file is necessary for discovery (e.g., meta-analysis and search function).

As sequencing capabilities increase, the SRA is growing at a rate that will make it unsustainable to maintain in its current format. Data compression will be necessary to keep the database economically feasible to maintain while maximizing its value to the research community well into the future.

Charge

To address this and other issues, the SRA Data Working Group of the Council of Councils is charged with providing recommendations to the Council on key factors for storing, managing, and accessing SRA data in cloud service provider environments. As the initial priority, the NIH is asking the working group to evaluate and identify solutions that maintain efficiencies in the storage footprint of the SRA, specifically by evaluating the use of Base Quality Scores and format compression strategies.

Over a longer timeframe, the working group may be asked to evaluate other issues, including but not limited to:

  • Analysis of SRA and SRA services
  • Technical recommendations on SRA improvements and efficiencies
  • Recommendations on data retention, data models and/or data usage
  • Vision for future needs or opportunities, as these relate to the SRA

The SRA Data Working Group is currently examining data analyses of the SRA related to access, cost, and usage, as well as other areas. The working group is using these analyses, among other factors and considerations, to evaluate and deliberate data storage options. The group reports to the Council of Councils and will provide findings and draft recommendations on an ongoing basis.

SRA Data Working Group Roster

This page last reviewed on November 26, 2019