Sequence Read Archive (SRA) Data Working Group

Background

The National Center for Biotechnology Information (NCBI), part of the National Library of Medicine, hosts one of NIH's largest and most diverse datasets, the Sequence Read Archive (SRA). The SRA is a broad collection of experimental DNA and RNA sequences that represent genome diversity across the tree of life.

The SRA was moved to Google Cloud Platform (GCP) and Amazon Web Services (AWS) cloud services in 2019 as part of the NIH Science and Technology Research Infrastructure for Discovery, Experimentation, and Sustainability (STRIDES) Initiative. In 2019 the SRA held 9 million records in two formats. The original format (23 petabytes) is received by NCBI from submitters and is instrument- and experiment-specific; these data were traditionally stored on tape. NCBI transforms these original format data into the standard SRA normalized format (12.7 petabytes) for redistribution. The normalized format contains base quality scores (BQS) that provide information about the quality of individual sequences within a dataset; however, because of the number of possible BQS for each base, they drastically increase file size, making BQS the largest cost driver for SRA storage in the cloud. Normalized format data are projected to grow to 33 petabytes by 2023; at this rate, the size of SRA will quickly exceed the NIH budget for storage and maintenance. New data compression techniques and/or storage models are necessary to increase the financial feasibility of maintaining this database, while also maximizing its value to the research community, well into the future.
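As a rough illustration of the growth figures above, the sketch below derives the annual growth rate implied by the 2019 size (12.7 petabytes) and the 2023 projection (33 petabytes). It assumes smooth compound growth over four full years, which the source does not state; the variable names are illustrative only.

```python
import math

# Figures taken from the text above (normalized format only).
size_2019_pb = 12.7
size_2023_pb = 33.0

# Assumption: the projection spans four full years of steady compound growth.
years = 2023 - 2019

# Implied compound annual growth rate under that assumption.
annual_growth = (size_2023_pb / size_2019_pb) ** (1 / years) - 1

# Implied doubling time in years at the same rate.
doubling_years = math.log(2) / math.log(1 + annual_growth)

print(f"Implied annual growth: {annual_growth:.1%}")
print(f"Implied doubling time: {doubling_years:.1f} years")
```

Under these assumptions the archive grows by roughly a quarter each year, so its size doubles about every three years.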

SRA Data Working Group of the NIH Council of Councils

The NIH previously engaged the SRA Data Working Group of the NIH Council of Councils to provide input on how to address the long-standing challenge of ensuring SRA’s sustainability as an archive of exponentially growing experimental data. The SRA Data Working Group provided recommendations to the Council of Councils on identifying and evaluating solutions to maintain efficiency in the storage footprint of SRA, specifically with respect to the use of BQS and format compression strategies. They considered new file formats and the availability of SRA data in both “hot” and “cold” storage in the cloud. “Hot” storage provides immediate access to data and is the default standard. Data in “cold” storage is not immediately accessible but can be stored at a reduced cost, with a charge to “thaw” (move) data from cold to hot storage. In summary, the Working Group recommended a new model for SRA data storage and retrieval in the cloud that involved maintaining two versions of SRA normalized format data: one with BQS and one without them, where the storage location and accessibility of data subsets would be optimized to balance cost with usage frequency. They requested that NCBI analyze data usage and tailor the storage plan accordingly, while monitoring use to prevent accidental massive overuse of NIH compute resources. The Working Group also recommended that cost models for cloud storage and compute be provided in clear language to the research community.
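The trade-off between hot and cold storage described above can be sketched as a simple monthly cost comparison. This is a minimal illustration, not NCBI's or any cloud provider's actual cost model: the function names are hypothetical, and the per-gigabyte prices are placeholder values chosen only to make the arithmetic visible.

```python
def hot_cost(size_gb: float, hot_price: float) -> float:
    """Monthly cost of keeping all data in hot storage."""
    return size_gb * hot_price


def cold_cost(size_gb: float, cold_price: float,
              thaw_price: float, thawed_fraction: float) -> float:
    """Monthly cost of cold storage plus the charge to thaw the accessed share."""
    return size_gb * cold_price + thawed_fraction * size_gb * thaw_price


def break_even_fraction(hot_price: float, cold_price: float,
                        thaw_price: float) -> float:
    """Share of data thawed per month at which cold and hot costs are equal."""
    return (hot_price - cold_price) / thaw_price


# Placeholder prices in dollars per GB per month (assumptions, not real rates).
HOT, COLD, THAW = 0.02, 0.004, 0.02

fraction = break_even_fraction(HOT, COLD, THAW)
print(f"Cold storage is cheaper while under {fraction:.0%} of the data is thawed per month")
```

In this simplified model, rarely accessed subsets are cheaper to keep cold, while frequently accessed subsets are cheaper to keep hot, which is why the recommendation ties storage location to usage frequency.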

Original Charge (September 2019)

The charge of the SRA Data Working Group of the Council of Councils is to provide recommendations to the Council on key factors for storing, managing, and accessing SRA data in cloud service provider environments. As the initial priority, the NIH is requesting that the WG evaluate and identify solutions to maintain efficiencies in the storage footprint of SRA, specifically the use of Base Quality Scores and format compression strategies.

Over a longer timeframe, the working group may be asked to evaluate other issues, including but not limited to:

  • Analysis of SRA and SRA services
  • Technical recommendations on SRA improvements and efficiencies
  • Recommendations on data retention, data models and/or data usage
  • Vision for future needs or opportunities, as these relate to SRA

The SRA Data Working Group is currently examining data analyses of SRA related to access, cost, and usage, as well as other areas. The SRA Data Working Group is using these analyses, among other factors and considerations, to evaluate and deliberate on data storage options. The group reports to the Council of Councils and will provide findings and draft recommendations on an ongoing basis.

https://dpcpsi.nih.gov/council/sradwg/roster

New Charge (September 2020)

The charge of the SRA Data Working Group of the Council of Councils is to provide recommendations to the Council regarding evaluation of SRA data storage, management, and access in cloud service provider environments. The WG will also continue to provide feedback as NIH monitors the effectiveness of strategies for SRA through data collection and analysis of the solutions implemented to maintain efficiencies in the storage footprint of SRA.

The working group will focus on evaluation of SRA as a resource and other related issues, including but not limited to:

  • Analysis and evaluation of strategies for/changes to SRA data storage, management, and access, including impact for the biomedical research community
  • Recommendations on data retention, data models and/or data usage that will keep costs to NIH within sustainable levels while maintaining community access to this large public data resource
  • Vision for future needs or opportunities, including sustaining SRA as a community resource

The SRA Data Working Group will examine data related to SRA scientific impact, value to the community, access, cost, and usage, as well as other areas, to inform their considerations and evaluations. The group reports to the Council of Councils and will provide findings and draft recommendations on an ongoing basis.

This page last reviewed on October 9, 2020