Sequence Read Archive (SRA) Data Working Group

Background

The National Center for Biotechnology Information (NCBI), part of the National Library of Medicine, hosts one of NIH's largest and most diverse datasets, the Sequence Read Archive (SRA). The SRA is a broad collection of experimental DNA and RNA sequences that represent genome diversity across the tree of life.

The SRA was moved to Google Cloud Platform (GCP) and Amazon Web Services (AWS) cloud services in 2019 as part of the NIH Science and Technology Research Infrastructure for Discovery, Experimentation, and Sustainability (STRIDES) Initiative. In 2019 the SRA held 9 million records in two formats. The original format (23 petabytes) is received by NCBI from submitters and is instrument- and experiment-specific; these data were traditionally stored on tape. NCBI transforms these original format data into the standard SRA normalized format (12.7 petabytes) for redistribution. The normalized format contains base quality scores (BQS) that provide information about the quality of individual sequences within a dataset; however, because of the number of possible BQS for each base, they drastically increase file size, making BQS the largest cost driver for SRA storage in the cloud. Normalized format data are projected to grow to 33 petabytes by 2023; at this rate, the size of SRA will quickly exceed the NIH budget for storage and maintenance. New data compression techniques and/or storage models are necessary to increase the financial feasibility of maintaining this database, while also maximizing its value to the research community, well into the future.
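To make the growth projection concrete, the short Python sketch below works out the annual growth rate implied by the two figures quoted above. It assumes that the 12.7 petabyte figure is the 2019 baseline and that growth compounds at a constant yearly rate; neither assumption is stated in the source, and the function names are illustrative only.

```python
import math

# Figures quoted in the text above.
BASELINE_PB = 12.7      # normalized format size, taken here as the 2019 value
PROJECTED_PB = 33.0     # projected normalized format size by 2023
YEARS = 2023 - 2019


def implied_annual_growth(start_pb: float, end_pb: float, years: int) -> float:
    """Constant yearly growth rate that takes start_pb to end_pb in `years` years.

    Assumption: growth compounds once per year at a fixed rate.
    """
    return (end_pb / start_pb) ** (1 / years) - 1


def doubling_time_years(annual_growth: float) -> float:
    """Years needed for the archive to double at the given yearly growth rate."""
    return math.log(2) / math.log(1 + annual_growth)


growth = implied_annual_growth(BASELINE_PB, PROJECTED_PB, YEARS)
doubling = doubling_time_years(growth)

print(f"Implied annual growth: {growth:.1%}")
print(f"Implied doubling time: {doubling:.1f} years")
```

Under these assumptions the normalized data would grow by roughly 27 percent per year and double in a little under three years.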

SRA Data Working Group of the NIH Council of Councils

The NIH previously engaged the SRA Data Working Group of the NIH Council of Councils to provide input on how to address the long-standing challenge of ensuring SRA’s sustainability as an archive of exponentially growing experimental data. The SRA Data Working Group provided recommendations to the Council of Councils around identifying and evaluating solutions to maintain efficiency in the storage footprint of SRA, specifically relevant to the use of BQS and format compression strategies. They considered new file formats and the availability of SRA data in both “hot” and “cold” storage in the cloud. “Hot” storage provides immediate access to data and is the default standard. Data in “cold” storage is not immediately accessible but can be stored at a reduced cost, with a charge to “thaw” (move) data from cold to hot storage. In summary, the Working Group recommended a new model for SRA data storage and retrieval in the cloud that involved maintaining two versions of SRA normalized format data: one with BQS and one without them, where the storage location and accessibility of data subsets would be optimized to balance cost with usage frequency. They requested that NCBI analyze data usage and tailor the storage plan accordingly, while monitoring use to prevent accidental massive overuse of NIH compute resources. The Working Group also recommended that cost models for cloud storage and compute be provided in clear language to the research community.
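The trade-off between hot and cold storage described above can be sketched as a simple break-even calculation. The Python example below is illustrative only: the prices are hypothetical placeholders rather than actual AWS or GCP rates, and the cost model (a flat monthly storage price plus a per-terabyte thaw charge) is a simplifying assumption rather than the Working Group's own model.

```python
def monthly_cost(size_tb: float, storage_price: float,
                 thawed_tb: float = 0.0, thaw_price: float = 0.0) -> float:
    """Monthly cost of holding size_tb, plus any charge for thawed data.

    Prices are in dollars per terabyte (per month for storage).
    """
    return size_tb * storage_price + thawed_tb * thaw_price


def break_even_thaw_fraction(hot_price: float, cold_price: float,
                             thaw_price: float) -> float:
    """Fraction of a dataset thawed per month at which cold storage
    costs the same as keeping everything hot."""
    return (hot_price - cold_price) / thaw_price


# Hypothetical placeholder prices, NOT real cloud provider rates.
HOT_PRICE = 20.0    # dollars per TB per month
COLD_PRICE = 4.0    # dollars per TB per month
THAW_PRICE = 30.0   # dollars per TB moved from cold to hot

size_tb = 1000.0
fraction = break_even_thaw_fraction(HOT_PRICE, COLD_PRICE, THAW_PRICE)

hot_total = monthly_cost(size_tb, HOT_PRICE)
cold_total = monthly_cost(size_tb, COLD_PRICE,
                          thawed_tb=size_tb * fraction,
                          thaw_price=THAW_PRICE)

print(f"Break-even thaw fraction: {fraction:.2f}")
print(f"Hot: ${hot_total:,.0f}  Cold at break-even: ${cold_total:,.0f}")
```

With these placeholder prices, cold storage is cheaper whenever less than about half of the data is thawed in a given month, which is why usage analysis matters when deciding where each data subset should live.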

Original Charge (September 2019)

The charge of the SRA Data Working Group of the Council of Councils is to provide recommendations to the Council on key factors for storing, managing, and accessing SRA data in cloud service provider environments. As the initial priority, the NIH is requesting that the WG evaluate and identify solutions to maintain efficiencies in the storage footprint of SRA, and specifically evaluate the use of Base Quality Scores and format compression strategies.
Over a longer timeframe the working group may be asked to evaluate other issues, including but not limited to:

Analysis of SRA and SRA services
Technical recommendations on SRA improvements and efficiencies
Recommendations on data retention, data models and/or data usage
Vision for future needs or opportunities, as these relate to SRA

The SRA Data Working Group is currently examining data analyses of SRA related to access, cost, and usage, as well as other areas. The SRA Data Working Group is using these analyses, among other factors and considerations, to evaluate and deliberate data storage options. The group reports to the Council of Councils and will provide findings and draft recommendations on an ongoing basis.

New Charge (September 2020)

The charge of the SRA Data Working Group of the Council of Councils is to provide recommendations to the Council regarding evaluation of SRA data storage, management, and access in cloud service provider environments. The WG will also continue to provide feedback as NIH monitors the effectiveness of strategies for SRA through data collection and analysis of the solutions implemented to maintain efficiencies in the storage footprint of SRA.

The working group will focus on evaluation of SRA as a resource and other related issues, including but not limited to:

Analysis and evaluation of strategies for/changes to SRA data storage, management, and access, including impact for the biomedical research community
Recommendations on data retention, data models and/or data usage that will keep costs to NIH within sustainable levels while maintaining community access to this large public data resource
Vision for future needs or opportunities, including sustaining SRA as a community resource.

The SRA Data Working Group will examine data related to SRA scientific impact, value to the community, access, cost, and usage, as well as other areas, to inform their considerations and evaluations. The group reports to the Council of Councils and will provide findings and draft recommendations on an ongoing basis.

Roster

Co-Chairs

Susan Gregurick, Ph.D.
Associate Director for Data Science
Office of the Director
National Institutes of Health
2 Center Drive, Room 1W21, Building 2
Bethesda, MD 20852
Phone: (301) 827-7616
Email: [email protected]

Kevin B. Johnson, M.D., M.S.
Cornelius Vanderbilt Chair and Professor
Department of Biomedical Informatics
Vanderbilt University Medical Center
2525 West End Avenue, Suite 1475
Nashville, TN 37203-8556
Phone: (615) 936-6867
Fax: (615) 936-0102
Email: [email protected]
Assistant: Teresa Gillespie
Email: [email protected]

Members

Kristin Ardlie, Ph.D.
Director, Genotype-Tissue Expression (GTEx LDACC)
Broad Institute of MIT and Harvard University
75 Ames Street
Cambridge, MA 02141
Phone: (617) 714-7901
Email: [email protected]

Toby Bloom, Ph.D.
Deputy Scientific Director, Informatics
New York Genome Center
101 6th Avenue
New York, NY 10013
Email: [email protected]

Rob Edwards, Ph.D.
Professor of Bioinformatics
San Diego State University
5500 Campanile Drive
Office PS 123
San Diego, CA 92182
Phone: (619) 594-1672
Email: [email protected]

Rick Horwitz, Ph.D.
Executive Director
Allen Institute for Cell Science
615 Westlake Avenue North
Seattle, WA 98109
Phone: (206) 548-8503
Email: [email protected]
Assistant: Karen Pabillon-Green
Phone: (206) 548-8536
Email: [email protected]

Hyun Min Kang, Ph.D.
Associate Professor
University of Michigan School of Public Health
M4623 SPH I Tower
1415 Washington Heights
Ann Arbor, MI 48109
Phone: (734) 647-1980
Email: [email protected]

Debbie Nickerson, Ph.D.
Professor of Genome Sciences
Adjunct Professor of Bioengineering
University of Washington
Foege S-213B
P.O. Box 355065
Seattle, WA 98195
Phone: (206) 685-7387
Email: [email protected]

Jinghui Zhang, Ph.D.
Chair, Department of Computational Biology
St. Jude Graduate School of Biomedical Sciences
St. Jude Children’s Research Hospital
MS 1135, Room IA6038
262 Danny Thomas Place
Memphis, TN 38105-3678
Phone: (901) 595-7069
Email: [email protected]