The past decade has seen a virtual explosion in the generation and analysis of genomic data. Due in part to NIH data sharing policies, much of this information is available through controlled access databases. A growing number of key scientific findings stem from these shared data, but we have seen only the tip of the iceberg in terms of leveraging these resources.
The National Cancer Institute’s (NCI) Epidemiology and Genomics Research Program (EGRP) is committed to promoting data sharing. As part of this commitment, NCI is continually developing resources to assist members of the research community interested in leveraging existing data for cancer research. This blog post provides an overview of how to navigate a key source of controlled access data: the database of Genotypes and Phenotypes (dbGaP).
Types of Data Housed in dbGaP
Originally, dbGaP was created as an archive of genome-wide association studies (GWAS), but as technology evolved, so did dbGaP. Currently, dbGaP comprises almost 400 datasets, with the number growing daily. The available datasets cover a wide variety of traits and phenotypes, from cancer to infectious diseases. The studies in dbGaP have used a variety of technologies to assess germline and somatic genomic variation, including single nucleotide polymorphism (SNP) genotyping, exome sequencing, whole-genome sequencing, and RNA sequencing. There are more than 30 cancer-related datasets in dbGaP covering more than 10 types of cancer, including studies of childhood cancers, consortia studies of common cancers, and individual studies of multiple cancer types. Learn more about cancer-related databases in dbGaP.
How to Access Data in dbGaP
If you are interested in accessing dbGaP data, but aren’t sure how to get started, here are the steps to follow:
- Log in to eRA Commons
- Review the available data sets and any data use limitations
- Select the datasets for your project
- Describe how you will use the data in a Research Use Statement (RUS)
- Submit your request to dbGaP in the form of a Data Access Request (DAR)
This YouTube video also explains the process.
Your DAR will be reviewed by the relevant NIH Data Access Committee (DAC). The DAC’s decision to approve a DAR is based on whether the proposed research complies with the data use limitations of the requested datasets. EGRP also has a webpage containing more details about the data access request process.
If you would like to see examples of research use statements, or if you’re curious about what others are doing with the available data, you can view research use statements from approved users on the public study page for each study currently released in dbGaP. See an EGRP webpage containing examples of research use statements and non-technical summaries.
An Evolving Resource
In just a few years, dbGaP has expanded from a database housing only a few GWAS to a large collection of hundreds of different studies. Because of its rapid growth, the process for submitting and requesting data is evolving, and dbGaP is always looking for ways to improve the process. Investigators who have been granted access to datasets have the opportunity to provide feedback on dbGaP and the data access process as part of their annual progress reports.
Invitation to Share Comments
EGRP invites comments on the following subjects: (1) examples of how you have used existing data from dbGaP or other public repositories in your own research; (2) ideas for leveraging existing shared data to accelerate cancer research; and (3) other resource-sharing needs.
Connect with Us at the 2013 AACR Annual Meeting
EGRP staff will be participating in an NCI-sponsored session entitled “Advancing Scientific Progress through Genomic Data Sharing” at the 2013 AACR Annual Meeting from 2:00 – 3:30 p.m. on Sunday, April 7th. EGRP staff will also be available on April 7th from 3:30 – 4:00 p.m. at the NCI exhibit booth for a Meet-the-Experts session, “Bridging the GaP: Accessing Cancer Datasets in the database of Genotypes and Phenotypes.” We hope AACR attendees will join us for these informative discussions.
Tiffany Green, M.H.S., M.P.H. is an employee of Kelly Scientific Services and works as a Program Analyst in the Epidemiology and Genomics Research Program’s (EGRP) Host Susceptibility Factors Branch (HSFB). She supports NCI’s implementation of the NIH GWAS Data Sharing Policy and works closely with NCI’s GWAS Program Administrator, the Data Access Committee, and other EGRP staff. Ms. Green also engages in projects relating to ethical issues in public health and human subjects research and supports trans-NCI bioethics activities.
Ms. Green was previously a Cancer Research Training Award Fellow in EGRP and has interned with the National Human Genome Research Institute, where she worked on lung cancer and prostate cancer studies. She has also worked with multiple health departments in Georgia.
Charlisse Caga-anan, J.D. is a Program Director in EGRP’s HSFB. In this capacity, she works to implement the NIH GWAS Data Sharing Policy and grow NCI’s portfolio of bioethics research grants.
Prior to joining EGRP, Ms. Caga-anan was a Postdoctoral Scholar at the Center for Genetic Research Ethics and Law, an NIH-funded Center for Excellence in Ethical, Legal, and Social Implications (ELSI) Research. In this capacity she conducted research pertaining to ELSI issues in genetics/genomics research. Ms. Caga-anan also completed the Cleveland Fellowship in Advanced Bioethics, during which she trained in clinical ethics consultation and examined ethical and legal issues in pediatrics, clinical genetics, genetics/genomics research, and human subjects research generally.
Carolyn M. Hutter, Ph.D. is a Program Director in EGRP’s HSFB. Her responsibilities include expanding and managing a portfolio of grants that supports the development and implementation of more sophisticated methods in statistical genetics and genetic epidemiology for studies of complex diseases such as cancer. This includes a focus on the many analytic and computational challenges accompanying the application of next-generation sequencing technologies to large-scale epidemiology studies.
Prior to joining EGRP, Dr. Hutter was a Senior Staff Scientist in the Cancer Prevention Program at the Fred Hutchinson Cancer Research Center and a Lecturer in the Epidemiology Department at the University of Washington. Her previous research focused on the role that genetic and environmental factors play in the risk of colorectal cancer, Parkinson’s disease, and other complex phenotypes.