Cancer is a disease of two genomes: the host (germline) genome and the tumor (somatic) genome. Extensive data have been generated from both. Most studies of cancer risk focus on the germline genome, using genome-wide association studies (GWAS) to identify potential risk variants. More recently, whole genome and whole exome sequencing approaches have also been used, and hundreds of variants associated with cancer risk have been identified to date. However, these variants do not fully explain the heritability of cancer. On the somatic side, The Cancer Genome Atlas (TCGA) program and the International Cancer Genome Consortium have generated comprehensive molecular profiles of more than 50 tumor types, representing 25,000 tumor genomes with matched normal genome data. This wealth of data is being extensively mined to understand the driver events responsible for tumor growth.
For understanding the heritable component of cancer risk, integrating somatic and germline data has been useful in two ways: 1) aiding the functional analysis of variants identified in genetic association studies, and 2) re-assessing our understanding of cancer types in light of somatic molecular data.
Most cancer risk variants identified by genetic association studies are located in unannotated or noncoding regions of the genome, which makes understanding their function a challenge. Using TCGA gene expression data, investigators have compared expression patterns in tumor tissue with those in matched normal tissue (the normal tissue provides information about the germline genome) and have detected changes in gene expression in some genomic regions where risk variants are located. They hypothesize that some variants may cause changes in gene expression, i.e., that they function as expression quantitative trait loci (eQTLs), thus providing a link between germline risk variants and events that occur in the tumor itself. These efforts have identified genes differentially regulated by risk variants, several of which were not previously known to be involved in cancer. Similar comparisons have found that risk-associated genotypes identified by GWAS may affect methylation patterns that in turn alter gene expression.
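To make the eQTL idea concrete, here is a minimal sketch of the kind of test involved: regressing a gene's expression level on the number of risk alleles each person carries. The function name, the additive dosage coding (0, 1, or 2 risk alleles), and the numbers are illustrative assumptions, not the method or data of any particular study. A real analysis would also adjust for covariates and correct for multiple testing.

```python
from math import sqrt


def eqtl_association(dosages, expression):
    """Least-squares regression of expression on risk-allele dosage.

    dosages: risk-allele counts per sample (0, 1, or 2), an assumed additive coding.
    expression: expression level of one gene in the same samples.
    Returns (slope, r): change in expression per risk allele, and Pearson's r.
    """
    n = len(dosages)
    if n != len(expression) or n < 3:
        raise ValueError("need at least 3 paired observations")
    mean_g = sum(dosages) / n
    mean_e = sum(expression) / n
    sxx = sum((g - mean_g) ** 2 for g in dosages)
    syy = sum((e - mean_e) ** 2 for e in expression)
    sxy = sum((g - mean_g) * (e - mean_e) for g, e in zip(dosages, expression))
    if sxx == 0 or syy == 0:
        raise ValueError("no variation in genotype or expression")
    return sxy / sxx, sxy / sqrt(sxx * syy)


# Made-up values for illustration only; these are not TCGA data.
dosages = [0, 0, 1, 1, 2, 2]
expression = [1.0, 1.2, 2.0, 2.2, 3.0, 3.2]
slope, r = eqtl_association(dosages, expression)
print(f"expression change per risk allele: {slope:.2f}, r = {r:.3f}")
```

A slope that differs clearly from zero would be consistent with the variant acting as an eQTL for that gene, although it does not by itself establish causation.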
Tumor molecular data can define tumor type more precisely than histological categories. Incorporating these data into studies of genetic risk may help reduce phenotypic heterogeneity among tumors currently defined as a single histological type, and thus may facilitate discovery of germline variants that increase risk for a specific cancer subtype. For example, a meta-analysis of estrogen receptor-negative breast tumors has identified germline variants associated with risk for this specific breast cancer subtype. Emerging data from genomic profiling of tumor tissue have revealed that certain types of breast and ovarian cancer may be more similar at the molecular level than previously thought, and this information may help clarify why certain genetic risk variants are associated with several different cancer types. Because complete tumor profiles have been generated only relatively recently, incorporation of tumor data into association studies is in its early stages. As this research progresses, it will be interesting to see whether specific germline variants are associated with specific tumor subtypes, and whether working with a more homogeneous tumor phenotype, as defined by molecular profiling data, will increase the power of association studies to identify rarer, novel variants.
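The following sketch illustrates why a more homogeneous phenotype might help. It computes an allelic odds ratio for two hypothetical molecular subtypes separately and then for the two pooled together. All counts, subtype labels, and the function name are invented for illustration. The point is only that an effect confined to one subtype looks weaker when subtypes are analyzed as a single group.

```python
def allelic_odds_ratio(case_risk, case_other, control_risk, control_other):
    """Odds ratio for carrying the risk allele, cases versus controls (allele counts)."""
    if min(case_risk, case_other, control_risk, control_other) <= 0:
        raise ValueError("all allele counts must be positive")
    return (case_risk * control_other) / (case_other * control_risk)


# Hypothetical allele counts as (risk allele, other allele); not real data.
controls = (300, 700)
subtype_a = (200, 300)  # assumed subtype in which the variant raises risk
subtype_b = (150, 350)  # assumed subtype in which the variant has no effect
pooled = (subtype_a[0] + subtype_b[0], subtype_a[1] + subtype_b[1])

or_a = allelic_odds_ratio(*subtype_a, *controls)
or_b = allelic_odds_ratio(*subtype_b, *controls)
or_pooled = allelic_odds_ratio(*pooled, *controls)
print(f"subtype A: {or_a:.2f}, subtype B: {or_b:.2f}, pooled: {or_pooled:.2f}")
```

With these invented counts, the pooled odds ratio falls between the two subtype-specific values, so a study that treats both subtypes as one disease would need more samples to detect the subtype A effect.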
Integrating germline and somatic genomic data will involve working with large and complex data types, including GWAS and whole genome and whole exome sequencing data; it will also require studies with large numbers of samples and the participation of investigators with a broad range of expertise. The National Institutes of Health Big Data to Knowledge (BD2K) initiative may help address issues related to complex data analysis. NCI’s Cohort Consortium and GAME-ON initiative offer examples of productive collaboration involving large numbers of investigators with expertise in a range of fields, including genetic epidemiology, molecular biology, and clinical sciences. The Database of Genotypes and Phenotypes (dbGaP) provides controlled access to existing datasets that contain both germline and somatic data. For further information on tools and other resources available for cancer genomic research, please visit EGRP’s Genomic Resources for Cancer Epidemiology web page.
We Want to Hear From You
The Epidemiology and Genomics Research Program (EGRP) in NCI’s Division of Cancer Control and Population Sciences is interested in hearing about opportunities and challenges faced by investigators undertaking research that integrates data from the germline and tumor genomes, any resources that may be particularly helpful, and ways in which EGRP could facilitate this research.
Stefanie Nelson, Ph.D., is a Program Director in EGRP’s Host Susceptibility Factors Branch. She is responsible for developing and managing a portfolio of grants that focuses on host factors affecting cancer risk.