For researchers and scientists looking for cancer genomics data, the Genomic Data Commons (GDC) is a good place to start. This article gives a bird's-eye view of the data in the repository and points to the main resources for navigating it. A highly curated version of the data hosted on GDC can be accessed through Polly's GDC Omixatlas.
What is Genomic Data Commons?
The Genomic Data Commons (GDC) is a research program of the National Cancer Institute (NCI). Its mission is to provide the cancer research community with a unified repository and cancer knowledge base that enables data sharing across cancer genomic studies in support of precision medicine. The GDC contains NCI-generated data from some of the largest and most comprehensive cancer genomic datasets, including The Cancer Genome Atlas (TCGA) and Therapeutically Applicable Research to Generate Effective Therapies (TARGET).
Cancer is essentially a disease of the genome, caused by changes in the DNA, RNA, and proteins of a cell that push cell division into overdrive. Identifying the genomic alterations that arise in cancer helps researchers decode how cancer develops and improve diagnosis and treatment based on each cancer's distinct molecular abnormalities. Through the GDC knowledge base, researchers can use GDC data on mutations, copy number variants, expression quantification, and post-translational modifications to help identify both high- and low-frequency cancer drivers.
What kind of data can one find in GDC?
The GDC provides researchers with access to standardized clinical and genomic data from cancer studies to enable exploratory analysis. It employs a data model, called the Genomic Data Model, made of two components:
- Genomic data [in Browser Extensible Data (BED) format]
- Related metadata [in tab-delimited key-value format]
The GDC genomic data has also been extended with information extracted from other public genomic databases (e.g., GENCODE, HGNC, and miRBase).
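To make the two components concrete, here is a minimal Python sketch of reading one BED region line and a block of tab-delimited key-value metadata. It assumes only the three mandatory BED columns (chromosome, start, end) and treats everything after them as opaque extra columns. The function names and the sample rows in the usage note are invented for illustration and are not taken from GDC files.

```python
from typing import Dict, List, NamedTuple, Tuple


class BedRegion(NamedTuple):
    """One genomic region from a BED line."""
    chrom: str
    start: int              # 0-based start, per the BED convention
    end: int                # end position, not included in the region
    extra: Tuple[str, ...]  # any columns after the first three, left unparsed


def parse_bed_line(line: str) -> BedRegion:
    """Parse a single tab-separated BED line into a BedRegion."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 3:
        raise ValueError("a BED line needs at least chrom, start and end")
    return BedRegion(fields[0], int(fields[1]), int(fields[2]), tuple(fields[3:]))


def parse_metadata(text: str) -> Dict[str, List[str]]:
    """Parse tab-delimited key-value lines; a repeated key keeps all its values."""
    metadata: Dict[str, List[str]] = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        key, _, value = line.partition("\t")
        metadata.setdefault(key, []).append(value)
    return metadata
```

For example, `parse_bed_line("chr1\t100\t200\t+")` yields a region on `chr1` spanning 100 to 200 with one extra column, and `parse_metadata("sample_type\tPrimary Tumor\n")` maps `sample_type` to a one-element list.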
How to access data on GDC?
Data in the GDC can be accessed through the user‑friendly web‑based GDC Data Portal, which enables browsing, querying, and downloading of data and metadata. In addition, the GDC provides a command-line tool for downloading large volumes of data and an application programming interface (API) for programmatic access to GDC functionality.
Some data in the GDC is open access: no authentication or authorization is needed, so no login is required. Other data is controlled access: users must authenticate through eRA Commons and hold dbGaP authorization for the data before they can download it.
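The difference between the two access levels shows up in how a request to the API is built. The sketch below is a minimal illustration, assuming the public API base `https://api.gdc.cancer.gov`, the `access` field on file records, and the `X-Auth-Token` header that the GDC API accepts for controlled-access downloads. The helper names are hypothetical, and the token is whatever an authorized user has obtained from the GDC after logging in. Nothing here contacts the network; the functions only construct the filter and the request.

```python
import urllib.request
from typing import Optional

GDC_DATA_ENDPOINT = "https://api.gdc.cancer.gov/data"


def access_filter(level: str) -> dict:
    """Build a GDC filter that selects files by access level."""
    if level not in ("open", "controlled"):
        raise ValueError(f"unknown access level: {level!r}")
    return {"op": "in", "content": {"field": "access", "value": [level]}}


def auth_headers(token: Optional[str] = None) -> dict:
    """Return request headers; a token is only needed for controlled data."""
    return {"X-Auth-Token": token} if token else {}


def build_download_request(file_id: str, token: Optional[str] = None) -> urllib.request.Request:
    """Prepare (but do not send) a download request for one file UUID."""
    return urllib.request.Request(
        f"{GDC_DATA_ENDPOINT}/{file_id}",
        headers=auth_headers(token),
    )
```

Passing the prepared request to `urllib.request.urlopen` would start the download. For an open-access file the token can be omitted; for a controlled-access file the server is expected to reject the request unless a valid token is supplied.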
How to download data from GDC?
The GDC provides several resources for querying and downloading data:
- For querying and downloading GDC data files: GDC Data Portal
- For downloading large volumes of files: GDC Data Transfer Tool
- For performing programmatic queries and downloads: GDC Application Programming Interface (API)
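As an illustration of the programmatic route, the following Python sketch queries the API's `files` endpoint for the files belonging to one project. It assumes the public endpoint `https://api.gdc.cancer.gov/files` and the field names `cases.project.project_id`, `file_id`, `file_name`, `access`, and `data_category`. The function names are hypothetical, and this is one possible way to call the API rather than a recommended client.

```python
import json
import urllib.request

GDC_FILES_ENDPOINT = "https://api.gdc.cancer.gov/files"


def build_files_query(project_id: str, size: int = 10) -> dict:
    """Build the JSON payload for a file search restricted to one project."""
    return {
        "filters": {
            "op": "in",
            "content": {"field": "cases.project.project_id", "value": [project_id]},
        },
        # Comma-separated list of the fields to return for each file
        "fields": "file_id,file_name,access,data_category",
        "format": "JSON",
        "size": size,  # maximum number of hits to return
    }


def search_files(project_id: str, size: int = 10) -> list:
    """POST the query to the files endpoint and return the list of hits."""
    payload = json.dumps(build_files_query(project_id, size)).encode("utf-8")
    request = urllib.request.Request(
        GDC_FILES_ENDPOINT,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)["data"]["hits"]
```

Calling `search_files("TCGA-BRCA", size=5)` would return up to five file records for that project, each carrying a `file_id` that can then be handed to the Data Portal, the Data Transfer Tool, or the API for download. For large batches of files, the Data Transfer Tool is likely the better fit.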
How to cite GDC?
Authors who use data from GDC are requested to credit the NCI Genomic Data Commons (GDC) in the manuscript by citing the following paper about the GDC:
‘Grossman, Robert L., Heath, Allison P., Ferretti, Vincent, Varmus, Harold E., Lowy, Douglas R., Kibbe, Warren A., Staudt, Louis M. (2016) Toward a Shared Vision for Cancer Genomic Data. New England Journal of Medicine 375:12, 1109-1112.’
When citing individual projects, please follow the project's own attribution policy where one is available.
To learn more about accessing and analyzing highly curated data sourced from GDC and many other repositories on Polly, schedule a meeting by writing to us at email@example.com