Data Sources

BioQuery integrates six major cancer genomics databases, all accessible through natural language queries.

Overview

| Source | Description | Samples | Data Types |
|--------|-------------|---------|------------|
| TCGA | The Cancer Genome Atlas | 11,000+ | Expression, Mutations, CNV, Survival |
| TARGET | Pediatric cancers | 6,000+ | Expression, Mutations, Clinical |
| GTEx | Normal tissue reference | 17,000+ | Expression |
| CCLE | Cancer cell lines | ~1,000 | Expression, Mutations |
| CPTAC | Proteomics | ~1,500 | Protein abundance |
| GENIE | Real-world clinical | 40,000+ | Clinical data |

All data is accessed via ISB-CGC BigQuery for fast, reproducible queries.


TCGA (The Cancer Genome Atlas)

The primary data source for adult cancer genomics.

Available Cancer Types (33 types)

| Code | Cancer Type | Samples |
|------|-------------|---------|
| BRCA | Breast invasive carcinoma | ~1,100 |
| LUAD | Lung adenocarcinoma | ~600 |
| LUSC | Lung squamous cell carcinoma | ~500 |
| KIRC | Kidney renal clear cell carcinoma | ~530 |
| KIRP | Kidney renal papillary cell carcinoma | ~290 |
| GBM | Glioblastoma multiforme | ~170 |
| OV | Ovarian serous cystadenocarcinoma | ~300 |
| COAD | Colon adenocarcinoma | ~450 |
| STAD | Stomach adenocarcinoma | ~400 |
| PRAD | Prostate adenocarcinoma | ~500 |
| LIHC | Liver hepatocellular carcinoma | ~370 |
| PAAD | Pancreatic adenocarcinoma | ~180 |
| SKCM | Skin cutaneous melanoma | ~470 |
| HNSC | Head and neck squamous cell carcinoma | ~520 |
| BLCA | Bladder urothelial carcinoma | ~400 |
| THCA | Thyroid carcinoma | ~500 |
| UCEC | Uterine corpus endometrial carcinoma | ~550 |

The table above lists 17 of the 33 TCGA cancer types.

Data Types

  • RNA-seq expression: TPM-normalized, hg38 aligned
  • Somatic mutations: MC3 mutation calls
  • Clinical/survival: Overall survival, progression-free survival

Example Queries

  • Is EGFR expression higher in LUAD vs LUSC?
  • What's the TP53 mutation rate in breast cancer?
  • Does high DDR1 predict worse survival in kidney cancer?

TARGET (Therapeutically Applicable Research to Generate Effective Treatments)

Pediatric cancer genomics data.

Available Cancer Types

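| Code | Cancer Type | Samples |
|------|-------------|---------|
| ALL | Acute lymphoblastic leukemia | ~1,500 |
| AML | Acute myeloid leukemia | ~200 |
| NBL | Neuroblastoma | ~150 |
| OS | Osteosarcoma | ~100 |
| WT | Wilms tumor | ~120 |
| RT | Rhabdoid tumor | ~50 |
| CCSK | Clear cell sarcoma of kidney | ~20 |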

Common Synonyms

  • “childhood leukemia” or “pediatric leukemia” → ALL
  • “neuroblastoma” → NBL
  • “Wilms tumor” or “nephroblastoma” → WT
  • “osteosarcoma” or “bone cancer in children” → OS
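The synonym list above amounts to a phrase-to-code lookup. The following is a minimal sketch of such a lookup, not BioQuery's actual resolver: the function name `resolve_target_code` and the longest-phrase-first matching rule are assumptions made for illustration. Only the phrase-to-code pairs come from the list.

```python
from typing import Optional

# Phrase-to-code pairs taken from the synonym list above.
TARGET_SYNONYMS = {
    "childhood leukemia": "ALL",
    "pediatric leukemia": "ALL",
    "neuroblastoma": "NBL",
    "wilms tumor": "WT",
    "nephroblastoma": "WT",
    "osteosarcoma": "OS",
    "bone cancer in children": "OS",
}


def resolve_target_code(query: str) -> Optional[str]:
    """Return the TARGET code for the first synonym found in the query, or None.

    Hypothetical helper: matching is a case-insensitive substring test,
    trying longer phrases first so a multi-word synonym is preferred.
    """
    text = query.lower()
    for phrase in sorted(TARGET_SYNONYMS, key=len, reverse=True):
        if phrase in text:
            return TARGET_SYNONYMS[phrase]
    return None


if __name__ == "__main__":
    for q in ("What is MYC expression in neuroblastoma?",
              "TP53 in pediatric leukemia",
              "EGFR in lung cancer"):
        print(q, "->", resolve_target_code(q))
```

A query that mentions none of the listed phrases returns `None`, so the caller can fall back to asking for a cancer type explicitly.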

Example Queries

  • What is MYC expression in neuroblastoma?
  • Is MYCN amplified in neuroblastoma?
  • Compare TP53 in pediatric AML vs ALL

TARGET data uses the same RNA-seq pipeline as TCGA, so expression values are comparable.


GTEx (Genotype-Tissue Expression)

Normal tissue expression reference for tumor vs normal comparisons.

Available Tissues (54 tissues)

Common tissues used for comparison:

| Tissue | TCGA Comparison |
|--------|-----------------|
| Breast - Mammary Tissue | BRCA |
| Lung | LUAD, LUSC |
| Kidney - Cortex | KIRC, KIRP |
| Liver | LIHC |
| Colon - Transverse | COAD |
| Prostate | PRAD |
| Thyroid | THCA |
| Brain - Cortex | GBM, LGG |
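For scripted tumor vs normal comparisons, the pairing above can be kept as a small lookup keyed by TCGA code. This is an illustrative sketch only: the dictionary contents come from the table, while the helper name `normal_tissue_for` and its error handling are assumptions.

```python
# TCGA code -> GTEx tissue, taken from the table above.
GTEX_TISSUE_FOR_TCGA = {
    "BRCA": "Breast - Mammary Tissue",
    "LUAD": "Lung",
    "LUSC": "Lung",
    "KIRC": "Kidney - Cortex",
    "KIRP": "Kidney - Cortex",
    "LIHC": "Liver",
    "COAD": "Colon - Transverse",
    "PRAD": "Prostate",
    "THCA": "Thyroid",
    "GBM": "Brain - Cortex",
    "LGG": "Brain - Cortex",
}


def normal_tissue_for(tcga_code: str) -> str:
    """Return the GTEx tissue paired with a TCGA code (hypothetical helper)."""
    code = tcga_code.strip().upper()
    if code not in GTEX_TISSUE_FOR_TCGA:
        raise KeyError(f"No GTEx tissue listed for TCGA code {code!r}")
    return GTEX_TISSUE_FOR_TCGA[code]


if __name__ == "__main__":
    print(normal_tissue_for("luad"))
    print(normal_tissue_for("KIRP"))
```

Codes that are not in the table raise a `KeyError` rather than guessing a tissue, since only the listed pairings are documented here.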

Example Queries

  • Is MYC upregulated in breast cancer compared to normal?
  • Compare DDR1 expression in lung cancer vs normal tissue
  • Is TP53 higher in tumors vs normal liver?

GTEx samples are from healthy donors, while TCGA “matched normal” samples come from tissue adjacent to tumors. Results may differ.


CCLE (Cancer Cell Line Encyclopedia)

Gene expression and mutation data for ~1,000 cancer cell lines.

Available Primary Sites

| Site | Cell Lines |
|------|------------|
| Lung | 173 |
| Haematopoietic/lymphoid | 167 |
| Breast | 62 |
| Large intestine | 58 |
| Skin | 54 |
| Central nervous system | 46 |
| Ovary | 46 |
| Pancreas | 41 |
| Stomach | 37 |

Data Types

  • Expression: RMA-normalized microarray (Affymetrix)
  • Mutations: Whole-exome sequencing

Example Queries

  • What is DDR1 expression in lung cancer cell lines?
  • Which cell line sites have highest EGFR expression?
  • Is BRAF mutated in melanoma cell lines?

CCLE expression uses RMA normalization, not TPM. Values are not directly comparable to TCGA RNA-seq data.
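Because the two scales differ, one common workaround is to standardize each dataset on its own before looking at patterns across them. The sketch below shows within-dataset z-scoring. It is a general technique rather than something BioQuery is documented to do, and the helper names and sample numbers are made up for illustration.

```python
import math
from statistics import mean, pstdev


def log2_tpm(tpm: float) -> float:
    """Log-transform a TPM value with a pseudocount of 1 (a common convention)."""
    return math.log2(tpm + 1.0)


def zscores(values):
    """Standardize values to mean 0 and population standard deviation 1."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        raise ValueError("z-scores are undefined when all values are equal")
    return [(v - mu) / sigma for v in values]


if __name__ == "__main__":
    # Made-up values for one gene across four samples in each dataset.
    ccle_rma = [7.1, 8.4, 9.0, 10.3]            # RMA values (already log scale)
    tcga_tpm = [12.0, 35.5, 48.2, 110.7]        # TPM values (linear scale)

    ccle_z = zscores(ccle_rma)
    tcga_z = zscores([log2_tpm(t) for t in tcga_tpm])

    # Each list is now on a unitless, within-dataset scale.
    print([round(z, 2) for z in ccle_z])
    print([round(z, 2) for z in tcga_z])
```

Standardized values only describe where a sample sits within its own dataset. They do not make absolute expression levels in CCLE and TCGA interchangeable.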


CPTAC (Clinical Proteomic Tumor Analysis Consortium)

Mass spectrometry-based proteomics data.

Available Cancer Types

| Code | Cancer Type | Samples |
|------|-------------|---------|
| CCRCC | Clear cell renal cell carcinoma | ~110 |
| GBM | Glioblastoma | ~100 |
| HNSCC | Head and neck squamous cell carcinoma | ~110 |
| LSCC | Lung squamous cell carcinoma | ~110 |
| LUAD | Lung adenocarcinoma | ~110 |
| PDA | Pancreatic ductal adenocarcinoma | ~140 |
| UCEC | Uterine corpus endometrial carcinoma | ~100 |
| BRCA | Breast cancer | ~120 |
| COAD | Colon adenocarcinoma | ~110 |
| OV | Ovarian cancer | ~80 |

Data Types

  • Protein abundance: Log2 ratio vs pooled reference
  • Phosphoproteomics: Phosphosite quantification (available for most cancer types)

Example Queries

  • What is TP53 protein abundance in glioblastoma?
  • Compare DDR1 protein levels across CPTAC cancers
  • Does DDR1 mRNA correlate with protein in breast cancer?

Protein abundance is reported as log2 ratios relative to a pooled reference, not absolute concentrations.
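To read a log2 ratio, raise 2 to that power: a value of 1 means twice the pooled reference, 0 means equal to it, and -1 means half. The short sketch below does this conversion. The function names are illustrative, not part of BioQuery.

```python
def fold_change(log2_ratio: float) -> float:
    """Convert a log2 ratio (sample vs pooled reference) to a linear fold change."""
    return 2.0 ** log2_ratio


def relative_fold_change(log2_a: float, log2_b: float) -> float:
    """Fold change of sample A relative to sample B.

    Both inputs are log2 ratios against the same pooled reference,
    so the reference cancels when the two are subtracted.
    """
    return 2.0 ** (log2_a - log2_b)


if __name__ == "__main__":
    print(fold_change(1.0))                 # twice the reference
    print(fold_change(-1.0))                # half the reference
    print(relative_fold_change(1.5, 0.5))   # A is twice B
```

The subtraction in `relative_fold_change` is only meaningful when both samples were measured against the same pooled reference.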


GENIE (Genomics Evidence Neoplasia Information Exchange)

Real-world clinical genomics data from multiple cancer centers.

Overview

  • Patients: 40,000+
  • Institutions: DFCI, MSK, VICC, JHU, and others
  • Data: Clinical characteristics, demographics

Available Cancer Sites

| Site | Patients |
|------|----------|
| Lung | ~7,000 |
| Breast | ~5,000 |
| Colon | ~4,500 |
| CNS | ~2,700 |
| Skin | ~2,000 |
| Ovary | ~1,800 |
| Pancreas | ~1,600 |
| Prostate | ~1,500 |

Example Queries

  • What is the age distribution in lung cancer patients?
  • Clinical summary for breast cancer patients

GENIE currently provides clinical and demographic data only. Genomic data (mutations, expression) is coming soon.


Data Access

All BioQuery data is accessed via the ISB-CGC BigQuery public datasets. This ensures:

  • Speed: Sub-second query execution on terabytes of data
  • Reproducibility: Exact SQL queries provided for every analysis
  • Transparency: All tables and versions documented

BigQuery Tables Used

    # TCGA
    isb-cgc-bq.TCGA.RNAseq_hg38_gdc_current
    isb-cgc-bq.TCGA.masked_somatic_mutation_hg38_gdc_current
    isb-cgc-bq.TCGA.clinical_gdc_current

    # TARGET
    isb-cgc-bq.TARGET.RNAseq_hg38_gdc_current
    isb-cgc-bq.TARGET.masked_somatic_mutation_hg38_gdc_current

    # GTEx
    isb-cgc.GTEx_v7.gene_median_tpm

    # CCLE
    isb-cgc-bq.CCLE.RMA_expression_hg19_current
    isb-cgc-bq.CCLE.somatic_mutation_hg19_current

    # CPTAC
    isb-cgc-bq.CPTAC.quant_proteome_*

    # GENIE
    isb-cgc-bq.GENIE.clinical_gdc_current
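
The tables above can also be queried directly. The following is a rough sketch of what a query against the TCGA RNA-seq table might look like, not the SQL that BioQuery generates. The column names `project_short_name`, `gene_name`, and `tpm_unstranded` are assumptions and should be checked against the table schema before use. Running the query requires the `google-cloud-bigquery` package and Google Cloud credentials; building the SQL text does not.

```python
import textwrap

# Table name taken from the list above.
TCGA_RNASEQ_TABLE = "isb-cgc-bq.TCGA.RNAseq_hg38_gdc_current"


def build_mean_expression_sql(table: str = TCGA_RNASEQ_TABLE) -> str:
    """Build SQL for mean TPM of one gene per project.

    ASSUMPTION: the columns project_short_name, gene_name and tpm_unstranded
    exist in the table; verify against the real schema.
    """
    return textwrap.dedent(f"""\
        SELECT
          project_short_name,
          COUNT(*) AS n_rows,
          AVG(tpm_unstranded) AS mean_tpm
        FROM `{table}`
        WHERE gene_name = @gene
          AND project_short_name IN UNNEST(@projects)
        GROUP BY project_short_name
        ORDER BY project_short_name
        """)


def run_mean_expression(gene: str, projects: list) -> list:
    """Run the query with named parameters and return rows as dicts."""
    # Imported here so the SQL builder works without the client library.
    from google.cloud import bigquery

    client = bigquery.Client()
    job_config = bigquery.QueryJobConfig(
        query_parameters=[
            bigquery.ScalarQueryParameter("gene", "STRING", gene),
            bigquery.ArrayQueryParameter("projects", "STRING", projects),
        ]
    )
    rows = client.query(build_mean_expression_sql(), job_config=job_config).result()
    return [dict(row.items()) for row in rows]


if __name__ == "__main__":
    print(build_mean_expression_sql())
    # Example call (project names are assumed to follow the "TCGA-<code>" form):
    # run_mean_expression("EGFR", ["TCGA-LUAD", "TCGA-LUSC"])
```

Named query parameters (`@gene`, `@projects`) keep user-supplied values out of the SQL text, which matters when the gene or cancer type originates from a free-text question.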