Analysis Types
BioQuery supports multiple types of genomic analyses across six data sources. Learn when to use each one.
Data Sources
| Source | Description | Data Types |
|---|---|---|
| TCGA | The Cancer Genome Atlas - 33 adult cancer types | Expression, Mutations, Survival |
| TARGET | Pediatric cancers - 7 cancer types | Expression, Mutations, Clinical |
| GTEx | Normal tissue reference - 54 tissues | Expression |
| CCLE | Cancer Cell Line Encyclopedia - ~1,000 cell lines | Expression, Mutations |
| CPTAC | Clinical Proteomics - 10 cancer types | Protein abundance, Phosphoproteomics |
| GENIE | Real-world clinical - ~40,000 patients | Clinical data |
Differential Expression
Compare gene expression between two cancer types or conditions.
When to Use
- Comparing expression between cancer subtypes
- Investigating tissue-specific expression patterns
- Validating gene signatures across cancer types
Example Queries
Is DDR1 expression higher in papillary RCC vs clear cell RCC?
Compare EGFR expression between LUAD and LUSC
How does HER2 expression differ between ER+ and ER- breast cancer?
Statistical Method
- Test: Wilcoxon rank-sum test (non-parametric)
- Metric: Fold change (log2 difference in medians)
- Visualization: Boxplot with individual data points
Interpreting Results
Expression values are log2(TPM+1) normalized. Because the metric is a log2 difference in medians, a difference of 1 corresponds to a fold change of 2, meaning the gene is expressed ~2x higher in one group.
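The statistical method above can be sketched in a few lines of standard-library Python. This is a minimal illustration and not BioQuery's actual implementation: the function names and the expression values are hypothetical, and the p-value uses the normal approximation to the Wilcoxon rank-sum test (with tie correction), which is only reasonable for moderately sized groups.

```python
from math import sqrt
from statistics import NormalDist, median


def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def rank_sum_test(group_a, group_b):
    """Two-sided Wilcoxon rank-sum test (normal approximation).

    Returns (U statistic for group_a, p-value).
    """
    n_a, n_b = len(group_a), len(group_b)
    n = n_a + n_b
    ranks = average_ranks(list(group_a) + list(group_b))
    u_a = sum(ranks[:n_a]) - n_a * (n_a + 1) / 2
    # Tie correction: count how many observations share each rank.
    counts = {}
    for r in ranks:
        counts[r] = counts.get(r, 0) + 1
    tie_term = sum(t**3 - t for t in counts.values())
    var_u = n_a * n_b / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    if var_u == 0:
        return u_a, 1.0
    z = (u_a - n_a * n_b / 2) / sqrt(var_u)
    return u_a, 2 * (1 - NormalDist().cdf(abs(z)))


# Hypothetical log2(TPM+1) values for one gene in two groups.
group_a = [5.1, 5.4, 6.0, 6.2, 6.8]
group_b = [3.9, 4.2, 4.4, 4.9, 5.0]

u_stat, p_value = rank_sum_test(group_a, group_b)
log2_diff = median(group_a) - median(group_b)  # difference on the log2 scale
fold_change = 2 ** log2_diff                   # back to a linear ratio

print(f"U = {u_stat}, p = {p_value:.4f}")
print(f"log2 difference = {log2_diff:.2f}, fold change = {fold_change:.2f}x")
```

Because the inputs are already log2-transformed, subtracting medians gives the log2 fold change directly; raising 2 to that power recovers the linear ratio.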
Tumor vs Normal
Compare gene expression in tumor tissue versus matched normal tissue.
When to Use
- Identifying genes upregulated in cancer
- Finding potential therapeutic targets
- Validating known oncogenes/tumor suppressors
Example Queries
Is TP53 upregulated in breast cancer compared to normal?
What's the fold change of BRCA1 in ovarian tumors vs normal?
Is MYC overexpressed in liver cancer?
Data Sources
- Tumor: TCGA tumor samples
- Normal: TCGA matched normal + GTEx normal tissue
Statistical Method
- Test: Wilcoxon rank-sum test
- Metric: Fold change and log2 difference
- Visualization: Grouped boxplot (Tumor vs Normal)
Not all cancer types have matched normal samples, so some comparisons use GTEx data as the normal reference.
Mutation Frequency
Calculate how often a gene is mutated in a specific cancer type.
When to Use
- Identifying driver mutations
- Understanding mutation landscape of a cancer
- Finding potential biomarkers
Example Queries
What percentage of glioblastoma has IDH1 mutations?
How common is BRAF V600E in melanoma?
What's the TP53 mutation rate in colorectal cancer?
Statistical Method
- Metric: Percentage of samples with mutation
- Data: TCGA somatic mutation calls (MC3)
- Visualization: Bar chart with confidence intervals
Mutation Types Included
| Type | Description |
|---|---|
| Missense | Amino acid change |
| Nonsense | Premature stop codon |
| Frameshift | Insertion/deletion causing frame shift |
| Splice site | Affects mRNA splicing |
Copy number alterations are analyzed separately and not included in mutation frequency calculations.
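As a rough illustration of the calculation described above, the sketch below counts samples carrying at least one included mutation type and attaches a 95% confidence interval. The documentation does not say which interval method is used; the Wilson score interval here is an assumption, and the type labels, function name, and sample calls are hypothetical.

```python
from math import sqrt

# Hypothetical labels mirroring the "Mutation Types Included" table.
INCLUDED_TYPES = {"Missense", "Nonsense", "Frameshift", "Splice site"}


def mutation_frequency(sample_calls, total_samples, z=1.96):
    """Percentage of samples with an included mutation, plus a Wilson CI.

    sample_calls: iterable of (sample_id, mutation_type) pairs for one gene.
    total_samples: number of sequenced samples in the cohort.
    Returns (percent, ci_low_percent, ci_high_percent).
    """
    # A sample counts once, however many qualifying mutations it carries.
    mutated = {sid for sid, mtype in sample_calls if mtype in INCLUDED_TYPES}
    p = len(mutated) / total_samples
    denom = 1 + z * z / total_samples
    center = (p + z * z / (2 * total_samples)) / denom
    half = z * sqrt(p * (1 - p) / total_samples
                    + z * z / (4 * total_samples**2)) / denom
    return 100 * p, 100 * (center - half), 100 * (center + half)


# Hypothetical calls: S1 has two qualifying mutations, S3 only a copy number
# event, which is excluded from the frequency.
calls = [
    ("S1", "Missense"),
    ("S1", "Nonsense"),
    ("S2", "Frameshift"),
    ("S3", "Amplification"),
]
pct, low, high = mutation_frequency(calls, total_samples=10)
print(f"{pct:.1f}% mutated (95% CI {low:.1f}% to {high:.1f}%)")
```

Counting distinct samples instead of mutation calls keeps a heavily mutated sample from inflating the rate.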
Survival Analysis
Examine how gene expression relates to patient outcomes.
When to Use
- Identifying prognostic biomarkers
- Validating therapeutic targets
- Understanding disease progression
Example Queries
Does high DDR1 expression predict worse survival in kidney cancer?
Is BRCA1 expression associated with overall survival in ovarian cancer?
Do patients with high MYC have worse prognosis in lymphoma?
Statistical Method
- Test: Log-rank test
- Metric: Hazard ratio (Cox regression)
- Stratification: Median expression split (high vs low)
- Visualization: Kaplan-Meier curves
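The median split and log-rank test listed above can be sketched as follows. This is a hand-rolled, standard-library illustration under stated assumptions, not BioQuery's implementation: the function names and patient data are hypothetical, samples exactly at the median are assigned to the low group, and the Cox hazard ratio is omitted because it needs an iterative fit.

```python
from math import erfc, sqrt
from statistics import median


def median_split(expression):
    """Label each sample 'high' or 'low' relative to the median expression."""
    cutoff = median(expression)
    return ["high" if x > cutoff else "low" for x in expression]


def logrank_test(times, events, groups):
    """Two-group log-rank test.

    times: follow-up time per patient; events: 1 = event, 0 = censored;
    groups: 'high' or 'low' per patient.
    Returns (chi-square statistic with 1 degree of freedom, p-value).
    """
    observed = expected = variance = 0.0
    for t in sorted({t for t, e in zip(times, events) if e}):
        at_risk = [(g, tt, e) for g, tt, e in zip(groups, times, events) if tt >= t]
        n = len(at_risk)
        n_high = sum(1 for g, _, _ in at_risk if g == "high")
        d = sum(1 for _, tt, e in at_risk if tt == t and e)
        d_high = sum(1 for g, tt, e in at_risk if tt == t and e and g == "high")
        observed += d_high
        expected += d * n_high / n
        if n > 1:
            variance += d * (n_high / n) * (1 - n_high / n) * (n - d) / (n - 1)
    if variance == 0:
        return 0.0, 1.0
    chi2 = (observed - expected) ** 2 / variance
    # Survival function of a chi-square distribution with 1 degree of freedom.
    return chi2, erfc(sqrt(chi2 / 2))


# Hypothetical cohort: expression, follow-up in months, event indicator.
expression = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
times = [30, 28, 25, 22, 10, 8, 6, 4]
events = [0, 1, 0, 1, 1, 1, 1, 1]

groups = median_split(expression)
chi2, p_value = logrank_test(times, events, groups)
print(f"log-rank chi2 = {chi2:.2f}, p = {p_value:.4f}")
```

The test compares the number of events observed in the high group with the number expected if both groups shared the same hazard at every event time.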
Survival Endpoints
| Endpoint | Description |
|---|---|
| Overall Survival (OS) | Time to death from any cause |
| Progression-Free Survival (PFS) | Time to disease progression or death |
| Disease-Specific Survival (DSS) | Time to death from cancer |
Interpreting Kaplan-Meier Curves
- Y-axis: Probability of survival
- X-axis: Time (usually months or years)
- Curves: One per group (high/low expression)
- Tick marks: Censored patients (lost to follow-up or still event-free at last contact)
- Shading: 95% confidence interval
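To make the curve elements above concrete, here is a minimal sketch of the Kaplan-Meier estimator that produces the step heights on the Y-axis. The function name and patient data are hypothetical, and the confidence band is left out for brevity.

```python
def kaplan_meier(times, events):
    """Kaplan-Meier survival estimate.

    times: follow-up time per patient; events: 1 = event, 0 = censored.
    Returns a list of (time, survival_probability) at each distinct event time.
    """
    curve = []
    survival = 1.0
    for t in sorted({t for t, e in zip(times, events) if e}):
        # Patients still under observation just before time t.
        n_at_risk = sum(1 for tt in times if tt >= t)
        deaths = sum(1 for tt, e in zip(times, events) if tt == t and e)
        survival *= 1 - deaths / n_at_risk
        curve.append((t, survival))
    return curve


# Hypothetical group of five patients (time in months).
times = [5, 8, 12, 12, 20]
events = [1, 0, 1, 1, 0]  # patients at 8 and 20 months are censored

for t, s in kaplan_meier(times, events):
    print(f"t = {t:>2} months: S(t) = {s:.3f}")
```

Censored patients never cause a drop in the curve; they only shrink the at-risk count for later event times, which is why they appear as tick marks and not as steps.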
Survival data has limitations: follow-up time varies, some patients are censored, and treatment effects are not accounted for.
Choosing the Right Analysis
| Question Type | Analysis | Example |
|---|---|---|
| "Is gene X higher in cancer A vs B?" | Differential Expression | "Is EGFR higher in LUAD vs LUSC?" |
| "Is gene X upregulated in cancer?" | Tumor vs Normal | "Is MYC upregulated in breast cancer?" |
| "How often is gene X mutated?" | Mutation Frequency | "What's the TP53 mutation rate in GBM?" |
| "Does gene X predict survival?" | Survival Analysis | "Does high DDR1 predict worse survival?" |
| "Gene X in cell lines?" | Cell Line Expression | "DDR1 in lung cancer cell lines" |
| "Gene X protein levels?" | Protein Expression | "TP53 protein in glioblastoma" |
Cell Line Expression (CCLE)
Analyze gene expression across ~1,000 cancer cell lines from the Cancer Cell Line Encyclopedia.
When to Use
- Studying gene expression in in vitro models
- Identifying cell lines for drug screening
- Comparing tumor vs cell line expression
- Pre-clinical target validation
Example Queries
What is DDR1 expression in lung cancer cell lines?
Compare TP53 expression across all CCLE cell lines
Which cell line sites have highest EGFR expression?
Data Details
- Expression: RMA-normalized microarray data
- Sites: ~30 primary sites (lung, breast, CNS, ovary, etc.)
- Cell Lines: ~1,000 characterized cell lines
- Visualization: Boxplot by primary site
CCLE expression uses RMA normalization (not TPM), so values are not directly comparable to TCGA RNA-seq data.
Protein Expression (CPTAC)
Analyze protein abundance from mass spectrometry-based proteomics.
When to Use
- Understanding post-transcriptional regulation
- Comparing mRNA vs protein levels
- Identifying proteins that don’t correlate with mRNA
- Proteomics-based biomarker discovery
Example Queries
What is TP53 protein abundance in glioblastoma?
Compare DDR1 protein levels across CPTAC cancers
Does DDR1 mRNA correlate with protein in breast cancer?
Available Cancer Types
| CPTAC Code | Cancer Type |
|---|---|
| CCRCC | Clear cell renal cell carcinoma |
| GBM | Glioblastoma |
| HNSCC | Head and neck squamous cell carcinoma |
| LSCC | Lung squamous cell carcinoma |
| LUAD | Lung adenocarcinoma |
| PDA | Pancreatic ductal adenocarcinoma |
| UCEC | Uterine corpus endometrial carcinoma |
| BRCA | Breast cancer |
| COAD | Colon adenocarcinoma |
| OV | Ovarian cancer |
Data Details
- Metric: Log2 ratio (sample vs reference)
- Method: TMT-labeled mass spectrometry
- Normalization: Median-centered log2 ratios
Protein abundance values are log2 ratios relative to a pooled reference, not absolute concentrations.
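The sketch below shows one plausible reading of "median-centered log2 ratios": each protein's intensity in a sample is divided by the pooled-reference intensity, log2-transformed, and then shifted so the sample's median ratio is zero. The exact CPTAC processing steps are not described here, so treat the function name, the per-sample centering, and the intensity values as assumptions for illustration only.

```python
from math import log2
from statistics import median


def median_centered_log2_ratios(sample_intensities, reference_intensities):
    """Log2(sample / reference) per protein, shifted so the median is zero.

    Both arguments map protein name -> positive intensity. Proteins missing
    from the reference are skipped.
    """
    ratios = {
        protein: log2(value / reference_intensities[protein])
        for protein, value in sample_intensities.items()
        if protein in reference_intensities
    }
    center = median(ratios.values())
    return {protein: ratio - center for protein, ratio in ratios.items()}


# Hypothetical reporter-ion intensities for three proteins.
sample = {"PROT_A": 400.0, "PROT_B": 200.0, "PROT_C": 100.0}
reference = {"PROT_A": 100.0, "PROT_B": 100.0, "PROT_C": 100.0}

for protein, value in median_centered_log2_ratios(sample, reference).items():
    print(f"{protein}: {value:+.2f}")
```

A centered value of +1 means the protein is about twice as abundant as the typical protein in that sample relative to the reference, which is why these values cannot be read as absolute concentrations.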
Combining Analyses
For comprehensive characterization of a gene, consider running multiple query types:
- Expression: Is the gene overexpressed in the cancer?
- Mutations: Is the gene frequently mutated?
- Survival: Does expression predict patient outcomes?
- Cell Lines: Is the gene expressed in relevant cell line models?
- Protein: Does protein abundance match mRNA expression?
This multi-angle approach provides stronger evidence for therapeutic relevance.