R Visualization Tutorial¶
R as a language provides the environment for statistical analysis and visualizations. The wide variety of packages and the ease of use and flexibility make it a powerful resource. In this tutorial, we will be exploring a biological dataset using the publicly available methods and packages for analysis and visualization.
Introduction¶
Obesity has been associated with an increased risk of several medical conditions, which can lead to further morbidity and mortality. Co-morbidities like diabetes, cardiovascular disorders, cancers, and numerous others have shown an increased prevalence with the rise of obesity hence it is important to study the impact of obesity on the incidence, progression, and severity of these diseases. According to National Cancer Institute, there is an increased risk of developing colorectal cancer by 30% for obese people compared to normal-weighed people.
Multiple papers explore the impact of obesity on disease initiation but don’t check its implications on disease progression or severity in human models. With the help of public data, downstream methods, and visualizations, we aim to explore the effect of obesity on colorectal cancer severity.
Disclaimer: This blog has been created as a tutorial for exploring biological data with the help of different downstream analysis tools and visualization methods. The blog hasn’t passed through a process of peer-review and doesn’t claim coherence with scientific knowledge in the domain nor represents a comprehensive review of the scientific knowledge in the domain.
Loading the required libraries¶
library(cmapR) ## for reading GCT file
library(data.table) ## for data wrangling
suppressMessages(library(dplyr)) ## for data wrangling
library(tidyr) ## for data wrangling
library(GSVA) ## for performing Gene set variation analysis
suppressMessages(library(ggfortify)) ## for plotting
library(ggplot2) ## for plotting
library(ggsci) ## for plotting
library(pheatmap) ## for generating heatmap
suppressMessages(library(fgsea)) ## for performing fgsea
Reading the dataset GCT file¶
Polly consists of processed and curated datasets from multiple public repositories and databases such as GEO, GTEx, Metabolomics Workbench, etc., on Polly Data Lakes. The user can readily search for datasets of interest or run an analysis with the state-of-the-art tools interactively and intuitively. We have used one such dataset from the paper titled Association of survival and disease progression with chromosomal instability: a genomic exploration of colorectal cancer from our GEO Data Lake. The raw data and raw counts, along with metadata, is publicly available on GEO at GSE41258.
We will be using cmapR to read the dataset available in the GCT format. For understanding the structure of a GCT file, please follow the link. Using cmapR we will parse GCT and separate the matrix and individual metadata components.
gct_obj <- cmapR::parse_gctx('GSE41258_GPL96_curated.gct')
counts_matrix <- as.data.frame(gct_obj@mat)
col_metadata <- gct_obj@cdesc
row_metadata <- gct_obj@rdesc
head(counts_matrix)
Getting the required sample to cohort mapping¶
We are interested in getting Sample to Cohort mapping for the samples in the dataset. The mapping will be useful for annotating samples in the PCA plot and separating primary tumors later on.
metadata <- col_metadata %>% select(geo_accession,tissue.ch1)
colnames(metadata) <- c('Sample','Cohort')
head(metadata)
PCA plot of the dataset¶
Principal Component Analysis (PCA) is a dimensionality reduction method that helps determine the critical variables in a high-dimensional dataset that elucidate the differences in the observations under investigation and essentially allow easy exploration, analysis, and visualizations. In our case, we want to summarize how the expression of genes varies under the cohort conditions in the dataset.
df <- bind_cols(metadata,as.data.frame(t(counts_matrix))) # Attaching Cohort information to the counts matrix
head(df)
#We will use R's prcomp funtion to calculate principle components to understand the variance in the data in context of different cohorts.
plotDims <- options(repr.plot.width=10, repr.plot.height=8)
pca_res <- prcomp(df[3:ncol(df)], scale. = TRUE) # The prcomp uses the counts data as an input and yields principle components.
pca_plot <- autoplot(pca_res, data = df, colour = 'Cohort') # we use autoplot to make PCA plot based on the first two components
pca_plot
options(plotDims)
From the PCA plot, we can observe that even though there is heterogeneity, many primary tumor samples are segregated from other samples.
Diving deep into primary tumor samples¶
For our exploration, we will be focusing only on primary tumors and thus we will be subsetting data accordingly.
primary_tumor_samples_vec <- metadata %>% filter(Cohort == 'Primary Tumor') %>% pull(Sample) # Pulls the samples ids which belong to primary tumor cohort
print(paste('Number of primary tumor samples are:',length(primary_tumor_samples_vec))) # number of primary tumor samples
primary_tumor_samples_df <- subset(counts_matrix,select = primary_tumor_samples_vec) %>% as.data.frame()
head(primary_tumor_samples_df)
Exploring the impact of the obesity assosciated signature for CRC severity¶
We identified a list of genes that have either been positively or negatively associated with Colorectal Cancer from in vivo mouse model of diet-induced obesity. The genes are human homologs of the mouse genes that have been taken from the paper.
We have split the list of genes into two groups:
- Obesity_up: Set of genes which are significantly upregulated (p-value < 0.05) in Obese vs Non-Obese comparison and having a logFC greater than 1
- Obesity_down: Set of genes which are significantly downregulated (p-value < 0.05) in Obese vs Non-Obese comparison and having a logFC less than -1
Further, we will use the Gene Set Variation Analysis (GSVA) method to study the variation of the gene signatures in the primary tumor samples.
obesity_up = scan('obesity_up_genes.txt', character(), quote = "")
obesity_down = scan('obesity_down_genes.txt', character(), quote = "")
obesity_signature_list <- list("Obesity_Up" = obesity_up, "Obesity_Down" = obesity_down)
obesity_signature_list
We are exploring the relevance of obesity-associated signature from in vivo mouse model of diet-induced obesity. The signature has been associated with an increased CRC development but hasn’t been associated with CRC severity. As a first step, we will explore the distribution of expression of these genes in our dataset.
gene_signature_vec <- c(obesity_up,obesity_down)
gene_signature_df <- subset(primary_tumor_samples_df, rownames(primary_tumor_samples_df) %in% gene_signature_vec)
gene_signature_df$Gene <- rownames(gene_signature_df)
gene_signature_df <- gene_signature_df %>% select(Gene,everything())
gene_signature_longform_df <- gather(gene_signature_df,Sample,Expression,2:ncol(gene_signature_df)) %>% arrange(Gene)
head(gene_signature_longform_df)
Expression distribution for every gene across samples¶
Violin Plot
The violin plot helps in visualizing the distribution of the data and its probability density. Below we see the distribution of the selected genes across the samples in the dataset:
plotDims <- options(repr.plot.width=15, repr.plot.height=10)
violin_plot <- ggplot(gene_signature_longform_df, aes(x=Gene, y=Expression,fill=Gene)) +
geom_violin(trim=FALSE) +
geom_boxplot(width=0.1, fill="white")
violin_plot
options(plotDims)
We shall further explore the distribution of the selected genes across all the samples using the density plot below.
Density Plot
plotDims <- options(repr.plot.width=15, repr.plot.height=10)
density_plot <- ggplot(data = gene_signature_longform_df, aes(x = Expression)) +
geom_density(fill = "orange") +
facet_wrap(~Gene)
density_plot
options(plotDims)
Correlogram
The correlogram or correlation matrix gives a quick overview of the dataset. It shows the relationship between each pair of numeric variables in the dataset. Below, we see a correlogram of the selected genes.
cormat <- gene_signature_df %>% select(-c(Gene)) %>% t() %>% cor()
melted_cormat <- reshape2::melt(cormat)
head(melted_cormat)
plotDims <- options(repr.plot.width=8, repr.plot.height=8)
corr_plot <- ggplot(data = melted_cormat, aes(x=Var1, y=Var2, fill=value)) +
geom_tile(color='white') +
scale_fill_gradient2(low = "blue", high = "red", mid = "white",
midpoint = 0, limit = c(-1,1), space = "Lab",
name="Pearson\nCorrelation") +
theme_minimal() +
theme(
axis.text.x = element_text(angle = 45, vjust = 1, size = 12, hjust = 1),
axis.text.y = element_text(size = 12),
axis.title.x = element_blank(),
axis.title.y = element_blank())
corr_plot
options(plotDims)
As shown in the density and the violin plot, the expression of the selected genes is highly variable. Additionally, with the many genes showing a strong correlation with each other in the correlogram, there is a probability of defining the subset of patients based on this gene signature.
GSVA¶
Gene Set Variation Analysis (GSVA) is a tool used to calculate enrichment of a gene signature in a set of samples, the variation analysis is independent of a control-based comparison, and hence can be used to study gene set variation in a large cohort. The GSVA function takes the data matrix and the signature lists as an input and gives output in a matrix with enrichment scores of the signature in each sample.
gsva_res <- gsva(as.matrix(primary_tumor_samples_df), obesity_signature_list, mx.diff=FALSE, verbose=FALSE, parallel.sz=1)
gsva_res #The matrix describing the enrichment scores.
Heatmap
Heatmap helps in visualizing the complex data in an easy to understand manner with values depicted by colors.
We will use pheatmap to plot the distribution of GSVA enrichment score across different samples and use the tree cut methods to identify the sub-classes of interest.
plotDims <- options(repr.plot.width=25, repr.plot.height=10)
obesity_heatmap <- pheatmap(gsva_res,cutree_cols = 4,cluster_rows=TRUE, show_rownames=TRUE,cluster_cols=TRUE,
border_color = NA, scale = "row",main="Obesity Up and Down",cex=1)
obesity_heatmap
options(plotDims)
As observed in the heatmap:
- 2nd tree cut represents subjects that have a lower association of obesity-associated signature as explained by high enrichment of Obesity_down and lower enrichment of Obesity_up signature.
- 3rd tree cut represents subjects that have a higher association of obesity-associated signature as explained by high enrichment of Obesity_up and lower enrichment of Obesity_down signature.
These groups represent ideal cases for the study of the impact of obesity on CRC severity, and we will be using the further downstream analysis to study the difference between these two groups of samples.
samples_category_list <- sort(cutree(obesity_heatmap$tree_col,k=4))
samples_category_list <- samples_category_list[colnames(gsva_res[,obesity_heatmap$tree_col[["order"]]])]
generate_severity_mapping_df <- function(sample_to_tree_list,tree_cut_number,severity_cohort){
severity_mapping_df <- stack(sample_to_tree_list[grep(tree_cut_number,sample_to_tree_list)]) %>% rename('Severity'=values,'Sample'=ind) %>% mutate_if(is.factor, as.character)
severity_mapping_df$Severity[severity_mapping_df$Severity == tree_cut_number] <- severity_cohort
return(severity_mapping_df)
}
# The cuts are organised in the order [2,4,3,1], Thus, to select 2nd and 3rd tree cut, we will be selecting 3 and 4
high_severity_patients_df <- generate_severity_mapping_df(samples_category_list,3,'high_severity')
print(paste('High Severity Patients are: ',nrow(high_severity_patients_df)))
low_severity_patients_df <- generate_severity_mapping_df(samples_category_list,4,'low_severity')
print(paste('Low Severity Patients are: ',nrow(low_severity_patients_df)))
all_patients_metadata_df <- bind_rows(list(high_severity_patients_df,low_severity_patients_df))
rownames(all_patients_metadata_df) <- all_patients_metadata_df$Sample
# Defining function to perform limma
performing_limma <- function(dataMatrix,metadata,cohortCol,cohortA,cohortB,pvalCorrectionMethod){
### Subsetting the metadata to have the samples that belong to the cohorts specified
metadata <- metadata[metadata[, cohortCol] %in% c(cohortA, cohortB),]
### Subsetting the counts matrix with samples that belong to the cohorts specified
dataMatrix <- dataMatrix[, rownames(metadata)]
dataMatrix <- dataMatrix[!apply(dataMatrix, 1, anyNA), ]
### Creating design matrix
condition <- metadata[,cohortCol]
condition[condition == cohortA] <- 'A'
condition[condition == cohortB] <- 'B'
design <- model.matrix(~ condition + 0)
colnames(design) <- gsub("condition", "", colnames(design))
### Creating contrast matrix
contrast_matrix <- limma::makeContrasts(contrasts = c(paste("B", "A", sep = "-")), levels = design)
### Fitting linear model to log2 normalised expression data
fit <- limma::lmFit(dataMatrix, design)
fit <- limma::contrasts.fit(fit, contrast_matrix)
fit <- limma::eBayes(fit)
limma_results_df <- limma::topTable(fit, coef = paste("B", "A", sep = "-"), number = nrow(dataMatrix))
### p value correction
for (method in pvalCorrectionMethod) {
if (method == "Bonferroni") {
p_val_correct_method <- "bonferroni"
}else {
p_val_correct_method <- "BH"
}
limma_results_df[, method] <- p.adjust(limma_results_df$P.Value, method = p_val_correct_method)
}
return(limma_results_df)
}
high_vs_low_severity_diff_exp <- performing_limma(counts_matrix,all_patients_metadata_df, 'Severity', 'low_severity', 'high_severity', 'BH')
head(high_vs_low_severity_diff_exp)
A volcano plot is a standard visualization method used for summarising the results of differential expression. The plot simultaneously assesses the statistical significance (p-values) and measure of effect or biological difference (log ratios), which separates genes differentially into two categories, significant and non-significant. This is based on significance and up-regulated and down-regulated based on the direction of effect (log fold changes).
plotting_volcano <- function(diff_exp, pvalCutoff = 0.05, logFCThreshold = 0.5){
diff_exp$significance <- 'Not Significant'
diff_exp[which(diff_exp$P.Value < pvalCutoff & abs(diff_exp$logFC) >= logFCThreshold),'significance'] <- 'Significant'
diff_exp <- diff_exp[order(diff_exp$P.Value), ]
volcano_plot <- ggplot(data=diff_exp, aes(x=logFC, y=-log10(P.Value), col=significance)) +
geom_point() +
xlab("log2 fold change") +
ylab("-log10 (p-value)") +
theme_minimal()
return(volcano_plot)
}
volcano_plot <- plotting_volcano(diff_exp = high_vs_low_severity_diff_exp)
volcano_plot
# defining the function to perform fgsea
performing_fgsea <- function(diff_exp, gmtfile, pvalcutoff, pvalcolumn,logFCcolumn, minSize, maxSize, nperm){
pathways <- gmtPathways(gmtfile)
diff_exp <- diff_exp[diff_exp[,pvalcolumn] < pvalcutoff,]
diff_exp <- diff_exp[order(-diff_exp[logFCcolumn]),]
ranks <- setNames(diff_exp[,logFCcolumn], rownames(diff_exp))
fgseaRes <- fgsea(pathways, ranks, minSize = minSize, maxSize=maxSize, nperm=nperm)
return(fgseaRes)
}
GSEA requires a set of genes to pathway mappings available in the form of a gmt file. For this tutorial, we have used the Gene Ontology Biological Processes gmt file from Broad Institute.
# performing gsea.
fgseaRes <- performing_fgsea(high_vs_low_severity_diff_exp,'c5.go.bp.v7.2.symbols.gmt',
pvalcutoff = 0.05,pvalcolumn = 'P.Value',logFCcolumn = 'logFC',minSize = 15,maxSize = 500,nperm = 500)
head(fgseaRes,2)
dim(fgseaRes)
Selecting the top 10 negatively and positively enriched significant pathways.
topPathwaysUp <- fgseaRes[NES >= 0 & pval < 0.05][head(order(pval), n=10)] # Top 10 up pathways with pval less than 0.05
topPathwaysDown <- fgseaRes[NES < 0 & pval < 0.05][head(order(pval), n=10)]# Top 10 down pathways with pval less than 0.05
topPathways <- rbind(topPathwaysUp,topPathwaysDown)
head(topPathways,2)
dim(topPathways)
We shall visualize the top 10 positively and negatively enriched pathways
# Summarising results using barplot.
topPathways$pathway <- factor(topPathways$pathway, levels = topPathways$pathway[order(topPathways$NES)])
plotDims <- options(repr.plot.width=12, repr.plot.height=8)
fgsea_viz <- ggplot(topPathways, aes(x = pathway, y = NES)) +
geom_col(aes(fill = NES), width = 0.7) +
labs(x = 'Pathways', y = 'Normalized enrichment score') +
theme_minimal() +
theme(
axis.text.x = element_text(size = 12),
axis.text.y = element_text(size = 12)) +
coord_flip()
fgsea_viz
options(plotDims)