Deciphering the Impact of Obesity on Colorectal Cancer Severity

R Visualization Tutorial¶

R as a language provides the environment for statistical analysis and visualizations. The wide variety of packages and the ease of use and flexibility make it a powerful resource. In this tutorial, we will be exploring a biological dataset using the publicly available methods and packages for analysis and visualization.

Introduction¶

Obesity has been associated with an increased risk of several medical conditions, which can lead to further morbidity and mortality. Co-morbidities like diabetes, cardiovascular disorders, cancers, and numerous others have shown an increased prevalence with the rise of obesity hence it is important to study the impact of obesity on the incidence, progression, and severity of these diseases. According to National Cancer Institute, there is an increased risk of developing colorectal cancer by 30% for obese people compared to normal-weighed people.

Multiple papers explore the impact of obesity on disease initiation but don't check its implications on disease progression or severity in human models. With the help of public data, downstream methods, and visualizations, we aim to explore the effect of obesity on colorectal cancer severity.

Disclaimer: This blog has been created as a tutorial for exploring biological data with the help of different downstream analysis tools and visualization methods. The blog hasn't passed through a process of peer-review and doesn't claim coherence with scientific knowledge in the domain nor represents a comprehensive review of the scientific knowledge in the domain.

Loading the Required Libraries¶

In [4]:

library(cmapR) ## for reading GCT file
library(data.table) ## for data wrangling
suppressMessages(library(dplyr)) ## for data wrangling
library(tidyr) ## for data wrangling
library(GSVA) ## for performing Gene set variation analysis
suppressMessages(library(ggfortify)) ## for plotting
library(ggplot2) ## for plotting
library(ggsci) ## for plotting
library(pheatmap) ## for generating heatmap
suppressMessages(library(fgsea)) ## for performing fgsea

Reading the Dataset GCT File¶

Polly consists of processed and curated datasets from multiple public repositories and databases such as GEO, GTEx, Metabolomics Workbench, etc., on Polly Data Lakes. The user can readily search for datasets of interest or run an analysis with the state-of-the-art tools interactively and intuitively. We have used one such dataset from the paper titled Association of survival and disease progression with chromosomal instability: a genomic exploration of colorectal cancer from our GEO Data Lake.The raw data and raw counts, along with metadata, is publicly available on GEO at GSE41258.

We will be using cmapR to read the dataset available in the GCT format. For understanding the structure of a GCT file, please follow the link. Using cmapR we will parse GCT and separate the matrix and individual metadata components.

In [5]:

gct_obj <- cmapR::parse_gctx('GSE41258_GPL96_curated.gct')
counts_matrix <- as.data.frame(gct_obj@mat)
col_metadata <- gct_obj@cdesc
row_metadata <- gct_obj@rdesc
head(counts_matrix)

parsing as GCT v1.3

GSE41258_GPL96_curated.gct 12414 rows, 390 cols, 0 row descriptors, 65 col descriptors

A data.frame: 6 × 390 GSM1012278GSM1012279GSM1012280GSM1012281GSM1012282GSM1012283GSM1012284GSM1012285GSM1012286GSM1012287⋯GSM1012658GSM1012659GSM1012660GSM1012661GSM1012662GSM1012663GSM1012664GSM1012665GSM1012666GSM1012667 <dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl>⋯<dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl> ZNF6740.22650.22651.11770.22650.25100.25100.23880.22650.44360.2510⋯0.2141 0.26300.2510 0.25100.22650.26300.22650.22650.26300.1763 PITPNM30.18900.20160.23880.23880.21410.23880.22650.20160.22650.2388⋯0.2016 0.23880.2388 0.23880.37850.22650.28690.20160.22650.1635 CFAP430.16350.21410.22650.22650.20160.21410.18900.17630.20160.2016⋯0.6690 0.21410.2016 0.21411.52612.11771.58010.33341.11100.7740 SOX11.94860.32190.34480.33340.33340.35610.35610.41140.34480.3448⋯0.3103 0.35610.3896 0.36740.35610.35611.22030.32191.16990.2630 UGT2B40.47510.46470.49570.47510.53610.52610.46470.50590.54602.6112⋯0.4647-0.46590.5361-0.22100.48540.45420.38960.46470.40050.4005 SNTG20.42221.98550.40050.41142.38400.41141.00720.67810.44360.4114⋯1.7655 0.43300.4222 1.49572.06000.40053.19850.38961.48540.3103

Getting the Required Sample to Cohort Mapping¶

We are interested in getting Sample to Cohort mapping for the samples in the dataset. The mapping will be useful for annotating samples in the PCA plot and separating primary tumors later on.

In [6]: metadata <- col_metadata %>% select(geo_accession,tissue.ch1)
colnames(metadata) <- c('Sample','Cohort')
head(metadata)

A data.frame: 6 × 2

SampleCohort <chr><chr> GSM1012278GSM1012278Polyp GSM1012279GSM1012279Liver Metastasis GSM1012280GSM1012280Liver Metastasis GSM1012281GSM1012281Liver Metastasis GSM1012282GSM1012282Liver Metastasis GSM1012283GSM1012283Liver Metastasis

PCA Plot of the Dataset¶

Principal Component Analysis (PCA) is a dimensionality reduction method that helps determine the critical variables in a high-dimensional dataset that elucidate the differences in the observations under investigation and essentially allow easy exploration, analysis, and visualizations. In our case, we want to summarize how the expression of genes varies under the cohort conditions in the dataset.

In [7]:

df <- bind_cols(metadata,as.data.frame(t(counts_matrix))) # Attaching Cohort information to the counts matrix
head(df)

A data.frame: 6 × 12416 SampleCohortZNF674PITPNM3CFAP43SOX1UGT2B4SNTG2RPL37APPEF2⋯RPS4Y1SOSTDC1SPINK4SIFCGBPEIF1AYMUC2XISTNXPE4HLA-DQA1 <chr><chr><dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl>⋯<dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl> 1GSM1012278Polyp 0.22650.18900.16351.94860.47510.422212.50681.8156⋯ 7.40943.226510.65286.727912.25864.263010.96588.88267.28540.1506 2GSM1012279Liver Metastasis0.22650.20160.21410.32190.46471.985512.43830.4222⋯10.86880.8237 3.93553.3785 5.65256.9069 7.07682.82176.83290.7225 3GSM1012280Liver Metastasis1.11770.23880.22650.34480.49570.400512.43050.4222⋯10.54110.4647 3.37852.4983 5.56686.3309 6.74152.37857.25740.2987 4GSM1012281Liver Metastasis0.22650.23880.22650.33340.47510.411412.42261.1827⋯10.20460.5160 3.35052.2296 5.46275.4162 6.85802.80747.12930.3561 5GSM1012282Liver Metastasis0.25100.21410.20160.33340.53612.384012.60731.3277⋯10.90691.1309 3.62062.7247 3.93559.4696 4.88263.85806.71423.2434 6GSM1012283Liver Metastasis0.25100.23880.21410.35610.52610.411412.40411.8560⋯ 9.18491.0426 3.07551.1177 3.56076.9189 5.28546.14775.85800.9411

In [8]:

#We will use R's prcomp funtion to calculate principle components to understand the variance in the data in context of different cohorts.
plotDims <- options(repr.plot.width=10, repr.plot.height=8)
pca_res <- prcomp(df[3:ncol(df)], scale. = TRUE) # The prcomp uses the counts data as an input and yields principle components.
pca_plot <- autoplot(pca_res, data = df, colour = 'Cohort') # we use autoplot to make PCA plot based on the first two components
pca_plot
options(plotDims)

From the PCA plot, we can observe that even though there is heterogeneity, many primary tumor samples are segregated from other samples.

Diving Deep into Primary Tumor Samples¶

For our exploration, we will be focusing only on primary tumors and thus we will be subsetting data accordingly.

In [9]:

primary_tumor_samples_vec <- metadata %>% filter(Cohort == 'Primary Tumor') %>% pull(Sample) # Pulls the samples ids which belong to primary tumor cohort
print(paste('Number of primary tumor samples are:',length(primary_tumor_samples_vec))) # number of primary tumor samples

[1] "Number of primary tumor samples are: 186"

In [10]:

primary_tumor_samples_df <- subset(counts_matrix,select = primary_tumor_samples_vec) %>% as.data.frame()
head(primary_tumor_samples_df)

A data.frame: 6 × 186 GSM1012286GSM1012297GSM1012303GSM1012305GSM1012307GSM1012308GSM1012310GSM1012312GSM1012314GSM1012316⋯GSM1012627GSM1012629GSM1012631GSM1012634GSM1012635GSM1012639GSM1012643GSM1012645GSM1012647GSM1012651 <dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl>⋯<dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl> ZNF6740.44360.23880.25100.21410.27500.26300.21410.22650.25100.2388⋯0.25100.25100.31030.21410.2510 1.0144 0.20160.2388 0.25100.2510 PITPNM30.22650.21410.23880.20160.23880.25100.20160.21410.23880.2016⋯0.23880.21410.18900.17630.2265 0.2750 0.21411.0496 0.23880.2388 CFAP430.20160.17630.20160.17630.21410.21410.18900.20160.20160.1763⋯0.20160.20160.70490.12430.2016 0.2510 0.44361.4168 0.22650.2141 SOX10.34480.33340.35610.31030.34480.33340.32190.34480.33340.5460⋯0.34480.32190.35612.28980.3334 0.2388 0.33340.3103 2.08070.3561 UGT2B40.54600.53610.46470.47510.49570.54600.41140.54600.53610.5656⋯0.54600.49570.40050.41140.5656-1.5105-1.05590.505911.29920.5850 SNTG20.44360.45420.42220.43301.24490.42221.72251.89142.34772.0704⋯0.87970.41141.93731.58500.4436 0.3103 0.95610.4114 0.45420.4647

Exploring the Impact of the Obesity Assosciated Signature for CRC Severity¶

We identified a list of genes that have either been positively or negatively associated with Colorectal Cancer from in vivo mouse model of diet-induced obesity. The genes are human homologs of the mouse genes that have been taken from the paper.

We have split the list of genes into two groups:

Obesity_up: Set of genes which are significantly upregulated (p-value < 0.05) in Obese vs Non-Obese comparison and having a logFC greater than 1
Obesity_down: Set of genes which are significantly downregulated (p-value < 0.05) in Obese vs Non-Obese comparison and having a logFC less than -1

Further, we will use the Gene Set Variation Analysis (GSVA) method to study the variation of the gene signatures in the primary tumor samples.

In [11]:

obesity_up = scan('obesity_up_genes.txt', character(), quote = "")
obesity_down = scan('obesity_down_genes.txt', character(), quote = "")
obesity_signature_list <- list("Obesity_Up" = obesity_up, "Obesity_Down" = obesity_down)
obesity_signature_list

$Obesity_Up

'SLPI'
'RPL23A'
'FABP6'
'SLC30A10'
'THBD'
'TNXB'
'LYZ'

$Obesity_Down

'SCD'
'ALB'
'SLC2A4'
'CDK14'

We are exploring the relevance of obesity-associated signature from in vivo mouse model of diet-induced obesity. The signature has been associated with an increased CRC development but hasn't been associated with CRC severity. As a first step, we will explore the distribution of expression of these genes in our dataset.

In [12]:

gene_signature_vec <- c(obesity_up,obesity_down)

In [13]:

gene_signature_df <- subset(primary_tumor_samples_df, rownames(primary_tumor_samples_df) %in% gene_signature_vec)
gene_signature_df$Gene <- rownames(gene_signature_df)
gene_signature_df <- gene_signature_df %>% select(Gene,everything())
gene_signature_longform_df <- gather(gene_signature_df,Sample,Expression,2:ncol(gene_signature_df)) %>% arrange(Gene)
head(gene_signature_longform_df)

A data.frame: 6 × 3 GeneSampleExpression <chr><chr><dbl> 1ALBGSM10122865.2403 2ALBGSM10122975.2095 3ALBGSM10123034.9542 4ALBGSM10123054.4396 5ALBGSM10123074.7495 6ALBGSM10123084.7602

Expression Distribution for Every Gene Across Samples¶

Violin Plot

‍
The violin plot helps in visualizing the distribution of the data and its probability density. Below we see the distribution of the selected genes across the samples in the dataset:

In [14]:

plotDims <- options(repr.plot.width=15, repr.plot.height=10)
violin_plot <- ggplot(gene_signature_longform_df, aes(x=Gene, y=Expression,fill=Gene)) +
geom_violin(trim=FALSE) +
geom_boxplot(width=0.1, fill="white")
violin_plot
options(plotDims)

‍

We shall further explore the distribution of the selected genes across all the samples using the density plot below.
‍

‍
Density Plot

In [15]:

plotDims <- options(repr.plot.width=15, repr.plot.height=10)
density_plot <- ggplot(data = gene_signature_longform_df, aes(x = Expression)) +
geom_density(fill = "orange") +
facet_wrap(~Gene)
density_plot
options(plotDims)

‍

Correlogram

‍
The correlogram or correlation matrix gives a quick overview of the dataset. It shows the relationship between each pair of numeric variables in the dataset. Below, we see a correlogram of the selected genes.

In [16]:

cormat <- gene_signature_df %>% select(-c(Gene)) %>% t() %>% cor()
melted_cormat <- reshape2::melt(cormat)
head(melted_cormat)

A data.frame: 6 × 3 Var1Var2value <fct><fct><dbl> 1RPL23A RPL23A 1.00000000 2SLC2A4 RPL23A-0.08284083 3ALB RPL23A-0.15836266 4SLC30A10RPL23A-0.14232357 5TNXB RPL23A 0.03109168 6THBD RPL23A-0.05700783

In [17]:

plotDims <- options(repr.plot.width=8, repr.plot.height=8)
corr_plot <- ggplot(data = melted_cormat, aes(x=Var1, y=Var2, fill=value)) +
geom_tile(color='white') +
scale_fill_gradient2(low = "blue", high = "red", mid = "white",
midpoint = 0, limit = c(-1,1), space = "Lab",
name="Pearson\nCorrelation") +
theme_minimal() +
theme(
axis.text.x = element_text(angle = 45, vjust = 1, size = 12, hjust = 1),
axis.text.y = element_text(size = 12),
axis.title.x = element_blank(),
axis.title.y = element_blank())
corr_plot
options(plotDims)

‍

As shown in the density and the violin plot, the expression of the selected genes is highly variable. Additionally, with the many genes showing a strong correlation with each other in the correlogram, there is a probability of defining the subset of patients based on this gene signature.

GSVA¶

Gene Set Variation Analysis (GSVA) is a tool used to calculate enrichment of a gene signature in a set of samples, the variation analysis is independent of a control-based comparison, and hence can be used to study gene set variation in a large cohort. The GSVA function takes the data matrix and the signature lists as an input and gives output in a matrix with enrichment scores of the signature in each sample.

In [18]:

gsva_res <- gsva(as.matrix(primary_tumor_samples_df), obesity_signature_list, mx.diff=FALSE, verbose=FALSE, parallel.sz=1)

In [19]:

gsva_res #The matrix describing the enrichment scores.

A matrix: 2 × 186 of type dbl GSM1012286GSM1012297GSM1012303GSM1012305GSM1012307GSM1012308GSM1012310GSM1012312GSM1012314GSM1012316⋯GSM1012627GSM1012629GSM1012631GSM1012634GSM1012635GSM1012639GSM1012643GSM1012645GSM1012647GSM1012651 Obesity_Up0.49056460.2974964-0.3967182 0.4794008 0.2452058 0.2349806-0.3639420 0.51328620.54026540.5766597⋯ 0.3710006-0.2969346-0.3023370-0.32500990.3867863 0.6130981-0.3423690-0.4976638-0.30636430.3438808 Obesity_Down0.37563190.6119070-0.3877518-0.8699893-0.5493138-0.3541175-0.2502015-0.57466920.49479170.3390359⋯-0.2829170 0.4042879-0.4600481 0.67374020.3052455-0.5747815-0.6749073 0.3004835 0.62861100.4566514

Heatmap

‍
Heatmap helps in visualizing the complex data in an easy to understand manner with values depicted by colors.

We will use pheatmap to plot the distribution of GSVA enrichment score across different samples and use the tree cut methods to identify the sub-classes of interest.

In [20]:

plotDims <- options(repr.plot.width=25, repr.plot.height=10)
obesity_heatmap <- pheatmap(gsva_res,cutree_cols = 4,cluster_rows=TRUE, show_rownames=TRUE,cluster_cols=TRUE,
border_color = NA, scale = "row",main="Obesity Up and Down",cex=1)
obesity_heatmap
options(plotDims)

‍

As observed in the heatmap:

2nd tree cut represents subjects that have a lower association of obesity-associated signature as explained by high enrichment of Obesity_down and lower enrichment of Obesity_up signature.
3rd tree cut represents subjects that have a higher association of obesity-associated signature as explained by high enrichment of Obesity_up and lower enrichment of Obesity_down signature.

These groups represent ideal cases for the study of the impact of obesity on CRC severity, and we will be using the further downstream analysis to study the difference between these two groups of samples.

In [21]:

samples_category_list <- sort(cutree(obesity_heatmap$tree_col,k=4))
samples_category_list <- samples_category_list[colnames(gsva_res[,obesity_heatmap$tree_col[["order"]]])]

In [22]:

generate_severity_mapping_df <- function(sample_to_tree_list,tree_cut_number,severity_cohort){
severity_mapping_df <- stack(sample_to_tree_list[grep(tree_cut_number,sample_to_tree_list)]) %>% rename('Severity'=values,'Sample'=ind) %>% mutate_if(is.factor, as.character)
severity_mapping_df$Severity[severity_mapping_df$Severity == tree_cut_number] <- severity_cohort
return(severity_mapping_df)
}

In [23]:

# The cuts are organised in the order [2,4,3,1], Thus, to select 2nd and 3rd tree cut, we will be selecting 3 and 4
high_severity_patients_df <- generate_severity_mapping_df(samples_category_list,3,'high_severity')
print(paste('High Severity Patients are: ',nrow(high_severity_patients_df)))
low_severity_patients_df <- generate_severity_mapping_df(samples_category_list,4,'low_severity')
print(paste('Low Severity Patients are: ',nrow(low_severity_patients_df)))

[1] "High Severity Patients are: 41"
[1] "Low Severity Patients are: 50"

In [24]:

all_patients_metadata_df <- bind_rows(list(high_severity_patients_df,low_severity_patients_df))
rownames(all_patients_metadata_df) <- all_patients_metadata_df$Sample

Differential Expression¶

Differential Expression is a statistical method used to compare the difference in expression of genes between two biological conditions or cohorts. We will be using limma to do differential expression.

In [25]:

# Defining function to perform limma
performing_limma <- function(dataMatrix,metadata,cohortCol,cohortA,cohortB,pvalCorrectionMethod){
### Subsetting the metadata to have the samples that belong to the cohorts specified
metadata <- metadata[metadata[, cohortCol] %in% c(cohortA, cohortB),]

### Subsetting the counts matrix with samples that belong to the cohorts specified
dataMatrix <- dataMatrix[, rownames(metadata)]
dataMatrix <- dataMatrix[!apply(dataMatrix, 1, anyNA), ]

### Creating design matrix
condition <- metadata[,cohortCol]
condition[condition == cohortA] <- 'A'
condition[condition == cohortB] <- 'B'
design <- model.matrix(~ condition + 0)
colnames(design) <- gsub("condition", "", colnames(design))

### Creating contrast matrix
contrast_matrix <- limma::makeContrasts(contrasts = c(paste("B", "A", sep = "-")), levels = design)

### Fitting linear model to log2 normalised expression data
fit <- limma::lmFit(dataMatrix, design)
fit <- limma::contrasts.fit(fit, contrast_matrix)
fit <- limma::eBayes(fit)
limma_results_df <- limma::topTable(fit, coef = paste("B", "A", sep = "-"), number = nrow(dataMatrix))

### p value correction
for (method in pvalCorrectionMethod) {
if (method == "Bonferroni") {
p_val_correct_method <- "bonferroni"
}else {
p_val_correct_method <- "BH"
}
limma_results_df[, method] <- p.adjust(limma_results_df$P.Value, method = p_val_correct_method)
}
return(limma_results_df)
}

In [26]:

high_vs_low_severity_diff_exp <- performing_limma(counts_matrix,all_patients_metadata_df, 'Severity', 'low_severity', 'high_severity', 'BH')
head(high_vs_low_severity_diff_exp)

A data.frame: 6 × 7 logFCAveExprtP.Valueadj.P.ValBBH <dbl><dbl><dbl><dbl><dbl><dbl><dbl> ZNF821-0.25552675.908129-6.8712727.250537e-109.000817e-0611.9260819.000817e-06 REXO4-0.23532626.521310-5.8735656.635191e-084.118463e-04 7.8160434.118463e-04 CHRNA6-0.26922936.323099-5.7363781.208910e-075.002471e-04 7.2701875.002471e-04 CXCR1-0.25274136.608862-5.6179382.019418e-076.267263e-04 6.8034696.267263e-04 LMOD1-0.36463776.482519-5.5360412.871550e-077.129485e-04 6.4833367.129485e-04 ICOSLG-0.26594465.608609-5.4615043.948119e-078.168658e-04 6.1938868.168658e-04

A volcano plot is a standard visualization method used for summarising the results of differential expression. The plot simultaneously assesses the statistical significance (p-values) and measure of effect or biological difference (log ratios), which separates genes differentially into two categories, significant and non-significant. This is based on significance and up-regulated and down-regulated based on the direction of effect (log fold changes).

In [27]:

plotting_volcano <- function(diff_exp, pvalCutoff = 0.05, logFCThreshold = 0.5){
diff_exp$significance <- 'Not Significant'
diff_exp[which(diff_exp$P.Value < pvalCutoff & abs(diff_exp$logFC) >= logFCThreshold),'significance'] <- 'Significant'
diff_exp <- diff_exp[order(diff_exp$P.Value), ]

volcano_plot <- ggplot(data=diff_exp, aes(x=logFC, y=-log10(P.Value), col=significance)) +
geom_point() +
xlab("log2 fold change") +
ylab("-log10 (p-value)") +
theme_minimal()

return(volcano_plot)
}

In [28]:

volcano_plot <- plotting_volcano(diff_exp = high_vs_low_severity_diff_exp)
volcano_plot

Pathway Enrichment using GSEA¶

Gene Set Enrichment Analysis (GSEA) is used to interpret the functional profile of the gene set so as to get an insight about the underlying biological processes.

We use the fgsea package available in R to perform GSEA analysis.

In [29]:

# defining the function to perform fgsea
performing_fgsea <- function(diff_exp, gmtfile, pvalcutoff, pvalcolumn,logFCcolumn, minSize, maxSize, nperm){
pathways <- gmtPathways(gmtfile)
diff_exp <- diff_exp[diff_exp[,pvalcolumn] < pvalcutoff,]
diff_exp <- diff_exp[order(-diff_exp[logFCcolumn]),]
ranks <- setNames(diff_exp[,logFCcolumn], rownames(diff_exp))
fgseaRes <- fgsea(pathways, ranks, minSize = minSize, maxSize=maxSize, nperm=nperm)
return(fgseaRes)
}

GSEA requires a set of genes to pathway mappings available in the form of a gmt file. For this tutorial, we have used the Gene Ontology Biological Processes gmt file from Broad Institute.

In [30]:

# performing gsea.
fgseaRes <- performing_fgsea(high_vs_low_severity_diff_exp,'c5.go.bp.v7.2.symbols.gmt',
pvalcutoff = 0.05,pvalcolumn = 'P.Value',logFCcolumn = 'logFC',minSize = 15,maxSize = 500,nperm = 500)
head(fgseaRes,2)
dim(fgseaRes)

A data.table: 2 × 8 pathwaypvalpadjESNESnMoreExtremesizeleadingEdge <chr><dbl><dbl><dbl><dbl><dbl><int><list> GO_REPRODUCTION 0.29518070.59075290.2266451.1171642146242EDNRB , CFTR , SLC26A3, BMP4 , ADAM28 , TCF21 , HSD17B2, HSD11B2, PSG6 , AKR1C3 , PI3 , STK4 , HPGD , DEFB1 , BCL2L1 , TSPAN8 , WNT5A , KRT19 , CCR6 , TLR3 , PBX1 , PDGFRA , MSX2 , SIX5 , CEBPA , HSPA2 , SOX9 , NUDT1 , LGALS9 , EPAS1 , PTCH1 , LEF1 , SPINT1 , EGFR , ESR1 , CLIC5 , CNR1 GO_REGULATION_OF_DNA_RECOMBINATION0.71229050.87314900.2578820.8006788254 16ACTR2 , PRDM9 , KMT5B , RAD50 , ERCC2 , STAT6 , SETD2 , FOXP3 , SIRT6 , CD40 , TBX21 , MSH6 , MMS19 , BLM , TIMELESS, WRAP53

1509
8

Selecting the top 10 negatively and positively enriched significant pathways.

In [31]:

topPathwaysUp <- fgseaRes[NES >= 0 & pval < 0.05][head(order(pval), n=10)] # Top 10 up pathways with pval less than 0.05
topPathwaysDown <- fgseaRes[NES < 0 & pval < 0.05][head(order(pval), n=10)]# Top 10 down pathways with pval less than 0.05
topPathways <- rbind(topPathwaysUp,topPathwaysDown)
head(topPathways,2)
dim(topPathways)

A data.table: 2 × 8 pathwaypvalpadjESNESnMoreExtremesizeleadingEdge <chr><dbl><dbl><dbl><dbl><dbl><int><list> GO_RESPONSE_TO_BIOTIC_STIMULUS0.0019960080.093831610.39235001.9620710271HLA-DRB4, CXCL14 , EDNRB , MUC2 , JCHAIN , ZG16 , DMBT1 , PTK6 , SLPI , IGLC1 , CEACAM1 , TRIM31 , ISG15 , IFI27 , MNDA , CR2 , IGKC , IFIT1 , PLAC8 , PI3 , VNN1 , CD55 , RSAD2 , DDX60 , IFI6 , OAS1 , HPGD , MX1 , DEFB1 , MUC13 , TRIM22 , BANK1 , BCL2L1 , WNT5A , PSMB8 , CLDN1 , SERINC3 , LYZ , RTP4 , PYCARD , TLR3 , PRSS2 , OAS2 , PLSCR1 , PRSS3 , PSMB9 , TP53 , MX2 , IFI35 , HLA-F , CD24 , MFAP4 , IFIH1 , PTGER2 , CCL11 , HLA-B , LGALS9 , APOBEC1 , APPL2 , ZBP1 , IL22RA1 , CXCL1 , TXNIP , ERAP1 , CNR1 , ISG20 , DDX58 , EIF2AK2 , CORO1A , USP18 , OAS3 , NMI , ZC3HAV1 , EPHA2 , A2M , HNMT , ACTR2 , GNG12 , RIOK3 , HERC6 , PSMB10 , CAPN2 , KIF16B , LITAF , HLA-G , LGALS3 , KRT8 , IRF9 , HLA-A , CCR7 , ZNF175 , ELF4 , TRIM14 , TIGAR , CDHR2 , HLA-E , MST1R , TRIM38 GO_IMMUNE_EFFECTOR_PROCESS 0.0020000000.093831610.40899632.0092930222OLFM4 , CEACAM6 , GCNT3 , DMBT1 , SLPI , IGLC1 , CEACAM1 , CDH17 , AOC1 , CXCR2 , ISG15 , IGLV1-44, IFI27 , MNDA , CR2 , IGKC , CEACAM3 , IFIT1 , PLAC8 , VNN1 , CD55 , RSAD2 , DDX60 , PTPRN2 , IFI6 , OAS1 , MX1 , TRIM22 , BCL2L1 , FOXF1 , WNT5A , TCN1 , SERINC3 , LYZ , RTP4 , CCR6 , PYCARD , TLR3 , PRSS2 , OAS2 , PLSCR1 , PRSS3 , TP53 , IL5 , MX2 , HLA-F , MFAP4 , IFIH1 , CLC , HLA-B , LGALS9 , APOBEC1 , APPL2 , ATP8A1 , ZBP1 , CXCL1 , LEF1 , CTSA , ISG20 , DDX58 , CTSS , BTN3A3 , EIF2AK2 , CORO1A , RHOF , OAS3 , ZC3HAV1 , A2M , JUP , ACTR2 , MVP

We shall visualize the top 10 positively and negatively enriched pathways

In [32]:

# Summarising results using barplot.
topPathways$pathway <- factor(topPathways$pathway, levels = topPathways$pathway[order(topPathways$NES)])
plotDims <- options(repr.plot.width=12, repr.plot.height=8)
fgsea_viz <- ggplot(topPathways, aes(x = pathway, y = NES)) +
geom_col(aes(fill = NES), width = 0.7) +
labs(x = 'Pathways', y = 'Normalized enrichment score') +
theme_minimal() +
theme(
axis.text.x = element_text(size = 12),
axis.text.y = element_text(size = 12)) +
coord_flip()
fgsea_viz
options(plotDims)

Results¶

Predictions based on this analysis:

As shown by the barplot above, obesity-related signature has been associated with pathways such as

Innate immune response
Response to cytokines
Regulation of immune system process
Regulation of cell population proliferation

These pathways are related to inflammation and increased proliferation, which are hallmarks of cancers. Their increased enrichment indicates increased cancer severity. Further, we identified the relevance of these signatures in patients' transcriptomic samples, validating the original paper's findings in human colorectal cancer subjects.