Elucidata Logo Elucidata Logo
  • Platform
  • Solutions
    • By Use Case:
    • Build & Bring your Workflows
    • Scale Exploratory Bioinformatics
    • Discover Assets
    • By Industry
  • Resources
  • Company
    • About Us
    • Careers
  • Community
    • Forum
    • Slack
    • Blog
Login
WATCH EVENT

Categories

  • Big Data
  • Bioinformatics
  • Data Science
  • Engineering
  • Polly

Tags

  • obesity
Share
Blog > Deciphering the impact of obesity on colorectal cancer severity

Deciphering the impact of obesity on colorectal cancer severity

Maheswari K., Ayush Praveen, Manmeet S. Dayal
[FINAL] Tutorial with GSE41258_GPL96

R Visualization Tutorial¶

R as a language provides the environment for statistical analysis and visualizations. The wide variety of packages and the ease of use and flexibility make it a powerful resource. In this tutorial, we will be exploring a biological dataset using the publicly available methods and packages for analysis and visualization.

Introduction¶

Obesity has been associated with an increased risk of several medical conditions, which can lead to further morbidity and mortality. Co-morbidities like diabetes, cardiovascular disorders, cancers, and numerous others have shown an increased prevalence with the rise of obesity hence it is important to study the impact of obesity on the incidence, progression, and severity of these diseases. According to National Cancer Institute, there is an increased risk of developing colorectal cancer by 30% for obese people compared to normal-weighed people.

Multiple papers explore the impact of obesity on disease initiation but don’t check its implications on disease progression or severity in human models. With the help of public data, downstream methods, and visualizations, we aim to explore the effect of obesity on colorectal cancer severity.

Disclaimer: This blog has been created as a tutorial for exploring biological data with the help of different downstream analysis tools and visualization methods. The blog hasn’t passed through a process of peer-review and doesn’t claim coherence with scientific knowledge in the domain nor represents a comprehensive review of the scientific knowledge in the domain.

Loading the required libraries¶

In [4]:
library(cmapR) ## for reading GCT file
library(data.table) ## for data wrangling
suppressMessages(library(dplyr)) ## for data wrangling
library(tidyr) ## for data wrangling
library(GSVA) ## for performing Gene set variation analysis
suppressMessages(library(ggfortify)) ## for plotting
library(ggplot2) ## for plotting
library(ggsci) ## for plotting
library(pheatmap) ## for generating heatmap
suppressMessages(library(fgsea)) ## for performing fgsea

Reading the dataset GCT file¶

Polly consists of processed and curated datasets from multiple public repositories and databases such as GEO, GTEx, Metabolomics Workbench, etc., on Polly Data Lakes. The user can readily search for datasets of interest or run an analysis with the state-of-the-art tools interactively and intuitively. We have used one such dataset from the paper titled Association of survival and disease progression with chromosomal instability: a genomic exploration of colorectal cancer from our GEO Data Lake. The raw data and raw counts, along with metadata, is publicly available on GEO at GSE41258.

We will be using cmapR to read the dataset available in the GCT format. For understanding the structure of a GCT file, please follow the link. Using cmapR we will parse GCT and separate the matrix and individual metadata components.

In [5]:
gct_obj <- cmapR::parse_gctx('GSE41258_GPL96_curated.gct')
counts_matrix <- as.data.frame(gct_obj@mat)
col_metadata <- gct_obj@cdesc
row_metadata <- gct_obj@rdesc
head(counts_matrix)
parsing as GCT v1.3

GSE41258_GPL96_curated.gct 12414 rows, 390 cols, 0 row descriptors, 65 col descriptors

A data.frame: 6 × 390
GSM1012278GSM1012279GSM1012280GSM1012281GSM1012282GSM1012283GSM1012284GSM1012285GSM1012286GSM1012287⋯GSM1012658GSM1012659GSM1012660GSM1012661GSM1012662GSM1012663GSM1012664GSM1012665GSM1012666GSM1012667
<dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl>⋯<dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl>
ZNF6740.22650.22651.11770.22650.25100.25100.23880.22650.44360.2510⋯0.2141 0.26300.2510 0.25100.22650.26300.22650.22650.26300.1763
PITPNM30.18900.20160.23880.23880.21410.23880.22650.20160.22650.2388⋯0.2016 0.23880.2388 0.23880.37850.22650.28690.20160.22650.1635
CFAP430.16350.21410.22650.22650.20160.21410.18900.17630.20160.2016⋯0.6690 0.21410.2016 0.21411.52612.11771.58010.33341.11100.7740
SOX11.94860.32190.34480.33340.33340.35610.35610.41140.34480.3448⋯0.3103 0.35610.3896 0.36740.35610.35611.22030.32191.16990.2630
UGT2B40.47510.46470.49570.47510.53610.52610.46470.50590.54602.6112⋯0.4647-0.46590.5361-0.22100.48540.45420.38960.46470.40050.4005
SNTG20.42221.98550.40050.41142.38400.41141.00720.67810.44360.4114⋯1.7655 0.43300.4222 1.49572.06000.40053.19850.38961.48540.3103

Getting the required sample to cohort mapping¶

We are interested in getting Sample to Cohort mapping for the samples in the dataset. The mapping will be useful for annotating samples in the PCA plot and separating primary tumors later on.

In [6]:
metadata <- col_metadata %>% select(geo_accession,tissue.ch1)
colnames(metadata) <- c('Sample','Cohort')
head(metadata)
A data.frame: 6 × 2
SampleCohort
<chr><chr>
GSM1012278GSM1012278Polyp
GSM1012279GSM1012279Liver Metastasis
GSM1012280GSM1012280Liver Metastasis
GSM1012281GSM1012281Liver Metastasis
GSM1012282GSM1012282Liver Metastasis
GSM1012283GSM1012283Liver Metastasis

PCA plot of the dataset¶

Principal Component Analysis (PCA) is a dimensionality reduction method that helps determine the critical variables in a high-dimensional dataset that elucidate the differences in the observations under investigation and essentially allow easy exploration, analysis, and visualizations. In our case, we want to summarize how the expression of genes varies under the cohort conditions in the dataset.

In [7]:
df <- bind_cols(metadata,as.data.frame(t(counts_matrix))) # Attaching Cohort information to the counts matrix
head(df)
A data.frame: 6 × 12416
SampleCohortZNF674PITPNM3CFAP43SOX1UGT2B4SNTG2RPL37APPEF2⋯RPS4Y1SOSTDC1SPINK4SIFCGBPEIF1AYMUC2XISTNXPE4HLA-DQA1
<chr><chr><dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl>⋯<dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl>
1GSM1012278Polyp 0.22650.18900.16351.94860.47510.422212.50681.8156⋯ 7.40943.226510.65286.727912.25864.263010.96588.88267.28540.1506
2GSM1012279Liver Metastasis0.22650.20160.21410.32190.46471.985512.43830.4222⋯10.86880.8237 3.93553.3785 5.65256.9069 7.07682.82176.83290.7225
3GSM1012280Liver Metastasis1.11770.23880.22650.34480.49570.400512.43050.4222⋯10.54110.4647 3.37852.4983 5.56686.3309 6.74152.37857.25740.2987
4GSM1012281Liver Metastasis0.22650.23880.22650.33340.47510.411412.42261.1827⋯10.20460.5160 3.35052.2296 5.46275.4162 6.85802.80747.12930.3561
5GSM1012282Liver Metastasis0.25100.21410.20160.33340.53612.384012.60731.3277⋯10.90691.1309 3.62062.7247 3.93559.4696 4.88263.85806.71423.2434
6GSM1012283Liver Metastasis0.25100.23880.21410.35610.52610.411412.40411.8560⋯ 9.18491.0426 3.07551.1177 3.56076.9189 5.28546.14775.85800.9411
In [8]:
#We will use R's prcomp funtion to calculate principle components to understand the variance in the data in context of different cohorts. 
plotDims <- options(repr.plot.width=10, repr.plot.height=8)
pca_res <- prcomp(df[3:ncol(df)], scale. = TRUE) # The prcomp uses the counts data as an input and yields principle components. 
pca_plot <- autoplot(pca_res, data = df, colour = 'Cohort') # we use autoplot to make PCA plot based on the first two components
pca_plot
options(plotDims)

From the PCA plot, we can observe that even though there is heterogeneity, many primary tumor samples are segregated from other samples.

Diving deep into primary tumor samples¶

For our exploration, we will be focusing only on primary tumors and thus we will be subsetting data accordingly.

In [9]:
primary_tumor_samples_vec <- metadata %>% filter(Cohort == 'Primary Tumor') %>% pull(Sample) # Pulls the samples ids which belong to primary tumor cohort
print(paste('Number of primary tumor samples are:',length(primary_tumor_samples_vec))) # number of primary tumor samples
[1] "Number of primary tumor samples are: 186"
In [10]:
primary_tumor_samples_df <- subset(counts_matrix,select = primary_tumor_samples_vec) %>% as.data.frame()
head(primary_tumor_samples_df)
A data.frame: 6 × 186
GSM1012286GSM1012297GSM1012303GSM1012305GSM1012307GSM1012308GSM1012310GSM1012312GSM1012314GSM1012316⋯GSM1012627GSM1012629GSM1012631GSM1012634GSM1012635GSM1012639GSM1012643GSM1012645GSM1012647GSM1012651
<dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl>⋯<dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl>
ZNF6740.44360.23880.25100.21410.27500.26300.21410.22650.25100.2388⋯0.25100.25100.31030.21410.2510 1.0144 0.20160.2388 0.25100.2510
PITPNM30.22650.21410.23880.20160.23880.25100.20160.21410.23880.2016⋯0.23880.21410.18900.17630.2265 0.2750 0.21411.0496 0.23880.2388
CFAP430.20160.17630.20160.17630.21410.21410.18900.20160.20160.1763⋯0.20160.20160.70490.12430.2016 0.2510 0.44361.4168 0.22650.2141
SOX10.34480.33340.35610.31030.34480.33340.32190.34480.33340.5460⋯0.34480.32190.35612.28980.3334 0.2388 0.33340.3103 2.08070.3561
UGT2B40.54600.53610.46470.47510.49570.54600.41140.54600.53610.5656⋯0.54600.49570.40050.41140.5656-1.5105-1.05590.505911.29920.5850
SNTG20.44360.45420.42220.43301.24490.42221.72251.89142.34772.0704⋯0.87970.41141.93731.58500.4436 0.3103 0.95610.4114 0.45420.4647

Exploring the impact of the obesity assosciated signature for CRC severity¶

We identified a list of genes that have either been positively or negatively associated with Colorectal Cancer from in vivo mouse model of diet-induced obesity. The genes are human homologs of the mouse genes that have been taken from the paper.

We have split the list of genes into two groups:

  • Obesity_up: Set of genes which are significantly upregulated (p-value < 0.05) in Obese vs Non-Obese comparison and having a logFC greater than 1
  • Obesity_down: Set of genes which are significantly downregulated (p-value < 0.05) in Obese vs Non-Obese comparison and having a logFC less than -1

Further, we will use the Gene Set Variation Analysis (GSVA) method to study the variation of the gene signatures in the primary tumor samples.

In [11]:
obesity_up = scan('obesity_up_genes.txt', character(), quote = "")
obesity_down = scan('obesity_down_genes.txt', character(), quote = "")
obesity_signature_list <- list("Obesity_Up" = obesity_up, "Obesity_Down" = obesity_down)
obesity_signature_list
$Obesity_Up
  1. ‘SLPI’
  2. ‘RPL23A’
  3. ‘FABP6’
  4. ‘SLC30A10’
  5. ‘THBD’
  6. ‘TNXB’
  7. ‘LYZ’
$Obesity_Down
  1. ‘SCD’
  2. ‘ALB’
  3. ‘SLC2A4’
  4. ‘CDK14’

We are exploring the relevance of obesity-associated signature from in vivo mouse model of diet-induced obesity. The signature has been associated with an increased CRC development but hasn’t been associated with CRC severity. As a first step, we will explore the distribution of expression of these genes in our dataset.

In [12]:
gene_signature_vec <- c(obesity_up,obesity_down)
In [13]:
gene_signature_df <- subset(primary_tumor_samples_df, rownames(primary_tumor_samples_df) %in% gene_signature_vec)
gene_signature_df$Gene <- rownames(gene_signature_df)
gene_signature_df <- gene_signature_df %>% select(Gene,everything())
gene_signature_longform_df <- gather(gene_signature_df,Sample,Expression,2:ncol(gene_signature_df)) %>% arrange(Gene)
head(gene_signature_longform_df)
A data.frame: 6 × 3
GeneSampleExpression
<chr><chr><dbl>
1ALBGSM10122865.2403
2ALBGSM10122975.2095
3ALBGSM10123034.9542
4ALBGSM10123054.4396
5ALBGSM10123074.7495
6ALBGSM10123084.7602

Expression distribution for every gene across samples¶

Violin Plot
The violin plot helps in visualizing the distribution of the data and its probability density. Below we see the distribution of the selected genes across the samples in the dataset:

In [14]:
plotDims <- options(repr.plot.width=15, repr.plot.height=10)
violin_plot <- ggplot(gene_signature_longform_df, aes(x=Gene, y=Expression,fill=Gene)) + 
                geom_violin(trim=FALSE) +
                geom_boxplot(width=0.1, fill="white")
violin_plot
options(plotDims)

We shall further explore the distribution of the selected genes across all the samples using the density plot below.

Density Plot

In [15]:
plotDims <- options(repr.plot.width=15, repr.plot.height=10)
density_plot <- ggplot(data = gene_signature_longform_df, aes(x = Expression)) + 
                    geom_density(fill = "orange") +
                    facet_wrap(~Gene)
density_plot
options(plotDims)

Correlogram
The correlogram or correlation matrix gives a quick overview of the dataset. It shows the relationship between each pair of numeric variables in the dataset. Below, we see a correlogram of the selected genes.

In [16]:
cormat <- gene_signature_df %>% select(-c(Gene)) %>% t() %>% cor()
melted_cormat <- reshape2::melt(cormat)
head(melted_cormat)
A data.frame: 6 × 3
Var1Var2value
<fct><fct><dbl>
1RPL23A RPL23A 1.00000000
2SLC2A4 RPL23A-0.08284083
3ALB RPL23A-0.15836266
4SLC30A10RPL23A-0.14232357
5TNXB RPL23A 0.03109168
6THBD RPL23A-0.05700783
In [17]:
plotDims <- options(repr.plot.width=8, repr.plot.height=8)
corr_plot <- ggplot(data = melted_cormat, aes(x=Var1, y=Var2, fill=value)) + 
                geom_tile(color='white') +
                scale_fill_gradient2(low = "blue", high = "red", mid = "white", 
                    midpoint = 0, limit = c(-1,1), space = "Lab", 
                    name="Pearson\nCorrelation") +
                theme_minimal() +
                theme(
                    axis.text.x = element_text(angle = 45, vjust = 1, size = 12, hjust = 1),
                    axis.text.y = element_text(size = 12),
                    axis.title.x = element_blank(),
                    axis.title.y = element_blank())
corr_plot
options(plotDims)

As shown in the density and the violin plot, the expression of the selected genes is highly variable. Additionally, with the many genes showing a strong correlation with each other in the correlogram, there is a probability of defining the subset of patients based on this gene signature.

GSVA¶

Gene Set Variation Analysis (GSVA) is a tool used to calculate enrichment of a gene signature in a set of samples, the variation analysis is independent of a control-based comparison, and hence can be used to study gene set variation in a large cohort. The GSVA function takes the data matrix and the signature lists as an input and gives output in a matrix with enrichment scores of the signature in each sample.

In [18]:
gsva_res <- gsva(as.matrix(primary_tumor_samples_df), obesity_signature_list, mx.diff=FALSE, verbose=FALSE, parallel.sz=1)
In [19]:
gsva_res #The matrix describing the enrichment scores.
A matrix: 2 × 186 of type dbl
GSM1012286GSM1012297GSM1012303GSM1012305GSM1012307GSM1012308GSM1012310GSM1012312GSM1012314GSM1012316⋯GSM1012627GSM1012629GSM1012631GSM1012634GSM1012635GSM1012639GSM1012643GSM1012645GSM1012647GSM1012651
Obesity_Up0.49056460.2974964-0.3967182 0.4794008 0.2452058 0.2349806-0.3639420 0.51328620.54026540.5766597⋯ 0.3710006-0.2969346-0.3023370-0.32500990.3867863 0.6130981-0.3423690-0.4976638-0.30636430.3438808
Obesity_Down0.37563190.6119070-0.3877518-0.8699893-0.5493138-0.3541175-0.2502015-0.57466920.49479170.3390359⋯-0.2829170 0.4042879-0.4600481 0.67374020.3052455-0.5747815-0.6749073 0.3004835 0.62861100.4566514

Heatmap
Heatmap helps in visualizing the complex data in an easy to understand manner with values depicted by colors.

We will use pheatmap to plot the distribution of GSVA enrichment score across different samples and use the tree cut methods to identify the sub-classes of interest.

In [20]:
plotDims <- options(repr.plot.width=25, repr.plot.height=10)
obesity_heatmap <- pheatmap(gsva_res,cutree_cols = 4,cluster_rows=TRUE, show_rownames=TRUE,cluster_cols=TRUE, 
                     border_color = NA, scale = "row",main="Obesity Up and Down",cex=1)
obesity_heatmap
options(plotDims)

As observed in the heatmap:

  • 2nd tree cut represents subjects that have a lower association of obesity-associated signature as explained by high enrichment of Obesity_down and lower enrichment of Obesity_up signature.
  • 3rd tree cut represents subjects that have a higher association of obesity-associated signature as explained by high enrichment of Obesity_up and lower enrichment of Obesity_down signature.

These groups represent ideal cases for the study of the impact of obesity on CRC severity, and we will be using the further downstream analysis to study the difference between these two groups of samples.

In [21]:
samples_category_list <- sort(cutree(obesity_heatmap$tree_col,k=4))
samples_category_list <- samples_category_list[colnames(gsva_res[,obesity_heatmap$tree_col[["order"]]])]
In [22]:
generate_severity_mapping_df <- function(sample_to_tree_list,tree_cut_number,severity_cohort){
    severity_mapping_df <- stack(sample_to_tree_list[grep(tree_cut_number,sample_to_tree_list)]) %>% rename('Severity'=values,'Sample'=ind) %>% mutate_if(is.factor, as.character)
    severity_mapping_df$Severity[severity_mapping_df$Severity == tree_cut_number] <- severity_cohort
    return(severity_mapping_df)
}
In [23]:
# The cuts are organised in the order [2,4,3,1], Thus, to select 2nd and 3rd tree cut, we will be selecting 3 and 4
high_severity_patients_df <- generate_severity_mapping_df(samples_category_list,3,'high_severity')
print(paste('High Severity Patients are: ',nrow(high_severity_patients_df)))
low_severity_patients_df <- generate_severity_mapping_df(samples_category_list,4,'low_severity')
print(paste('Low Severity Patients are: ',nrow(low_severity_patients_df)))
[1] "High Severity Patients are:  41"
[1] "Low Severity Patients are:  50"
In [24]:
all_patients_metadata_df <- bind_rows(list(high_severity_patients_df,low_severity_patients_df))
rownames(all_patients_metadata_df) <- all_patients_metadata_df$Sample

Differential Expression¶

Differential Expression is a statistical method used to compare the difference in expression of genes between two biological conditions or cohorts. We will be using limma to do differential expression.

In [25]:
# Defining function to perform limma
performing_limma <- function(dataMatrix,metadata,cohortCol,cohortA,cohortB,pvalCorrectionMethod){
    ### Subsetting the metadata to have the samples that belong to the cohorts specified
    metadata <- metadata[metadata[, cohortCol] %in% c(cohortA, cohortB),]
    
    ### Subsetting the counts matrix with samples that belong to the cohorts specified
    dataMatrix <- dataMatrix[, rownames(metadata)]
    dataMatrix <- dataMatrix[!apply(dataMatrix, 1, anyNA), ]
    
    ### Creating design matrix
    condition <- metadata[,cohortCol]
    condition[condition == cohortA] <- 'A'
    condition[condition == cohortB] <- 'B'
    design <- model.matrix(~ condition + 0)
    colnames(design) <- gsub("condition", "", colnames(design))
    
    ### Creating contrast matrix
    contrast_matrix <- limma::makeContrasts(contrasts = c(paste("B", "A", sep = "-")), levels = design)
    
    ### Fitting linear model to log2 normalised expression data
    fit <- limma::lmFit(dataMatrix, design)
    fit <- limma::contrasts.fit(fit, contrast_matrix)
    fit <- limma::eBayes(fit)
    limma_results_df <- limma::topTable(fit, coef = paste("B", "A", sep = "-"), number = nrow(dataMatrix))
    
    ### p value correction
    for (method in pvalCorrectionMethod) {
        if (method == "Bonferroni") {
            p_val_correct_method <- "bonferroni"
        }else {
            p_val_correct_method <- "BH"
        }    
        limma_results_df[, method] <- p.adjust(limma_results_df$P.Value, method = p_val_correct_method)
    }
    return(limma_results_df)
}
In [26]:
high_vs_low_severity_diff_exp <- performing_limma(counts_matrix,all_patients_metadata_df, 'Severity', 'low_severity', 'high_severity', 'BH')
head(high_vs_low_severity_diff_exp)
A data.frame: 6 × 7
logFCAveExprtP.Valueadj.P.ValBBH
<dbl><dbl><dbl><dbl><dbl><dbl><dbl>
ZNF821-0.25552675.908129-6.8712727.250537e-109.000817e-0611.9260819.000817e-06
REXO4-0.23532626.521310-5.8735656.635191e-084.118463e-04 7.8160434.118463e-04
CHRNA6-0.26922936.323099-5.7363781.208910e-075.002471e-04 7.2701875.002471e-04
CXCR1-0.25274136.608862-5.6179382.019418e-076.267263e-04 6.8034696.267263e-04
LMOD1-0.36463776.482519-5.5360412.871550e-077.129485e-04 6.4833367.129485e-04
ICOSLG-0.26594465.608609-5.4615043.948119e-078.168658e-04 6.1938868.168658e-04

A volcano plot is a standard visualization method used for summarising the results of differential expression. The plot simultaneously assesses the statistical significance (p-values) and measure of effect or biological difference (log ratios), which separates genes differentially into two categories, significant and non-significant. This is based on significance and up-regulated and down-regulated based on the direction of effect (log fold changes).

In [27]:
plotting_volcano <- function(diff_exp, pvalCutoff = 0.05, logFCThreshold = 0.5){
    diff_exp$significance <- 'Not Significant'
    diff_exp[which(diff_exp$P.Value < pvalCutoff & abs(diff_exp$logFC) >= logFCThreshold),'significance'] <- 'Significant'
    diff_exp <- diff_exp[order(diff_exp$P.Value), ]
    
    volcano_plot <- ggplot(data=diff_exp, aes(x=logFC, y=-log10(P.Value), col=significance)) + 
        geom_point() +
        xlab("log2 fold change") + 
        ylab("-log10 (p-value)") +
        theme_minimal()
    
    return(volcano_plot)
}
In [28]:
volcano_plot <- plotting_volcano(diff_exp = high_vs_low_severity_diff_exp)
volcano_plot

Pathway Enrichment using GSEA¶

Gene Set Enrichment Analysis (GSEA) is used to interpret the functional profile of the gene set so as to get an insight about the underlying biological processes.

We use the fgsea package available in R to perform GSEA analysis.

In [29]:
# defining the function to perform fgsea
performing_fgsea <- function(diff_exp, gmtfile, pvalcutoff, pvalcolumn,logFCcolumn, minSize, maxSize, nperm){ 
    pathways <- gmtPathways(gmtfile)
    diff_exp <- diff_exp[diff_exp[,pvalcolumn] < pvalcutoff,]
    diff_exp <- diff_exp[order(-diff_exp[logFCcolumn]),]
    ranks <- setNames(diff_exp[,logFCcolumn], rownames(diff_exp))
    fgseaRes <- fgsea(pathways, ranks, minSize = minSize, maxSize=maxSize, nperm=nperm)
    return(fgseaRes)
}

GSEA requires a set of genes to pathway mappings available in the form of a gmt file. For this tutorial, we have used the Gene Ontology Biological Processes gmt file from Broad Institute.

In [30]:
# performing gsea.
fgseaRes <- performing_fgsea(high_vs_low_severity_diff_exp,'c5.go.bp.v7.2.symbols.gmt',
              pvalcutoff = 0.05,pvalcolumn = 'P.Value',logFCcolumn = 'logFC',minSize = 15,maxSize = 500,nperm = 500)
head(fgseaRes,2)
dim(fgseaRes)
A data.table: 2 × 8
pathwaypvalpadjESNESnMoreExtremesizeleadingEdge
<chr><dbl><dbl><dbl><dbl><dbl><int><list>
GO_REPRODUCTION 0.29518070.59075290.2266451.1171642146242EDNRB , CFTR , SLC26A3, BMP4 , ADAM28 , TCF21 , HSD17B2, HSD11B2, PSG6 , AKR1C3 , PI3 , STK4 , HPGD , DEFB1 , BCL2L1 , TSPAN8 , WNT5A , KRT19 , CCR6 , TLR3 , PBX1 , PDGFRA , MSX2 , SIX5 , CEBPA , HSPA2 , SOX9 , NUDT1 , LGALS9 , EPAS1 , PTCH1 , LEF1 , SPINT1 , EGFR , ESR1 , CLIC5 , CNR1
GO_REGULATION_OF_DNA_RECOMBINATION0.71229050.87314900.2578820.8006788254 16ACTR2 , PRDM9 , KMT5B , RAD50 , ERCC2 , STAT6 , SETD2 , FOXP3 , SIRT6 , CD40 , TBX21 , MSH6 , MMS19 , BLM , TIMELESS, WRAP53
  1. 1509
  2. 8

Selecting the top 10 negatively and positively enriched significant pathways.

In [31]:
topPathwaysUp <- fgseaRes[NES >= 0 & pval < 0.05][head(order(pval), n=10)] # Top 10 up pathways with pval less than 0.05
topPathwaysDown <- fgseaRes[NES < 0 & pval < 0.05][head(order(pval), n=10)]# Top 10 down pathways with pval less than 0.05
topPathways <- rbind(topPathwaysUp,topPathwaysDown)
head(topPathways,2)
dim(topPathways)
A data.table: 2 × 8
pathwaypvalpadjESNESnMoreExtremesizeleadingEdge
<chr><dbl><dbl><dbl><dbl><dbl><int><list>
GO_RESPONSE_TO_BIOTIC_STIMULUS0.0019960080.093831610.39235001.9620710271HLA-DRB4, CXCL14 , EDNRB , MUC2 , JCHAIN , ZG16 , DMBT1 , PTK6 , SLPI , IGLC1 , CEACAM1 , TRIM31 , ISG15 , IFI27 , MNDA , CR2 , IGKC , IFIT1 , PLAC8 , PI3 , VNN1 , CD55 , RSAD2 , DDX60 , IFI6 , OAS1 , HPGD , MX1 , DEFB1 , MUC13 , TRIM22 , BANK1 , BCL2L1 , WNT5A , PSMB8 , CLDN1 , SERINC3 , LYZ , RTP4 , PYCARD , TLR3 , PRSS2 , OAS2 , PLSCR1 , PRSS3 , PSMB9 , TP53 , MX2 , IFI35 , HLA-F , CD24 , MFAP4 , IFIH1 , PTGER2 , CCL11 , HLA-B , LGALS9 , APOBEC1 , APPL2 , ZBP1 , IL22RA1 , CXCL1 , TXNIP , ERAP1 , CNR1 , ISG20 , DDX58 , EIF2AK2 , CORO1A , USP18 , OAS3 , NMI , ZC3HAV1 , EPHA2 , A2M , HNMT , ACTR2 , GNG12 , RIOK3 , HERC6 , PSMB10 , CAPN2 , KIF16B , LITAF , HLA-G , LGALS3 , KRT8 , IRF9 , HLA-A , CCR7 , ZNF175 , ELF4 , TRIM14 , TIGAR , CDHR2 , HLA-E , MST1R , TRIM38
GO_IMMUNE_EFFECTOR_PROCESS 0.0020000000.093831610.40899632.0092930222OLFM4 , CEACAM6 , GCNT3 , DMBT1 , SLPI , IGLC1 , CEACAM1 , CDH17 , AOC1 , CXCR2 , ISG15 , IGLV1-44, IFI27 , MNDA , CR2 , IGKC , CEACAM3 , IFIT1 , PLAC8 , VNN1 , CD55 , RSAD2 , DDX60 , PTPRN2 , IFI6 , OAS1 , MX1 , TRIM22 , BCL2L1 , FOXF1 , WNT5A , TCN1 , SERINC3 , LYZ , RTP4 , CCR6 , PYCARD , TLR3 , PRSS2 , OAS2 , PLSCR1 , PRSS3 , TP53 , IL5 , MX2 , HLA-F , MFAP4 , IFIH1 , CLC , HLA-B , LGALS9 , APOBEC1 , APPL2 , ATP8A1 , ZBP1 , CXCL1 , LEF1 , CTSA , ISG20 , DDX58 , CTSS , BTN3A3 , EIF2AK2 , CORO1A , RHOF , OAS3 , ZC3HAV1 , A2M , JUP , ACTR2 , MVP
  1. 20
  2. 8

We shall visualize the top 10 positively and negatively enriched pathways

In [32]:
# Summarising results using barplot.
topPathways$pathway <- factor(topPathways$pathway, levels = topPathways$pathway[order(topPathways$NES)])
plotDims <- options(repr.plot.width=12, repr.plot.height=8)
fgsea_viz <- ggplot(topPathways, aes(x = pathway, y = NES)) +
                geom_col(aes(fill = NES), width = 0.7) +
                labs(x = 'Pathways', y = 'Normalized enrichment score') +
                theme_minimal() +
                theme(
                    axis.text.x = element_text(size = 12),
                    axis.text.y = element_text(size = 12)) +
                coord_flip()
fgsea_viz
options(plotDims)