Faster Insights on Omics Data Signatures with Polly Discover

Yogesh Lakhotia, Omnya Mohamed Izzeldin
February 12, 2024
What are the upregulated and downregulated genes in response to a treatment?
Are there specific gene signatures associated with disease subtypes or stages?
How are signalling pathways affected by genetic mutations?
How does your in-house data compare with the publicly available data?

These are fundamental questions when researching gene expression data to identify candidate genes and biomarkers associated with diseases. However, addressing these questions using public databases is highly non-trivial. Data quality and variability remain persistent concerns due to variations in experimental protocols, sample sizes, and platform differences. These factors introduce noise and bias, akin to searching for a needle in a haystack when attempting to find and extract meaningful gene signatures from the available data.

Challenges While Exploring Public Bulk RNA-seq Data

Bioinformaticians face several challenges when exploring publicly available bulk RNA-seq data. These challenges arise from the complexity and volume of the data, as well as the need to ensure data quality and extract meaningful biological insights. A few notable roadblocks include:

  • Data Heterogeneity: Publicly available RNA-seq data often come from different laboratories, platforms, and experimental conditions. This heterogeneity makes it difficult to compare and integrate datasets effectively.
  • Inconsistency in Data Quality and Preprocessing: GEO (Gene Expression Omnibus), for instance, includes a multitude of gene expression profiles from various experiments, platforms, and sources. Of these, only 2.9% of the records (studies, in lay terms) have been curated retrospectively. Researchers must therefore apply rigorous quality control measures and preprocessing steps to make the data suitable for analysis.
  • Lack of Transparency: Inadequate documentation and clarity in data processing and analysis pipelines pose challenges to the interpretation, optimization, and comparability of RNA-seq data across studies, potentially undermining its reliability and utility in scientific research.

Our Solution: Polly

Elucidata's data harmonization platform, Polly, tackles the challenges of data heterogeneity in open-source databases by integrating and standardizing diverse datasets. Polly ensures data quality through rigorous preprocessing and provides transparent documentation of the analysis pipelines, enabling researchers to derive reliable insights efficiently.

How does Polly make Data ML-ready?
Feature: Description
  • Metadata Harmonization and Data Standardization: Polly's harmonization engine standardizes and harmonizes data related to samples and experimental conditions.
  • Stringent Quality Checks in Data Ingestion and Processing: Rigorous quality checks during the data ingestion and processing stages identify and rectify errors or anomalies.
  • Customizable Processing: Data processing pipelines can be tailored to meet the unique requirements of different research projects and applications.
  • Ensuring Transparency in the End-to-End Process: (1) Documentation of steps, parameters, and methods applied to the process. (2) Facilitates understanding and reproducibility of analyses.
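
As a rough illustration of what metadata harmonization involves, the sketch below maps free-text sample annotations to a controlled vocabulary. The synonym table, the field names, and the `harmonize_sample` function are hypothetical and are not Polly's harmonization engine or its API.

```python
# Hypothetical synonym table: free-text labels -> controlled vocabulary terms.
DISEASE_SYNONYMS = {
    "uc": "Ulcerative Colitis",
    "ulcerative colitis": "Ulcerative Colitis",
    "colitis, ulcerative": "Ulcerative Colitis",
    "cd": "Crohn's Disease",
    "crohn disease": "Crohn's Disease",
    "healthy": "Normal",
    "control": "Normal",
    "normal": "Normal",
}


def harmonize_sample(raw: dict) -> dict:
    """Return a copy of a sample record with a standardized 'disease' field."""
    label = str(raw.get("disease", "")).strip().lower()
    harmonized = dict(raw)
    # Unknown labels are flagged rather than guessed, so a curator can review them.
    harmonized["disease"] = DISEASE_SYNONYMS.get(label, "UNMAPPED")
    return harmonized


# Illustrative records only (assumed field names).
samples = [
    {"sample_id": "S1", "disease": "UC"},
    {"sample_id": "S2", "disease": " Healthy "},
    {"sample_id": "S3", "disease": "IBD-U"},
]
harmonized_samples = [harmonize_sample(s) for s in samples]
```

Flagging unmapped labels instead of silently passing them through keeps the curated field searchable and makes gaps in the vocabulary visible.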

These high-quality datasets form a solid foundation for extracting relevant molecular signatures. For further exploration and analysis of these signatures, the platform also provides Polly Discover.

What is Polly Discover?

Polly Discover is an analysis module on the platform that helps users find, extract, and explore biologically important signatures from relevant curated datasets, as well as from comparisons (of cohorts) within datasets. The module provides interactive visualizations that facilitate the interpretation of expression results. Users can enhance these results by incorporating existing knowledge bases and integrating them into meta-analysis methods, machine learning applications, and other tools. For those seeking more advanced visualizations, the data can be streamed to tools like Spotfire using APIs.

Polly Discover - Key Features

  • High-quality metadata curation customized to research needs. Human-readable comparison names are grouped into appropriate categories to ease findability.
  • Full control over the data processing pipelines used, ensuring all data is comparable with in-house findings.
  • 360-degree findability journeys (based on genes, pathways, and other metadata fields) to search across public and in-house data.
  • Fast turnaround times and predictable delivery timelines with tech-enabled processes.
  • Discover robust and consistent gene expression signatures across various comparisons.
  • Integrate with other open-source knowledge bases seamlessly to enrich signatures.
Polly Discover Workflow

Use Case: Finding the Gene Signatures Associated with Ulcerative Colitis in a Few Clicks.

A researcher studying ulcerative colitis aimed to identify specific gene signatures linked to the disease. By comparing their in-house bulk RNA-seq data with publicly available information, they sought to validate their findings and pinpoint potential targets with greater confidence.

For starters, data audits were performed on datasets from sources such as GEO and ArrayExpress to find all the ulcerative colitis-related datasets and store them in an Atlas. Both public and in-house data were processed using the same pipeline, enabling users to generate and compare insights from the two seamlessly.

With Polly Discover,

  • The datasets were deeply curated with the Polly Harmonization Engine to make the following key fields available to users: disease, tissue, drug, cell line, cell type, mouse/rat strain, experimental factors, comparison types, etc. This curation enabled users to find relevant curated datasets within minutes.
  • Each dataset was carefully curated to identify relevant groups and suitable comparisons. For instance, within the GSE112057 dataset, comparisons included Crohn’s Disease vs. normal, Crohn’s Disease vs. colitis, and Polyarticular Arthritis vs. colitis, among others. For each of these comparisons, differentially expressed genes (computed with DESeq2) and enriched pathways from MSigDB are precomputed and stored in Polly’s Atlas. This streamlined approach makes it convenient and efficient to identify gene signatures and grasp the functional significance of the differentially expressed genes.

In this case study, we picked 5 datasets in which ulcerative colitis samples are compared with normal samples. Here’s how one dataset can be consumed with Polly Discover on Polly:

Metadata Curation on Polly

A curated comparison helps identify genes that are known to be biologically relevant to ulcerative colitis. In this dataset, the comparison includes 55 control samples and 43 perturbation samples, with 837 upregulated genes.

Visualize curated comparisons within the dataset

The differentially expressed genes in the dataset can be analyzed further with a volcano plot, which shows each gene's log fold change against its p-value. The gene list can be downloaded and compared with in-house proprietary bulk RNA-seq data for validation.

Volcano Plot
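
To make the volcano plot concrete, here is a minimal sketch of how a gene can be placed on the plot and labeled as upregulated or downregulated. The cutoffs (|log2 fold change| of at least 1 and p-value below 0.05), the function names, and the gene values are assumptions for illustration, not settings taken from Polly Discover; in practice an adjusted p-value is typically used.

```python
import math


def classify_gene(log2_fc: float, p_value: float,
                  fc_cutoff: float = 1.0, p_cutoff: float = 0.05) -> str:
    """Label a gene as 'up', 'down', or 'ns' (not significant)."""
    if p_value < p_cutoff and log2_fc >= fc_cutoff:
        return "up"
    if p_value < p_cutoff and log2_fc <= -fc_cutoff:
        return "down"
    return "ns"


def volcano_point(log2_fc: float, p_value: float) -> tuple:
    """Return (x, y) plot coordinates: x = log2 fold change, y = -log10(p).

    Assumes p_value > 0; a p-value of exactly 0 would need to be clipped first.
    """
    return (log2_fc, -math.log10(p_value))


# Illustrative values only, not results from any dataset in this article.
results = {
    "GENE_A": (2.3, 0.001),
    "GENE_B": (-1.8, 0.01),
    "GENE_C": (0.4, 0.2),
}
labels = {gene: classify_gene(fc, p) for gene, (fc, p) in results.items()}
```

Genes labeled "up" or "down" fall in the upper-right and upper-left regions of the plot, respectively.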

In-house findings can be validated more robustly by cross-comparing log fold change (logFC) values across the 5 datasets. Consistent patterns of gene expression change across datasets help researchers identify more reliable gene signatures associated with ulcerative colitis.

Upregulated genes across the datasets

Notably, all genes consistently demonstrate similar expression patterns across the various studies.

Upregulated genes across 5 datasets for the comparison 'Ulcerative Colitis vs. Normal'.

This approach adds strength to the results by demonstrating the consistency of gene expression patterns across diverse studies conducted by different groups, even in the presence of heterogeneity in experimental conditions, data sources, and time points regarding Ulcerative Colitis.
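
The cross-dataset comparison described above can be sketched as a set intersection: keep only the genes whose logFC clears a cutoff in every dataset. The cutoff of 1.0, the function name, and the dataset and gene names below are placeholders, not values from the five datasets in this case study.

```python
def consistently_upregulated(datasets: dict, min_log_fc: float = 1.0) -> list:
    """Return genes with logFC >= min_log_fc in every dataset, sorted by name."""
    per_dataset_hits = [
        {gene for gene, log_fc in table.items() if log_fc >= min_log_fc}
        for table in datasets.values()
    ]
    if not per_dataset_hits:
        return []
    # A gene missing from any dataset is excluded by the intersection.
    return sorted(set.intersection(*per_dataset_hits))


# Illustrative logFC tables keyed by placeholder dataset names.
datasets = {
    "dataset_1": {"GENE_A": 2.1, "GENE_B": 1.4, "GENE_C": -0.3},
    "dataset_2": {"GENE_A": 1.7, "GENE_B": 0.2, "GENE_C": 1.9},
    "dataset_3": {"GENE_A": 3.0, "GENE_B": 1.1},
}
shared = consistently_upregulated(datasets)
```

Requiring a hit in every dataset is the strictest choice; a looser variant could keep genes that pass in most datasets.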

With Polly Discover, identifying common genes across all curated datasets takes only minutes. Further analysis can be done using open-source tools like GOProfiler, NetworkAnalyst, Cytoscape, etc.

Downstream step (tool): output
  • Functional relevance of these gene sets (GOProfiler): pathways impacted by the set of consistently upregulated genes.
  • Drug repurposing (NetworkAnalyst): drugs that can be used for a given gene target.
  • Gene signaling regulation (NetworkAnalyst): gene signaling regulation.

Using DisGeNET, the researchers identified the genes predominantly mutated in individuals with ulcerative colitis: NOD2, ATG16L1, IL23R, ABCB1, TNFSF15, STAT3, NR1I2, and TLR4. Their objective was to explore where these genes are differentially expressed across biological conditions. With Polly Discover, they found 99 distinct comparisons across biological conditions in which these genes were differentially expressed.

A geneset search on Polly Discover
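
Conceptually, a geneset search like the one above checks a query gene list against the precomputed differentially expressed genes of each comparison. The sketch below shows that idea with plain Python sets. The `search_comparisons` function and the comparison records are hypothetical and do not represent Polly Discover's API or data; only the query genes come from the text.

```python
def search_comparisons(comparisons: dict, query_genes: set) -> dict:
    """Map each comparison name to the query genes found in its DE gene list."""
    hits = {}
    for name, de_genes in comparisons.items():
        overlap = query_genes & set(de_genes)
        if overlap:
            hits[name] = sorted(overlap)
    return hits


# Query genes listed in the text above.
query = {"NOD2", "ATG16L1", "IL23R", "ABCB1", "TNFSF15", "STAT3", "NR1I2", "TLR4"}

# Placeholder comparisons and gene lists, not real Polly records.
comparisons = {
    "comparison_1": ["STAT3", "TLR4", "GENE_X"],
    "comparison_2": ["GENE_Y", "GENE_Z"],
    "comparison_3": ["NOD2"],
}
matches = search_comparisons(comparisons, query)
```

Comparisons with no overlap are dropped, so the size of `matches` is the number of comparisons in which at least one query gene is differentially expressed.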

Impact

1. By using Polly Discover, the researcher was able to validate the in-house findings of their study on ulcerative colitis, saving 70% of the time required by traditional methods.

2. The researcher efficiently identified gene signatures and enriched pathways associated with the disease, enhancing their understanding of ulcerative colitis.

3. With a few clicks, the researcher identified 99 distinct comparisons across biological conditions in which the genes predominantly mutated in individuals with ulcerative colitis were differentially expressed.

Conclusion

Polly Discover on Elucidata's Polly simplifies the complexities of transcriptomics data analysis, providing researchers with a one-stop solution. By addressing challenges in publicly available RNA-seq data, Polly Discover ensures high-quality, harmonized data for efficient exploration.

The use case for Polly Discover is exemplified in a scenario involving the exploration of genes associated with ulcerative colitis. Through Polly's harmonization engine, researchers can compare in-house bulk RNA-seq data with public data, ensuring high confidence in target identification. The platform's curated datasets, comparisons, and precomputed gene signatures streamline the process, offering efficient data exploration.

Connect with us or reach out to us at info@elucidata.io to learn more.

FAQs

What are the key benefits of using Polly for gene target prioritization in patient stratification?

  • Data-Driven Target Selection: Polly integrates multi-omics data to identify key genes relevant to patient subgroups.
  • Accelerated Drug Discovery: The platform prioritizes targets based on disease associations and biomarker relevance, expediting the discovery and validation process.
  • Improved Reproducibility: Harmonized datasets ensure reliable and reproducible findings for target validation.

How does Polly help in training classifier models for patient stratification?

Polly provides pre-processed, harmonized datasets that enable AI/ML model training for patient classification. It supports feature selection, dimensionality reduction, and validation workflows to build robust predictive models for precision medicine applications.

How does Polly assist in defining genetic signatures for different stages of cell differentiation?

Polly analyzes both single-cell and bulk multi-omics data to identify stage-specific genetic markers. By applying machine learning algorithms to detect patterns in gene expression, Polly helps researchers map lineage differentiation and gain insights into disease progression.

What is the process of creating a disease-specific atlas using Polly’s harmonization engine?

Polly builds disease-specific atlases by:

  1. Aggregating multi-omics datasets from curated sources.
  2. Harmonizing data using standardized ontologies.
  3. Annotating datasets with clinical metadata.
  4. Structuring the information into disease-specific cohorts for targeted biomarker and therapeutic research.

How does Polly integrate multiple data types for more reliable patient stratification?

Polly integrates genomics, transcriptomics, proteomics, and clinical data into a unified, multi-dimensional view of patient populations. This helps researchers uncover complex biological relationships and enhances predictive modeling for patient subgroups.

Can Polly handle data quality issues and unstructured data from public repositories?

Yes, Polly automatically processes raw, unstructured data from public sources, addressing missing values, batch effects, and inconsistencies. Its machine learning–driven pipelines filter out noise and standardize data, ensuring higher-quality datasets for seamless analysis.

How does Polly harmonize multi-omic datasets to improve the quality of patient stratification?

Polly's harmonization engine normalizes, processes, and integrates diverse datasets using standard ontologies and metadata frameworks. This ensures consistency, removes batch effects, and enhances the reliability of downstream analyses for precise patient classification.

How does Elucidata's Polly help in overcoming the challenges of patient stratification?

Polly streamlines patient stratification by:

  • Harmonizing and Integrating Multi-omics Data: Polly standardizes data across different sources, making it analysis-ready.
  • Curating High-quality Datasets: The platform ensures datasets are clean, structured, and well-annotated, thereby improving the reliability of downstream analyses.
  • Enabling AI-driven Insights: Polly applies machine learning models to uncover patterns and classify patients effectively.
  • Ensuring Reproducibility and Scalability: Automated pipelines and version-controlled workflows allow for efficient scaling to large datasets while maintaining detailed records of each analysis step, making it easier to reproduce or modify results.

What challenges do researchers face when performing patient stratification using multi-omics data?

Researchers encounter several challenges, including:

  • Data Heterogeneity: Multi-omics data come from different platforms, making integration complex.
  • Data Quality Issues: Public datasets often contain missing values, noise, or inconsistencies.
  • Computational Complexity: Large-scale multi-omics data require significant computational power and expertise to process.
  • Interpretability: Even with powerful analytical methods, extracting clear and meaningful biological insights from high-dimensional data remains a significant challenge.

What is patient stratification, and why is it important for precision medicine?

Patient stratification is the process of categorizing patients into subgroups based on genetic, molecular, or clinical characteristics. This approach is crucial for precision medicine because it identifies which patient populations are most likely to respond to specific treatments, thereby improving therapeutic outcomes and reducing the risk of adverse effects.

What are the key advantages of using Polly for transcriptome profiling and biomarker identification?

Polly provides access to a curated repository of RNA-seq datasets that are consistently processed and enriched with metadata. This harmonization allows researchers to efficiently search for datasets with similar transcriptional profiles, facilitating transcriptome profiling and biomarker identification.

What methodologies does Polly use to identify synergistic drug combinations?

Polly utilizes signature reversal and multivariate gene expression signatures to predict potential drug combinations. By analyzing publicly available transcriptomics data and drug signatures, Polly can identify drugs or compounds that may have therapeutic effects by reversing disease signatures.
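
The idea of signature reversal can be illustrated with a deliberately simple score: the fraction of shared genes that a drug moves in the opposite direction to the disease. This is a simplified stand-in chosen for illustration, not the scoring method Polly uses; the function name and all gene, drug, and logFC values are made up.

```python
def reversal_score(disease_sig: dict, drug_sig: dict) -> float:
    """Fraction of shared genes whose direction of change is opposite."""
    shared = [g for g in disease_sig if g in drug_sig]
    if not shared:
        return 0.0
    # A negative product means the two logFC values have opposite signs.
    opposite = sum(1 for g in shared if disease_sig[g] * drug_sig[g] < 0)
    return opposite / len(shared)


# Illustrative gene -> logFC signatures (placeholder names and values).
disease = {"GENE_A": 2.0, "GENE_B": 1.5, "GENE_C": -1.2, "GENE_D": 0.8}
drugs = {
    "drug_1": {"GENE_A": -1.8, "GENE_B": -0.9, "GENE_C": 1.1, "GENE_D": 0.5},
    "drug_2": {"GENE_A": 1.2, "GENE_B": 0.7, "GENE_C": -0.4},
}
ranked = sorted(drugs, key=lambda d: reversal_score(disease, drugs[d]), reverse=True)
```

A drug that scores close to 1 opposes the disease signature on most shared genes, which is what makes it a candidate worth examining further.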

How does Polly rank datasets similar to a gene signature query?

Polly ranks similar datasets using cosine similarity scores, which measure how closely a dataset's transcriptional profile matches the query signature. This helps researchers quickly find relevant datasets for further analysis and validation.
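
For readers unfamiliar with cosine similarity, the sketch below ranks a few toy profiles against a query signature. It assumes each signature is a gene-to-logFC map and compares only the genes both maps share; those choices, the function names, and the values are assumptions for illustration rather than details of Polly's implementation.

```python
import math


def cosine_similarity(a: dict, b: dict) -> float:
    """Cosine similarity between two gene -> logFC maps over their shared genes."""
    shared = [g for g in a if g in b]
    dot = sum(a[g] * b[g] for g in shared)
    norm_a = math.sqrt(sum(a[g] ** 2 for g in shared))
    norm_b = math.sqrt(sum(b[g] ** 2 for g in shared))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_datasets(query: dict, profiles: dict) -> list:
    """Return (dataset, score) pairs sorted from most to least similar."""
    scores = [(name, cosine_similarity(query, p)) for name, p in profiles.items()]
    return sorted(scores, key=lambda item: item[1], reverse=True)


# Illustrative signatures with placeholder gene and dataset names.
query = {"GENE_A": 1.0, "GENE_B": 2.0, "GENE_C": -1.0}
profiles = {
    "dataset_1": {"GENE_A": 2.0, "GENE_B": 4.0, "GENE_C": -2.0},
    "dataset_2": {"GENE_A": -1.0, "GENE_B": -2.0, "GENE_C": 1.0},
    "dataset_3": {"GENE_A": 1.0, "GENE_B": 0.0, "GENE_C": 1.0},
}
ranking = rank_datasets(query, profiles)
```

Because cosine similarity ignores magnitude, a profile with the same direction of change but larger fold changes still scores near 1, while a mirrored profile scores near -1.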

What steps are involved in creating a query gene signature on Polly?

Researchers define the biological process of interest, select a dataset, preprocess the data, identify differentially expressed genes, and validate the signature. Polly’s platform streamlines this process with expert support and ML-ready datasets.

How does Polly's RNA-Seq Atlas simplify gene signature analysis?

Polly's RNA-Seq Atlas addresses challenges in extracting associated signatures from public databases by providing a curated resource of RNA-seq datasets collected from the Gene Expression Omnibus (GEO). This richly curated resource helps researchers to find datasets with similar transcriptional profiles to their gene sets of interest.

What is gene signature comparison, and why is it important in drug discovery?

Gene signature comparison analyzes gene expression patterns to identify disease-related signatures. It helps researchers find drugs that can reverse disease signatures, aiding in therapeutic discoveries.