Data, data everywhere, not a byte to use!
2 trillion GB of data is generated every year; however, 80% of that data is unstructured and thus unusable. In other words, biomedical data is unFAIR.
A vast amount of biological multi-omics data is generated worldwide at any given point; this data has enormous potential for discovery and reusability for various R&D projects; however, it is extremely hard to search and keep up with all the newly emerging data.
Polly is a data-centric MLOps platform for biomedical data that provides access to FAIR (Findable, Accessible, Interoperable, and Reusable) multi-omics data from public and proprietary sources. Data from these sources is harmonized and curated using ML models, ensuring that it is machine-actionable and analysis-ready. Polly’s cloud infrastructure enables seamless data analysis, visualization, and sharing by offering a toolbox of scalable, easy-to-customize bioinformatics pipelines.
Data is ingested from different sources like publications and databases (public like GEO or proprietary), and made machine-actionable on Polly. All datasets are stored in a consistent file format that is analysis-ready.
Every source has different protocols for accessing the data. One way would be to manually download the data and keep it in our infrastructure. But that would make data untraceable and we would need to manually keep track of new datasets. Specific ETL pipelines called connectors are designed to help solve these challenges to a great extent.
Connectors enable us to download datasets from a particular source and keep track of any new datasets. Apart from downloading data, a connector is also responsible for data harmonization, i.e., the process of combining data with varying file formats, naming conventions, and columns and transforming it into one cohesive dataset. These ETL pipelines thus facilitate seamless data ingestion and metadata harmonization.
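To make the harmonization step concrete, here is a minimal sketch of what a connector might do when two sources describe the same fields under different column names. The mapping table, the function names, and the sample records are all hypothetical and are not taken from Polly's actual connectors.

```python
# Hypothetical mapping from source-specific column names to one canonical schema.
CANONICAL_COLUMNS = {
    "organism": "organism",
    "species": "organism",
    "tissue": "tissue",
    "tissue_type": "tissue",
    "disease": "disease",
    "disease_state": "disease",
}


def harmonize_record(record):
    """Rename known columns to canonical names and normalize their values."""
    harmonized = {}
    for key, value in record.items():
        normalized_key = key.strip().lower().replace(" ", "_")
        canonical = CANONICAL_COLUMNS.get(normalized_key)
        if canonical is None:
            continue  # drop columns the canonical schema does not know about
        harmonized[canonical] = str(value).strip().lower()
    return harmonized


def harmonize(records):
    """Apply harmonize_record to every record, regardless of its source."""
    return [harmonize_record(r) for r in records]


# Two made-up sources that describe the same sample in different ways.
source_a = [{"Species": "Homo sapiens", "Tissue Type": "Liver"}]
source_b = [{"organism": "homo sapiens ", "tissue": "liver", "submitter": "x"}]

combined = harmonize(source_a + source_b)
print(combined)
```

After harmonization, both records share the same keys and value formatting, so they can be stored and queried together.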
Metadata annotation is crucial for improving the quality of datasets. There are more than a million datasets on Polly, so annotating all of them manually would not scale. We therefore use an MLOps pipeline that annotates most of our datasets automatically.
BERT is one of the most widely adopted models on NLP benchmarks, and it has since been applied to a wide range of natural language processing (NLP) tasks. Language models of this kind scan biomedical literature and extract information that is later used to enhance search. PollyBERT, built on top of BERT, enriches the way we access metadata from various data sources.
A central pillar of PollyBERT (Polly’s curation infrastructure) is the use of ontologies and controlled vocabularies for annotation of metadata fields such as disease, organism, cell line, tissue, cell type, drugs, genotypic perturbation, chemical perturbation, etc. Access to these annotations gives users powerful mechanisms to query this data. Through our curation pipeline, the metadata is harmonized using ontologies and the data is saved in accessible formats either as gct files which support a lot of omics and non-omics data, or as h5ad files which support larger, complex data like single-cell RNAseq.
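As an illustration of why controlled vocabularies make metadata easier to query, the sketch below maps free-text organism values to a single ontology term. It uses a plain dictionary lookup rather than a language model, so it is not how PollyBERT works; the vocabulary table and function names are hypothetical, while the two NCBI Taxonomy identifiers are the standard ones for human and mouse.

```python
# Hypothetical controlled vocabulary: free-text synonym -> (ontology ID, preferred label).
ORGANISM_VOCABULARY = {
    "human": ("NCBITaxon:9606", "Homo sapiens"),
    "homo sapiens": ("NCBITaxon:9606", "Homo sapiens"),
    "mouse": ("NCBITaxon:10090", "Mus musculus"),
    "mus musculus": ("NCBITaxon:10090", "Mus musculus"),
}


def annotate_organism(raw_value):
    """Attach an ontology ID and preferred label to a free-text organism value."""
    key = " ".join(raw_value.lower().split())  # lowercase and collapse whitespace
    match = ORGANISM_VOCABULARY.get(key)
    if match is None:
        # Unknown values are kept but left unannotated.
        return {"raw": raw_value, "ontology_id": None, "label": None}
    ontology_id, label = match
    return {"raw": raw_value, "ontology_id": ontology_id, "label": label}


def filter_by_term(annotations, ontology_id):
    """Return every annotation that resolved to the given ontology term."""
    return [a for a in annotations if a["ontology_id"] == ontology_id]


raw_values = ["  Human ", "Homo sapiens", "mouse", "zebrafish"]
annotations = [annotate_organism(v) for v in raw_values]
human_samples = filter_by_term(annotations, "NCBITaxon:9606")
print(len(human_samples))
```

Because both "Human" and "Homo sapiens" resolve to the same term, a single query on the ontology ID finds them both.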
The manual curation infrastructure generates training data, which is used to train these machine learning models. The models are deployed on Amazon SageMaker and can be accessed via APIs.
The clean, curated and annotated data is stored in a repository on Polly called OmixAtlas.
OmixAtlas is a collection of millions of datasets from public, proprietary, and licensed sources that have been curated, harmonized and made ready for downstream machine learning and analytical applications. It is one central location to access data across more than 26 data types from over 30 public repositories and licensed sources. Our offerings can be categorized as Public OmixAtlas or Enterprise OmixAtlas.
These datasets can be accessed through the GUI or programmatically with Polly Python. Computational requirements can be scaled based on the complexity of the job using Polly's notebooks, dockers, and machine types.
Polly-python is a library that makes it easy for users to search and access rich multi-omics data linked with metadata.
With Polly-python one can:
Polly Notebook is a scalable analytics platform that allows us to perform data analysis remotely in a Jupyter-like notebook. It provides the flexibility to select the compute capacity, and the environment as per our needs.
Polly CLI (Command Line Interface) is a tool that enables bioinformaticians to interact with Polly services using commands in their command-line shell. It lets them upload data and run jobs on the Polly cloud infrastructure, scaling computation resources as needed. It also allows users to start and stop jobs, monitor them, and view logs.
Contact us if you want to learn more about using our 1.5 million curated datasets to train your models or about taking advantage of our data-centric platform, Polly, to find and analyze relevant datasets.
Polly provides pre-processed, harmonized datasets that enable AI/ML model training for patient classification. It supports feature selection, dimensionality reduction, and validation workflows to build robust predictive models for precision medicine applications.
Polly analyzes both single-cell and bulk multi-omics data to identify stage-specific genetic markers. By applying machine learning algorithms to detect patterns in gene expression, Polly helps researchers map lineage differentiation and gain insights into disease progression.
Polly builds disease-specific atlases by:
Polly integrates genomics, transcriptomics, proteomics, and clinical data into a unified, multi-dimensional view of patient populations. This helps researchers uncover complex biological relationships and enhances predictive modeling for patient subgroups.
Yes, Polly automatically processes raw, unstructured data from public sources, addressing missing values, batch effects, and inconsistencies. Its machine learning–driven pipelines filter out noise and standardize data, ensuring higher-quality datasets for seamless analysis.
Polly's harmonization engine normalizes, processes, and integrates diverse datasets using standard ontologies and metadata frameworks. This ensures consistency, removes batch effects, and enhances the reliability of downstream analyses for precise patient classification.
Polly streamlines patient stratification by:
Researchers encounter several challenges, including:
Patient stratification is the process of categorizing patients into subgroups based on genetic, molecular, or clinical characteristics. This approach is crucial for precision medicine because it identifies which patient populations are most likely to respond to specific treatments, thereby improving therapeutic outcomes and reducing the risk of adverse effects.
Polly provides access to a curated repository of RNA-seq datasets that are consistently processed and enriched with metadata. This harmonization allows researchers to efficiently search for datasets with similar transcriptional profiles, facilitating transcriptome profiling and biomarker identification.
Polly utilizes signature reversal and multivariate gene expression signatures to predict potential drug combinations. By analyzing publicly available transcriptomics data and drug signatures, Polly can identify drugs or compounds that may have therapeutic effects by reversing disease signatures.
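The idea of signature reversal can be illustrated with a deliberately simplified score: the fraction of shared genes whose direction of change a compound flips relative to the disease. This is only a sketch under stated assumptions and not Polly's actual scoring method; the gene names, compound names, and fold-change values below are invented for illustration.

```python
def reversal_score(disease_signature, drug_signature):
    """Fraction of shared genes whose direction the drug flips.

    Both signatures map gene name -> signed change (e.g. log fold change).
    A score of 1.0 means every shared gene moves in the opposite direction.
    """
    shared = [g for g in disease_signature if g in drug_signature]
    if not shared:
        return 0.0
    flipped = sum(
        1 for g in shared if disease_signature[g] * drug_signature[g] < 0
    )
    return flipped / len(shared)


# Invented example data: positive = up-regulated, negative = down-regulated.
disease = {"GENE_A": 2.1, "GENE_B": -1.4, "GENE_C": 0.9}
drugs = {
    "compound_1": {"GENE_A": -1.8, "GENE_B": 1.1, "GENE_C": -0.5},
    "compound_2": {"GENE_A": 1.5, "GENE_B": -0.7, "GENE_C": 0.2},
    "compound_3": {"GENE_A": -0.9, "GENE_B": -0.3},
}

# Compounds that reverse more of the disease signature rank first.
ranked = sorted(
    drugs,
    key=lambda name: reversal_score(disease, drugs[name]),
    reverse=True,
)
print(ranked)
```

In practice, a reversal score would also weigh the magnitude of each change and its statistical significance; the sign-only version here is kept small to show the principle.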
Polly ranks similar datasets using cosine similarity scores, which measure how closely a dataset's transcriptional profile matches the query signature. This helps researchers quickly find relevant datasets for further analysis and validation.
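For readers unfamiliar with the metric, cosine similarity is the dot product of two vectors divided by the product of their lengths, giving a value between -1 (opposite) and 1 (same direction). The sketch below ranks a few made-up dataset profiles against a query signature; the dataset names, vectors, and helper functions are hypothetical, and each vector is assumed to hold expression changes for the same genes in the same order.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # similarity is undefined for a zero vector; treat as no match
    return dot / (norm_u * norm_v)


def rank_datasets(query, profiles):
    """Return (name, score) pairs sorted from most to least similar."""
    scored = [(name, cosine_similarity(query, vec)) for name, vec in profiles.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Made-up profiles over the same three genes, in the same order as the query.
query = [1.0, -1.0, 0.5]
profiles = {
    "dataset_a": [2.0, -2.0, 1.0],   # same direction as the query
    "dataset_b": [-1.0, 1.0, -0.5],  # opposite direction
    "dataset_c": [1.0, 1.0, 0.0],    # orthogonal to the query
}

ranking = rank_datasets(query, profiles)
for name, score in ranking:
    print(name, round(score, 3))
```

Because cosine similarity ignores vector length, dataset_a scores as a perfect match even though its values are twice as large as the query's.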
Researchers define the biological process of interest, select a dataset, preprocess the data, identify differentially expressed genes, and validate the signature. Polly’s platform streamlines this process with expert support and ML-ready datasets.
Polly's RNA-Seq Atlas addresses challenges in extracting associated signatures from public databases by providing a curated resource of RNA-seq datasets collected from the Gene Expression Omnibus (GEO). This richly curated resource helps researchers to find datasets with similar transcriptional profiles to their gene sets of interest.
Gene signature comparison analyzes gene expression patterns to identify disease-related signatures. It helps researchers find drugs that can reverse disease signatures, aiding in therapeutic discoveries.