As we continue to equip our academic and pharmaceutical partners with machine-actionable biomolecular data, we’re focused on meeting the needs of multi-disciplinary R&D teams across the drug discovery process. Polly’s recent updates include a range of features and optimizations that simplify data discoverability, annotation and harmonization, and integrative data analysis.
OmixAtlas: A comprehensive resource of ML-Ready Omics Data
OmixAtlases on Polly comprise data from several publications & controlled repositories, curated through a standardized pipeline & ready for ML-based workflows. These Atlases encapsulate sample & dataset level metadata fields, consistent ontologies, and samples divided into perturbation & control. This depth & breadth of metadata curation enables intuitive point & click filtering as well as in-depth data exploration through Polly Python on your preferred infrastructure.
In this latest release, data from the following public repositories were made available as OmixAtlases on Polly: GEO, GDC, cBioPortal, Metabolomics, LINCS & PharmacoDB.
Table View for Datasets
Previously, datasets on the OmixAtlas were presented as a list of information panels comprising color-coded metadata fields. Users were limited to sorting by Dataset ID & Number of Samples, reducing the scope for search on Polly.
Structured data from the OmixAtlas is now organized into tables, with metadata fields presented as columns. Datasets can be sorted using the following metadata fields – Dataset ID, No. of Samples, Tissue, Data Type, Disease, Cell Line, Cell Type & Drugs. These sorting options, along with metadata filters, allow for a great degree of granularity in your search for relevant datasets.
This release also adds the option to download dataset files of interest (gct, h5ad, vcf), allowing users to perform downstream analysis in a computational environment of their choice. Switch to card view and navigate to the options button to access the download feature.
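The combination of metadata filters and column sorting described above can be pictured with a small Python sketch. It does not use Polly's interface or Polly Python; the records, field names and the `filter_and_sort` helper are hypothetical and only illustrate the idea of narrowing a dataset table and then ordering it by one column.

```python
from operator import itemgetter

# Hypothetical dataset records. The keys loosely mirror the table columns
# (Dataset ID, No. of Samples, Tissue, Disease); the values are made up.
datasets = [
    {"dataset_id": "DS_003", "n_samples": 12, "tissue": "liver", "disease": "fibrosis"},
    {"dataset_id": "DS_001", "n_samples": 48, "tissue": "lung", "disease": "carcinoma"},
    {"dataset_id": "DS_002", "n_samples": 30, "tissue": "liver", "disease": "carcinoma"},
]


def filter_and_sort(records, filters, sort_by, descending=False):
    """Keep records matching every key/value pair in `filters`, then sort by `sort_by`."""
    matched = [r for r in records if all(r.get(k) == v for k, v in filters.items())]
    return sorted(matched, key=itemgetter(sort_by), reverse=descending)


# Liver datasets, largest sample count first.
liver_by_size = filter_and_sort(datasets, {"tissue": "liver"}, "n_samples", descending=True)
liver_ids = [r["dataset_id"] for r in liver_by_size]
print(liver_ids)
```

Adding more entries to `filters` narrows the result further, which is the granularity that the filters and sort options provide together.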
Clean, usable & normalized data matrices are predominantly unavailable across public databases, presenting monumental challenges related to data findability & usability.
Another hurdle in finding relevant datasets is assembling accurate metadata about the datasets & their samples (tissue of origin, treatment condition, the name of the cell line, and more), which requires hard-to-acquire skills in SQL or prior knowledge of relevant packages. To cut down on the effort, time & resources required to discern whether a dataset is valuable to your research question, we’ve introduced a details page attached to the Datasets displayed on the OmixAtlas.
The page comprises a dashboard showcasing the distribution of samples across some key metadata attributes – disease, cell type, cell line, Tissue & Genetic Modification. It also presents an overview of the dataset, containing information such as Title, publication, Abstract as well as Metadata tags attributed to the dataset.
Further, the page includes a comprehensive sample level metadata table, presenting a snapshot of all the relevant information needed to generate insights from the dataset – Control / Perturbed Sample, Genetic Modification, Drug, Tissue, Cell Type, etc. This information is specific to each OmixAtlas on Polly. For instance, the Metabolomics OmixAtlas contains a sample level metadata table that includes source-specific attributes (MetaboLights, Metabolomics Workbench) such as BMI & Age.
Users may also download processed data files pertaining to the datasets. These files contain an expression matrix, with normalized values for gene expression across samples. A wide range of downstream analysis techniques can be applied to this data – from generating simple heat maps in Excel to running the data through complex ML workflows.
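As an illustration of what working with a downloaded expression matrix can look like, here is a minimal sketch that reads a GCT file with the Python standard library. It assumes the plain GCT v1.2 layout (a version line, a dimensions line, a header row with Name and Description, then one row per feature). Files downloaded from Polly may use a different GCT version with embedded metadata rows, which this sketch does not handle. The `read_gct` helper and the example values are hypothetical.

```python
import io


def read_gct(handle):
    """Parse a GCT v1.2 stream into (sample_ids, {feature_id: [values]})."""
    version = handle.readline().strip()
    if version != "#1.2":
        raise ValueError(f"unsupported GCT version: {version!r}")
    # Second line declares the number of data rows and data columns.
    n_rows, n_cols = (int(x) for x in handle.readline().split()[:2])
    header = handle.readline().rstrip("\n").split("\t")
    samples = header[2:]  # the first two columns are Name and Description
    matrix = {}
    for line in handle:
        if not line.strip():
            continue  # skip blank trailing lines
        fields = line.rstrip("\n").split("\t")
        matrix[fields[0]] = [float(v) for v in fields[2:]]
    if len(matrix) != n_rows or len(samples) != n_cols:
        raise ValueError("matrix shape does not match the declared dimensions")
    return samples, matrix


# Made-up file content, for illustration only.
example = (
    "#1.2\n"
    "2\t3\n"
    "Name\tDescription\tS1\tS2\tS3\n"
    "TP53\tna\t5.1\t4.8\t7.2\n"
    "EGFR\tna\t2.0\t2.4\t1.9\n"
)
samples, matrix = read_gct(io.StringIO(example))

# Per-gene mean across samples, a typical first summary before plotting.
means = {gene: sum(vals) / len(vals) for gene, vals in matrix.items()}
print(samples, means)
```

For a real download, `read_gct(open(path))` would replace the `io.StringIO` call, and the resulting dictionary can be handed to whichever plotting or ML library the analysis uses.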
Request a Dataset relevant to your biological question
In this release, we’ve also introduced the functionality to request the addition of a dataset to public or proprietary OmixAtlases on Polly. Once the request is logged, the dataset is made available to the user in their desired location & curated to custom requirements, irrespective of its source & datatype.
For instance, users may request the curation of additional metadata fields, the processing of raw data with a pipeline of their choice, or any other information required for downstream processing & analysis.
Access Curated DepMap Data on Polly
Over the last month, 2,062 curated datasets (Gene Dependency + Gene Effect) were added from DepMap, a consortium that aims to plot relationships between the genetic alterations of cancer & its dependencies. Data from DepMap allows researchers to uncover biomarkers for various types of cancers, identify therapeutic leads, and carry out patient stratification.
Why access Data from DepMap on Polly?
- Access to 30,000 datasets comprising curated Gene Dependency, Drug Screens, Gene effect & RNAi data. These datasets are enriched with harmonized metadata annotations, processed through a standard pipeline, and made available in consistent tabular formats.
- The DepMap application on Polly is an expert-vetted pipeline that enables users to process gene effect & gene expression matrices, and filter expression data for further downstream analysis. This includes:
– Visual representation of data using Heatmaps & PCA
– Differential Analysis to uncover differentially expressed features (genes, metabolites, or Proteins) associated with predictors and drug responses
– Distribution of genes of interest across lineages with the help of Density & Box Plots.
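To make the differential analysis step more concrete, the sketch below ranks features by Welch's t-statistic between control and perturbed samples. This is not the DepMap application's method, which is not described here; it is a generic illustration using only the Python standard library. The gene names, values and the `rank_features` helper are hypothetical, and the values are assumed to be on a log2 scale so that a difference of means reads as a log2 fold change.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(a, b):
    """Welch's t-statistic for two independent samples with unequal variances.

    Each group needs at least two values, and the groups must not both be constant.
    """
    return (mean(a) - mean(b)) / sqrt(variance(a) / len(a) + variance(b) / len(b))


def rank_features(matrix, groups):
    """Rank features by |t| between 'perturbed' and 'control' columns.

    `matrix` maps feature -> list of values, `groups` labels each column.
    Returns (feature, mean difference, t-statistic) tuples, strongest first.
    """
    ctrl = [i for i, g in enumerate(groups) if g == "control"]
    pert = [i for i, g in enumerate(groups) if g == "perturbed"]
    results = []
    for feature, values in matrix.items():
        c = [values[i] for i in ctrl]
        p = [values[i] for i in pert]
        results.append((feature, mean(p) - mean(c), welch_t(p, c)))
    return sorted(results, key=lambda r: abs(r[2]), reverse=True)


# Made-up log2 expression values: three control and three perturbed samples.
groups = ["control", "control", "control", "perturbed", "perturbed", "perturbed"]
matrix = {
    "GENE_A": [5.0, 5.2, 4.8, 7.0, 7.3, 6.9],
    "GENE_B": [3.1, 2.9, 3.0, 3.0, 3.2, 2.8],
}
ranked = rank_features(matrix, groups)
for feature, diff, t in ranked:
    print(f"{feature}: log2FC={diff:.2f}, t={t:.2f}")
```

A full analysis would also compute p-values and correct for multiple testing across features; both are left out to keep the sketch short.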
Evolving breadth & depth of metadata annotations
An internal benchmark study, conducted by curation experts at Elucidata, suggests that searches across datasets on Polly return roughly 300% more results than the same searches on public sources. Additionally, cohorts of interest can be built 30 times faster on Polly than on source repositories. These gains can be attributed to Polly’s proprietary curation models, which generate harmonized metadata annotations, tag each sample with controlled vocabularies/ontologies & process datasets with uniform molecular identifiers.
Currently, ~4.1 million samples on Polly have been annotated with ~35 million auto-curated labels spanning 14 dataset & sample level fields, including:
- Dataset ID
- Data type
- Cell line
- Cell Type
- Dataset Source
- Genetic Modification Type
- Gene Target
- Total No. Of Samples
Over the last month, 3 new metadata fields (Age, Gender & Genotype) were curated & tagged to 3.9 samples from the GEO OmixAtlas on Polly.
Sorting on Workspaces
To help you parse through projects & data more efficiently, Polly Workspaces now lets you control how files are listed on the content panel through a variety of sorting options.
Earlier, items on workspaces could only be filtered by File Type – Files, Notebooks, Reports & Analysis. In this release, users can sort across several new dimensions, including Name, Type, Last Modified, Size of the File, File Creation Date & Author. This makes it easier to access the latest versions of notebooks, sort through computationally intensive data & find analyses performed by relevant stakeholders.
This sorting functionality also extends to files within folders on a workspace.