With the pharmaceutical industry's increasing shift towards data-driven R&D, the implementation of machine learning models has gained momentum. To realize the promise of ML initiatives for insight discovery, the availability of high-quality data is an absolute prerequisite. Our platform, Polly, is geared towards delivering machine-learning-ready data and enabling scientists to derive maximum value from it. In this blog, we present the IDEATE framework: the foundational principles that guide our vision of solving data-centric problems to empower data-driven drug discovery teams.
The Current State of Biomedical Data
Public data is available, but not usable.
Currently, public omics data availability is on the rise despite variable adoption and an evolving landscape.
The number of new datasets added to the public domain every year is impressive but pales in comparison to the number of publications, since many studies are still conducted without contributing to public data [1]. Despite many researchers and pharmaceutical companies still not making their data publicly available, omics data in the public domain continues to grow at breakneck speed [2]. Advancements in technology and reductions in the cost of sequencing have enabled institutes such as the Broad Institute to be more ambitious and generate omics data in quantities never seen before.
An abundance of clean data is a prerequisite for machine learning.
With the increase in data, we also see a corresponding rise in publications focused on developing algorithms that derive better insights. These publications have two things in common: they clean the data first, and they apply their algorithms to a small number of datasets. In fact, a few datasets are cited far more often than others, and the number of citations often correlates with whether clean data was available and with the quality and openness of the repository. One of the most prominent examples of what clean data can accomplish is AlphaFold, which was trained on the largest clean, curated dataset of its kind. ML models and algorithms are making huge strides in the use of omics datasets, and this will require a lot of curated data.
The use of public data is not institutionalized, often limited to one type of omics, and is typically only used in the context of understanding self-generated data.
Scientists have only just started to use publicly available multi-omics datasets, and current usage is highly skewed towards transcriptomics. Transcriptomics datasets are the most highly cited: more than 30,000 datasets average 11 citations each [2], while in genomics, proteomics, and metabolomics, most datasets are cited only once [2]. This may reflect the FAIR practices [3] followed for each kind of dataset, but we don't know for sure. The world today is thus at the cusp of a public data revolution; its trajectory and impact will depend on the kinds of insights public data provides in the future.
Data Consumption of Tomorrow
- The first step in starting a new research arm in a pharmaceutical company or a new biotech company will be to mine existing data, rather than to perform wet-lab experiments to generate new data. Making effective use of existing data to generate initial hypotheses can speed up pre-clinical research and reduce the overall cost of a research program.
- Publicly available data will play a significant role in validating hypotheses that scientists have generated based on their previous work. This will be used to enrich their existing data, leading to newer insights.
- There is a consensus in the research community that drug discovery and associated research is, fundamentally, a systems biology problem requiring the integration of many different types of data. Therefore, the data that will be the most sought after will be diverse and include multi-omics data, clinical data, phenotypic data, etc.
- AI will be used to automate the routine, low-level analyses and will help scientists draw higher-level insights from the underlying data. To increase the trust of the community in the available data and to encourage its use, the AI algorithms and the underlying data should be transparent and not a “black box”.
- Multi-disciplinary teams of research scientists and computational scientists will work together more closely and more frequently than they do today.
The IDEATE Framework
The IDEATE framework, detailed below, defines the key aspects of the data-driven problems that we solve through our platform, Polly.
Our Data Connectors bring data to Polly at scale from controlled repositories, publications, or your own proprietary sources. With our ML-driven Curation Infrastructure, data enrichment is automated. End-to-end data pre-processing (identifier mapping, alignment, normalization, quality checks) is orchestrated on our Analysis Pipeline.
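To make the pre-processing steps concrete, here is a minimal sketch of identifier mapping, normalization, and a quality check on one sample. The probe-to-gene mapping, the counts-per-million normalization, the QC threshold, and all names are illustrative assumptions, not Polly's actual pipeline.

```python
# Sketch of an omics pre-processing step: identifier mapping,
# normalization, and a simple quality check.
# All identifiers and thresholds are illustrative, not Polly's actual pipeline.

# Hypothetical probe-to-gene-symbol mapping (identifier mapping step)
PROBE_TO_SYMBOL = {
    "probe_001": "TP53",
    "probe_002": "BRCA1",
    "probe_003": "EGFR",
}

MIN_LIBRARY_SIZE = 1_000  # illustrative QC threshold on total mapped counts


def preprocess(raw_counts):
    """Map identifiers, normalize to counts-per-million, and QC one sample.

    raw_counts: dict of probe id -> raw read count.
    Returns (cpm, passed_qc).
    """
    # 1. Identifier mapping: keep only probes with a known gene symbol
    mapped = {
        PROBE_TO_SYMBOL[p]: c for p, c in raw_counts.items() if p in PROBE_TO_SYMBOL
    }
    # 2. Quality check: flag samples with too few total reads
    total = sum(mapped.values())
    passed_qc = total >= MIN_LIBRARY_SIZE
    # 3. Normalization: counts-per-million makes samples comparable
    cpm = {gene: c / total * 1e6 for gene, c in mapped.items()} if total else {}
    return cpm, passed_qc


sample = {"probe_001": 500, "probe_002": 1500, "probe_003": 2000, "probe_x": 10}
cpm, ok = preprocess(sample)
```

In a real pipeline each of these steps would be a separately configurable stage; the point here is only the order of operations: map identifiers first, check quality, then normalize.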
The depth of data curation enables both intuitive point-and-click filtering and advanced querying through code, across our entire data catalog of OmixAtlases. Polly Libraries let you perform complex queries across different types of datasets, samples, and specific features. Our curation pipeline provides rich, harmonized metadata annotations with scientific context and high accuracy, streamlining your search for data.
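The kind of programmatic metadata query described above can be pictured as plain SQL over a dataset-metadata table. The schema, field names, and values below are hypothetical, chosen only to illustrate what querying harmonized metadata looks like; they are not the actual OmixAtlas schema or the Polly Libraries API.

```python
import sqlite3

# Illustrative dataset-metadata table; field names and values are hypothetical,
# not the actual OmixAtlas schema.
conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE datasets (
           dataset_id TEXT, data_type TEXT, disease TEXT, organism TEXT
       )"""
)
conn.executemany(
    "INSERT INTO datasets VALUES (?, ?, ?, ?)",
    [
        ("DS1", "transcriptomics", "breast carcinoma", "Homo sapiens"),
        ("DS2", "proteomics", "breast carcinoma", "Homo sapiens"),
        ("DS3", "transcriptomics", "melanoma", "Mus musculus"),
    ],
)

# A query an analyst might run: all human transcriptomics datasets for a disease
rows = conn.execute(
    """SELECT dataset_id FROM datasets
       WHERE data_type = 'transcriptomics'
         AND disease = 'breast carcinoma'
         AND organism = 'Homo sapiens'"""
).fetchall()
```

Queries like this only work when the metadata is harmonized: "breast carcinoma", "breast cancer", and "BRCA" must already have been mapped to one controlled term, which is exactly what the curation pipeline provides.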
Enrich & Analyze
Analyze ML-ready data by hosting applications and running Notebooks on Polly’s robust computational infrastructure. Polly Libraries offer a high degree of flexibility in slicing and dicing data using code, enabling integrative data analysis. Access and integrate data from Polly through Libraries on your own computational infrastructure.
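As a minimal picture of slicing and integrating data in code, the snippet below joins two omics modalities on their shared sample IDs. The sample names and measurements are made up for illustration and do not come from any real dataset.

```python
# Joining two omics modalities on shared sample IDs (illustrative data).
transcriptomics = {"S1": {"TP53": 12.1}, "S2": {"TP53": 8.4}, "S3": {"TP53": 9.9}}
metabolomics = {"S1": {"lactate": 3.2}, "S3": {"lactate": 5.1}, "S4": {"lactate": 2.7}}

# Keep only samples measured in both modalities, merging their features
shared = sorted(transcriptomics.keys() & metabolomics.keys())
integrated = {s: {**transcriptomics[s], **metabolomics[s]} for s in shared}
```

The same join-on-sample-ID pattern underlies most integrative multi-omics analysis, whatever tooling performs it.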
In-built apps and customizable visualization dashboards enable easy data interpretation. Generate one-click custom reports for your data, code, analyses, and results that you can seamlessly share with your team. Workspaces on Polly let you organize and manage your data in a secure environment.
1. Rustici, G. et al. Transcriptomics data availability and reusability in the transition from microarray to next-generation sequencing. bioRxiv 2020.12.31.425022 (2021) doi:10.1101/2020.12.31.425022.
2. Perez-Riverol, Y. et al. Quantifying the impact of public omics data. Nat. Commun. 10, 1–10 (2019).
3. Wilkinson, M. D. et al. Comment: The FAIR Guiding Principles for scientific data management and stewardship. Sci. Data 3, 1–9 (2016).