When bioinformaticians go walking in the woods, they DON’T take the road less traveled – they take the reliable, well-tested route that is sure to get them home. When it comes to relying on predictions made by computational models, biologists often find it wise to keep their skeptic caps on, and rightly so. Today, biomedical molecular data lies at the heart of such critical research decisions.
Thanks to the large amount of data available today, along with well-established machine learning algorithms, largely automated drug development pipelines can now be envisioned. These pipelines guide or speed up drug discovery, provide a better understanding of diseases and associated biological phenomena, and help plan preclinical wet-lab experiments and even future clinical trials. Here too, data plays a critical role in addressing the low productivity rate that pharmaceutical companies currently face [1].
The reliability of methods, datasets, and the resulting inferences plays a crucial role in backing any discovery. But how do research organizations ensure that the data they rely on is trustworthy in the first place?
Data-informed vs. Data-driven
Most research teams that recognize the value of good data can be described as data-informed or data-driven.
Being ‘data-informed’ is the lowest level of data integration – lab members collect data from various sources, then organize and document it. The team has access to the data, and decision-makers use dashboards or Excel sheets to maintain their datasets. Such research groups use data but don’t employ data science or rely on machine learning to drive critical decisions. For example, data-informed teams might avoid automated infrastructure for obtaining public data, retrieving it manually instead.
Being ‘data-driven’ is the next level of data integration. Data-driven research teams understand the potential of good data and better recognize and utilize the scale and variety of public data available today. Although there are degrees to being data-driven, such teams do not shy away from relying on well-curated datasets, algorithms, and automated pipelines to make data-led decisions.
Is Being Data-Driven Enough?
Although data-driven teams see a lot of success, they are often far from realizing the full potential of the data they handle. In fact, if implemented incorrectly, a data-driven approach may defeat its own purpose. Independently creating high-quality, high-value data lakes is time-consuming. Moreover, doing so amounts to creating a data silo, limited by an exclusive infrastructure [2].
Well-curated datasets remain under-utilized in such a system. Algorithms and automated pipelines are developed and dismantled without ever becoming a shared resource. Additionally, it is hard to reproduce workflows or design machine learning models around a system that is clean and curated but lacks universal usability.
Data-Centricity Makes All the Difference
By contrast, a data-centric approach makes data the centerpiece at every level of the data analysis and interpretation cycle. It marks the highest level of data integration an infrastructure can achieve. A data-centric laboratory’s architecture is built on the reliability and reusability of its core data model, which also sidesteps the problem of information silos.
The root of the information-silo problem is the application-centric approach: research laboratories create projects or buy equipment to respond to a specific need with a deadline, ignoring variables such as data control, quality, and reusable architecture. A data-centric architecture, on the other hand, has a permanent primary core: the data. Applications and analysis pipelines are ephemeral; they live only as long as they are useful. But data endures.
Data-Centric Approach for Biological ML Models
The promise of AI-driven drug design carries several challenges – the need for appropriate datasets, the ability to generate and test evolving biological hypotheses, multi-parameter optimization, reduction in design-make-test-analyze cycle times, and the adaptability of research culture. Amid these challenges, a data-centric approach recognizes that high-quality data significantly improves the training efficiency of ML-driven drug design models. According to Andrew Ng, good data is defined consistently, covers important cases, has timely feedback from production data, and is sized appropriately.
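To make the first two criteria concrete, here is a minimal sketch of what “defined consistently” and “covers important cases” can mean in practice before training begins. The record layout, field names, and synonym map are hypothetical illustrations, not part of any specific tool’s API:

```python
from collections import Counter

def audit_labels(records, label_field="label", synonyms=None):
    """Data-centric check on a labeled dataset (illustrative only).

    Normalizes known label synonyms so one biological concept maps to
    one label ("defined consistently"), then flags under-represented
    classes ("covers important cases").
    """
    synonyms = synonyms or {}
    # Collapse known synonyms to a single canonical label.
    normalized = [synonyms.get(r[label_field], r[label_field]) for r in records]
    counts = Counter(normalized)
    # Flag classes with too few examples for a model to learn from.
    rare = {label for label, n in counts.items() if n < 2}
    return counts, rare

# Hypothetical sample annotations with an inconsistent spelling.
records = [
    {"sample": "s1", "label": "tumor"},
    {"sample": "s2", "label": "Tumour"},  # same concept, different label
    {"sample": "s3", "label": "normal"},
    {"sample": "s4", "label": "normal"},
]
counts, rare = audit_labels(records, synonyms={"Tumour": "tumor"})
# counts → {'tumor': 2, 'normal': 2}; rare → set()
```

Audits like this are cheap to run on every incoming dataset, which is precisely the point of a data-centric workflow: quality is enforced at the data layer rather than patched around inside each model.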
With a data-centric system, healthcare enterprises utilizing ML methods quickly accumulate a large quantity of quality data for building reliable predictive models. Such systems can learn from this data and help decision-makers gain valuable insights into their research problems faster [3]. The use of consistent data is paramount: platforms like Polly, which provide access to quality datasets, allow multiple models to perform well.
Data has empowered drug discovery like never before. We must have machines at our side if we’re to find patterns in the modern-day data deluge. To that end, both high-quality data and a data-centric approach in the laboratory are of paramount importance.
1. Pammolli, F., Magazzini, L., & Riccaboni, M. The productivity crisis in pharmaceutical R&D. Nature Reviews Drug Discovery, 10(6), 428-438 (2011).
2. Khatami, S. G., Mubeen, S., & Hofmann-Apitius, M. Data science in neurodegenerative disease: Its capabilities, limitations, and perspectives. Current Opinion in Neurology, 33(2), 249 (2020).
3. Talevi, A., Morales, J. F., Hather, G., Podichetty, J. T., Kim, S., Bloomingdale, P., et al. Machine learning in drug discovery and development Part 1: a primer. CPT: Pharmacometrics & Systems Pharmacology, 9(3), 129-142 (2020).