“Data! Data! Data! I can’t make bricks without clay.” - Sir Arthur Conan Doyle
Ask Sherlock Holmes about the importance of good data when it comes to resolving a conundrum. Most data scientists would agree: building a robust Machine Learning (ML) model without good datasets is challenging.
The use of ML in clinical studies has grown as biomedical scientists face exponentially growing data volumes and benefit from advances in computing infrastructure. ML is now applied to biomedical omics data for patient risk stratification in clinical trials and for drug target validation, among other applications. But is our model-centric approach to ML advantageous?
Model-Centric vs. Data-Centric Approach
Data scientists spend only about 20% of their time training ML systems; the rest goes into retrieving and preparing datasets for the ML process. Despite this, most industrial and academic research labs follow a model-centric approach to ML. In this approach, datasets remain fixed while the code is optimized to improve model performance. A data-centric approach, on the other hand, improves the data itself through iteration. For biomedical omics data, this is a challenging task because large, uniformly curated datasets are rarely available.
What does quality omics data look like anyway?
Publicly available omics data is often scattered across multiple sources. Different sources may follow different labeling conventions for patient samples, gene annotation schemes, and file storage formats. Considerable effort and resources go into combining and preparing such datasets for further analysis and ML integration.
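To make the harmonization effort concrete, here is a minimal Python sketch of mapping sample labels from two sources onto one convention. The label vocabulary, the sample records, and the function names are hypothetical and chosen only for illustration; real repositories use their own conventions, and a real pipeline would need a curated mapping.

```python
# Hypothetical mapping from raw labels to a canonical vocabulary.
# The entries are illustrative assumptions, not taken from any real repository.
CANONICAL_LABELS = {
    "tumor": "tumor",
    "tumour": "tumor",
    "primary tumor": "tumor",
    "normal": "normal",
    "healthy control": "normal",
    "ctrl": "normal",
}


def harmonize_label(raw):
    """Map a raw sample label to its canonical form, or None if unknown."""
    # Normalize case, underscores, and repeated whitespace before lookup.
    key = " ".join(raw.strip().lower().replace("_", " ").split())
    return CANONICAL_LABELS.get(key)


def harmonize_samples(samples):
    """Split samples into harmonized records and records needing manual review."""
    harmonized, unresolved = [], []
    for sample in samples:
        label = harmonize_label(sample["label"])
        if label is None:
            unresolved.append(sample)
        else:
            harmonized.append({**sample, "label": label})
    return harmonized, unresolved


# Two made-up sources that follow different labeling conventions.
source_a = [{"id": "A1", "label": "Tumour"}, {"id": "A2", "label": "Healthy_Control"}]
source_b = [{"id": "B1", "label": "primary tumor"}, {"id": "B2", "label": "N/A"}]

harmonized, unresolved = harmonize_samples(source_a + source_b)
```

Labels that cannot be mapped are set aside for review rather than guessed, so the combined dataset never silently gains a wrong label.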
When dealing with hundreds of omics datasets, large numbers of inconsistent labels harm ML systems: a model trained on them learns the patterns created by erroneous labels as if they were correct. A data-centric approach, in this situation, focuses on selecting datasets with consistent and accurate labels. Additionally, good-quality omics datasets cover the important cases well, are updated with timely feedback from production data, and are sized appropriately.
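One simple way to screen for inconsistent labels is to check whether the same sample carries different labels in different datasets. The Python sketch below assumes records arrive as (sample_id, label) pairs after harmonization; the function names and the example records are hypothetical.

```python
from collections import defaultdict


def find_label_conflicts(records):
    """Return {sample_id: sorted labels} for samples with more than one distinct label."""
    labels_by_sample = defaultdict(set)
    for sample_id, label in records:
        labels_by_sample[sample_id].add(label)
    return {
        sample_id: sorted(labels)
        for sample_id, labels in labels_by_sample.items()
        if len(labels) > 1
    }


def conflict_rate(records):
    """Fraction of distinct samples that carry conflicting labels."""
    sample_ids = {sample_id for sample_id, _ in records}
    if not sample_ids:
        return 0.0
    return len(find_label_conflicts(records)) / len(sample_ids)


# Made-up records: sample S2 is labeled differently by two sources.
records = [
    ("S1", "tumor"), ("S1", "tumor"),
    ("S2", "normal"), ("S2", "tumor"),
    ("S3", "normal"),
    ("S4", "normal"),
]

conflicts = find_label_conflicts(records)
rate = conflict_rate(records)
```

A conflict rate like this could serve as one criterion when deciding which datasets to include; the acceptable threshold is a project-specific choice that this sketch does not prescribe.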
Quality data at the core of MLOps
A focus on high-quality omics datasets brings important benefits to bioinformaticians and data scientists working in clinical settings. Often, improving the quality of datasets readily improves the accuracy of data interpretation without the hassle of rewriting much code. Additionally, quality biomedical datasets set the stage for a positive feedback loop that improves model training across the machine learning life cycle.
According to AI pioneer and technology entrepreneur Andrew Ng, systematically optimizing the ML life cycle with data-centric tools and processes yields superior model performance. Thus, developing MLOps tools that support a data-centric approach is essential to significant biomedical advancements.
- Mirza B., et al. Machine Learning and Integrative Analysis of Biomedical Big Data. Genes 10(2):87 (2019).
- Sagar R. Big Data To Good Data: Andrew Ng Urges ML Community To Be More Data-Centric And Less Model-Centric. Analytics India Magazine (2021).