“Data is the new oil. It’s valuable, but if unrefined it cannot really be used. It has to be changed into gas, plastic, chemicals, etc to create a valuable entity that drives profitable activity; so must data be broken down, analyzed for it to have value.” — Clive Humby, 2006
Data Centricity to increase the accuracy of ML predictions: A paradigm shift
The traditional approach to machine learning (ML) was to curate the data to a machine-readable level, train a model, and then fine-tune the model to improve the accuracy of the results. Andrew Ng, a familiar name among ML enthusiasts, spearheaded the data-centric AI movement in 2019, stressing the need to shift from a model-centric approach to a data-centric approach. According to him, data centricity is a mindset as much as it is a technical architecture. It acknowledges data’s valuable and versatile role in the ML pipeline. In contrast to the model-centric approach, a data-centric architecture is one where data exists independently of a singular application and can empower a broad range of stakeholders. This creates greater opportunities to accelerate digital transformation: data can be more versatile, integrative, and available to those who need it.
Though data processing is of paramount importance in machine learning, it is often treated as a preliminary step to be carried out before working on the ML algorithm. The focus is mostly on making the available data more machine-actionable. As a result, hundreds of hours are wasted on tuning a model built on low-quality data. This is one of the main reasons why the accuracy of a model can be significantly lower than expected, and no amount of efficient model tuning will fix it.
“With the increasing power and availability of machine learning models, gains from model improvements have become marginal.” — Hazy Research, Stanford
Therefore, an improvement in current data practices is of paramount importance in building reliable machine learning products.
So, how can a data-centric approach improve predictive outcomes?
Once the realization sinks in that we need to focus on data as much as, or maybe more than, on the model, we come to the various aspects needed in a data-centric approach. In this blog, we will consider biomedical data as the domain where ML is applied.
In a data-centric approach, the data is viewed through the lens of a domain expert as well as that of the ML expert. The ML expert works on curated data that is in machine-readable/actionable form and uses it as training data to train the model to perform a specific function. The model-centric approach then tries to improve the accuracy of prediction by improving the algorithm, whereas the data-centric approach improves the training data by revisiting data quality from the point of view of the domain expert, so that the same model produces more accurate results. But what if our highly annotated dataset does not account for real-world variance?
The answer lies in how data curation is perceived.
Data curation is the work of organizing and managing a collection of datasets to meet the needs and interests of a specific group of people. It means different things to people from different domains. For an ML engineer, curated data is relevant data arranged in a specific form, following a specific format that is machine-readable/actionable. For a domain expert, however, the quality of curated data depends on very different aspects of the data (discussed below).
If the data taken for training an algorithm is biased, it can lead to inaccurate predictions even if the model is highly advanced. This is the gap that a data-centric approach bridges. For example, suppose we are trying to predict a disease based on a particular gene expression, and the training dataset contains accurately annotated data from only 10 females and 90 males. Iterating on the model cannot guarantee better performance, because the imbalance lies in the data itself. If we instead iterate on the data and correct this gender bias, we are likely to get a better outcome.
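To make the gender-bias example concrete, here is a minimal sketch of how such an imbalance could be detected in sample metadata and partly compensated for. The sample records, the "sex" field name, and both helper functions are hypothetical and serve only as illustration. The weighting scheme shown (inverse group frequency) is one common option and not necessarily the method used in any particular curation pipeline.

```python
from collections import Counter


def group_counts(samples, key="sex"):
    """Count how many samples fall into each group (here: each sex)."""
    return Counter(sample[key] for sample in samples)


def balancing_weights(samples, key="sex"):
    """Inverse-frequency weight per group.

    With these weights, every group contributes the same total weight
    to training, regardless of how many samples it contains.
    """
    counts = group_counts(samples, key)
    total = sum(counts.values())
    return {group: total / (len(counts) * n) for group, n in counts.items()}


# Hypothetical metadata mirroring the example: 10 female and 90 male samples.
samples = [
    {"sample_id": f"S{i:03d}", "sex": "female" if i < 10 else "male"}
    for i in range(100)
]

counts = group_counts(samples)
weights = balancing_weights(samples)

print("Samples per group:", dict(counts))
print("Suggested per-sample weights:", weights)
```

Reweighting only compensates for the imbalance during training. It does not add new biological information, so collecting or curating more samples from the under-represented group remains the more thorough data-centric fix.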
Data-centric approach invokes critical thinking about data curation in terms of
At Elucidata, we believe that data curation is the core of the data-centric approach
We have an expert team of bioinformaticians, biologists, data curators, data scientists, and ML engineers who can look at biomedical data holistically to help our clients with their specific research needs and build custom curation pipelines to fast-track their research. With the understanding that high-quality, machine-actionable data is central to biomedical research, we host more than 1 million machine-actionable biomedical datasets on Polly, our cloud platform, with the number growing exponentially each quarter. In biomedical research and drug discovery, every second counts. Armed with scientific expertise and over 1 million ML-ready datasets, we are ready to partner with stakeholders from the biomedical industry to help them reach their goals faster.
Elucidata is well equipped to help you accelerate your biomedical research! To know more about our resources and services, write to us at info@elucidata.io
References:
All You Need To Know About Data Curation | iunera