“Improving the data is not a preprocessing step that you do once. It’s part of the iterative process of model development.” – Andrew Ng
In today’s data-driven landscape, leveraging machine learning to make sense of large datasets has become integral to nearly every industry. A typical data scientist focuses on developing, deploying and improving ML models to answer research questions. But there is more to MLOps than code: every machine learning model is only as good as its data. This is the cornerstone of the Data-centric AI movement pioneered by technologist and entrepreneur Andrew Ng. In a keynote presentation for DeepLearning.AI, Ng discussed the evolution of the movement and offered practical tips for data teams looking to implement data-centricity.
Prioritize data over code
Ng has been a long-time advocate of the ‘good data’ campaign. “…The code is a solved problem for many applications,” he said, in reference to the focus on ML code and model development over the past few years. He posited that flipping the existing paradigm, holding the code fixed and iterating on the data instead of the other way around, will be a more fruitful and beneficial approach going forward.
According to Ng, the key value of a data-centric approach is that it allows people across a variety of industries to customize the data they generate, enabling a person without ML expertise to reap the benefits of ML. This is done by feeding high-quality training data into an open-source model, unlocking a plethora of novel opportunities in settings where ML has not traditionally been utilized due to a lack of usable data or expertise. For instance, the biopharmaceutical industry has increasingly started to rely on the vast troves of biomedical molecular data available in publications and controlled repositories, in addition to data generated in-house, for drug development. A significant hurdle here, despite data availability, is the lack of usable data. A data-centric approach would empower scientists in pharmaceutical companies to unlock the power of that data to accelerate drug discovery.
The need for a framework
“Many data scientists have their own ways to clean data but what we don’t have is a systematic mental framework for doing it,” said Ng. Typically, there is considerable variability in the way data is labeled. A robust framework would make the process more consistent, thereby improving outcomes. Setting down some guiding principles will also lay the foundation for the development of tools that can be systematically employed by data teams.
Working with semi/unstructured data is an arduous endeavor. For teams dealing with this kind of data and aspiring to adopt data-centricity, Ng listed a few pointers:
- Make labels y consistent: the mapping from inputs x to labels y should be a deterministic function, not a random one.
- Use multiple labelers to spot inconsistencies and pick a reasonable standard to avoid further ambiguities.
- Clarify labeling instructions: track down ambiguous or inconsistent examples, define how they should be labeled, and document those decisions in the labeling instructions.
- Toss out noisy examples. More data does not always equate to better data.
- Use error analysis to identify and focus on the subset of data that has to be improved.
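The multiple-labeler and noisy-example pointers above can be sketched in a few lines of Python. This is an illustrative example, not code from Ng's talk; the function name, the agreement threshold, and the toy labels are all assumptions:

```python
from collections import Counter

def label_by_majority(labels_per_example, min_agreement=1.0):
    """Assign each example its majority label; flag examples whose
    labelers disagree so instructions can be clarified, or the
    example tossed out if it is genuinely too noisy to keep."""
    consistent, needs_review = [], []
    for example_id, labels in labels_per_example.items():
        counts = Counter(labels)
        label, votes = counts.most_common(1)[0]
        agreement = votes / len(labels)
        if agreement >= min_agreement:
            consistent.append((example_id, label))
        else:
            # Inconsistent labels: candidate for clarified instructions
            # or removal, per the pointers above.
            needs_review.append((example_id, dict(counts)))
    return consistent, needs_review

labels = {
    "img_01": ["cat", "cat", "cat"],  # all three labelers agree
    "img_02": ["cat", "dog", "cat"],  # disagreement -> review
}
keep, review = label_by_majority(labels)
```

With a strict threshold of full agreement, `img_01` is kept with the label `cat`, while `img_02` lands in the review queue with its vote counts, ready for the ambiguity to be resolved and documented.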
Ng emphasized that ML is an empirical and iterative process. The ideal data-centric ML workflow, in his opinion, starts with training a model, followed by error analysis to define next steps, improving the data (rather than modifying the code) and retraining the model. “Improving the data is not a preprocessing step that you do once. It’s part of the iterative process of model development,” he concluded.
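That train → analyze → improve-data → retrain loop can be expressed as a minimal sketch; `train`, `error_analysis`, and `improve_data` here are hypothetical stand-ins for whatever a team's own stack provides:

```python
def data_centric_loop(data, train, error_analysis, improve_data, max_rounds=5):
    """Hold the code (the model and training recipe) fixed and
    iterate on the data: train, analyze errors, fix the data, retrain."""
    model = None
    for _ in range(max_rounds):
        model = train(data)                  # the code stays fixed
        problem_slices = error_analysis(model, data)
        if not problem_slices:               # no problem data left; stop
            break
        # Improve the data (relabel, clean, augment), not the code.
        data = improve_data(data, problem_slices)
    return model, data

# Toy usage: "errors" are negative values, "improving" flips their sign.
train = lambda d: sum(d)
error_analysis = lambda m, d: [i for i, x in enumerate(d) if x < 0]
improve_data = lambda d, idxs: [abs(x) for x in d]
model, data = data_centric_loop([1, -2, 3], train, error_analysis, improve_data)
```

The toy run converges in two rounds: the first pass flags the negative example, the data is cleaned, and the retrained "model" reflects the improved dataset.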