Getting Started with Data-Centric AI Development: Tips from Andrew Ng

“Improving the data is not a preprocessing step that you do once. It’s part of the iterative process of model development.” – Andrew Ng

Machine learning is now used to make sense of large datasets in nearly every industry. Data scientists typically focus on developing, deploying, and improving ML models to answer research questions. But there is more to MLOps than just code; every machine learning model is only as good as the data it is trained on. This is the cornerstone of the data-centric AI movement pioneered by technologist and entrepreneur Andrew Ng. In a keynote presentation for DeepLearning.AI, Ng discussed the evolution of the movement and offered practical tips for data teams that want to adopt a data-centric approach.

Prioritize data over code

Ng has been a long-time advocate of the ‘good data’ campaign. “...The code is a solved problem for many applications,” he said, referring to the focus on ML code and model development over the past few years. He argued that flipping the existing paradigm, that is, holding the code fixed and iterating on the data rather than the other way around, will be the more fruitful approach going forward.

According to Ng, the key value of a data-centric approach is that it allows people across a variety of industries to customize the data they generate, and enables people without ML expertise to reap the benefits of ML. This is done by feeding high-quality training data into an open-source model, which opens up opportunities in settings where ML has not traditionally been used because usable data or expertise was lacking. For instance, the biopharmaceutical industry increasingly relies on the vast troves of biomedical molecular data available in publications and controlled repositories, in addition to data generated in-house, for drug development. The significant hurdle here is that, although data is available, much of it is not usable as is. A data-centric approach would empower scientists in pharmaceutical companies to make this data usable and accelerate drug discovery.

The need for a framework

“Many data scientists have their own ways to clean data but what we don’t have is a systematic mental framework for doing it,” said Ng. Typically, there is considerable variability in the way data is labeled. A robust framework would make the process more consistent, thereby improving outcomes. Setting down some guiding principles will also lay the foundation for the development of tools that can be systematically employed by data teams.

Implementation tips

Working with semi-structured or unstructured data is arduous. For teams that deal with this kind of data and want to adopt a data-centric approach, Ng listed a few pointers:

Make the labels y consistent: ideally, a deterministic (non-random) function maps each input x to its label y.
Use multiple labelers to spot inconsistencies, then pick a reasonable standard to avoid further ambiguity.
Clarify labeling instructions by tracking down ambiguous or inconsistent examples, deciding how they should be labeled, and documenting that decision in the labeling instructions.
Toss out noisy examples. More data does not always mean better data.
Use error analysis to identify and focus on the subset of data that needs to be improved.
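
As a rough illustration of the multiple-labeler tip, the sketch below flags examples on which labelers disagree so they can be reviewed and used to tighten the labeling instructions. It is not taken from Ng's talk: the function name `find_inconsistent_labels`, the `min_agreement` threshold, and the sample data are all hypothetical, and a simple majority-vote agreement score stands in for whatever consistency measure a team actually prefers.

```python
from collections import Counter


def find_inconsistent_labels(annotations, min_agreement=1.0):
    """Return examples whose labelers agree less than `min_agreement`.

    `annotations` maps an example id to the list of labels assigned by
    different labelers (a hypothetical input format for this sketch).
    """
    flagged = {}
    for example_id, labels in annotations.items():
        if not labels:
            continue  # nothing to compare for unlabeled examples
        # Most frequent label and the fraction of labelers who chose it.
        majority_label, majority_count = Counter(labels).most_common(1)[0]
        agreement = majority_count / len(labels)
        if agreement < min_agreement:
            flagged[example_id] = {
                "labels": labels,
                "majority": majority_label,
                "agreement": agreement,
            }
    return flagged


# Hypothetical annotations from three labelers.
annotations = {
    "img_001": ["defect", "defect", "defect"],
    "img_002": ["defect", "ok", "defect"],
}
flagged = find_inconsistent_labels(annotations)
for example_id, info in flagged.items():
    print(example_id, info["labels"], f"agreement={info['agreement']:.2f}")
```

Flagged examples are candidates for review: once the team agrees on how each should be labeled, that decision can be written into the labeling instructions, or the example can be dropped if it is simply too noisy.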

Ng emphasized that ML is an empirical and iterative process. The ideal data-centric ML workflow, in his opinion, starts with training a model, followed by error analysis to define the next steps, then improving the data (rather than modifying the code) and retraining the model. “Improving the data is not a preprocessing step that you do once. It’s part of the iterative process of model development,” he concluded.
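
The error-analysis step in this loop can be sketched as a per-slice error rate: group evaluation examples by some tag and see where the model fails most often, which suggests where data improvement should go next. This is a minimal sketch under stated assumptions rather than a method described in the talk; the record fields (`slice`, `label`, `prediction`), the function name `error_rate_by_slice`, and the sample records are hypothetical.

```python
from collections import defaultdict


def error_rate_by_slice(records):
    """Rank data slices by error rate, worst first.

    Each record is a dict with hypothetical keys "slice", "label",
    and "prediction".
    """
    totals = defaultdict(int)
    errors = defaultdict(int)
    for record in records:
        totals[record["slice"]] += 1
        if record["prediction"] != record["label"]:
            errors[record["slice"]] += 1
    rates = {name: errors[name] / totals[name] for name in totals}
    # Highest error rate first, so the weakest slice is at the top.
    return sorted(rates.items(), key=lambda item: item[1], reverse=True)


# Hypothetical evaluation records from a trained model.
records = [
    {"slice": "clean", "label": "a", "prediction": "a"},
    {"slice": "clean", "label": "b", "prediction": "b"},
    {"slice": "clean", "label": "a", "prediction": "b"},
    {"slice": "clean", "label": "b", "prediction": "b"},
    {"slice": "noisy", "label": "a", "prediction": "b"},
    {"slice": "noisy", "label": "b", "prediction": "a"},
]
for name, rate in error_rate_by_slice(records):
    print(f"{name}: {rate:.0%} error")
```

In a data-centric iteration, the slice at the top of this ranking is where relabeling, cleaning, or collecting more examples would be tried first, with the model code held fixed and then retrained.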

info@elucidata.io
