The advent of the digital era, advances in data sharing, and high-throughput technologies have broadened the scope of data usage, data sharing, and collaborative research. Different stakeholders (data generators, data custodians, data managers, and data consumers) are involved in the data lifecycle. Data has become an independent asset, and its stakeholders may be spread across different geographies, subject domains, etc. It is paramount to ensure data quality and to preserve data integrity as data moves through the stages of its lifecycle. Hence, at Elucidata, we have set up a system to continuously and critically assess data and metadata quality, so that our data keeps the promise of reliability and interoperability we make to our customers.
Data quality traits can be categorized into two groups based on whether they are inherent to the data (intrinsic) or not (extrinsic). Ensuring intrinsic data quality (e.g., removing bias in the data, capturing an optimal number of data points in an experiment) is the mandate of data generators, and it can mostly only be improved at the source. Extrinsic data quality (e.g., ensuring the correctness of metadata fields and accurate annotations) generally depends on data custodians and data managers and can be improved through data curation.
Data quality is partly subjective and not always absolutely quantifiable. Voluminous data ingestion tends to accumulate errors that creep in through variation in how experimenters fill in metadata details or how each curator processes certain information. Therefore, a continuous monitoring and correction system is needed to ensure the highest level of extrinsic data quality and a streamlined ingestion and curation process.
Extrinsic data quality assurance needs careful consideration in several aspects, such as:
1. Standardization - deals with conformance of field names and values to ontologies and controlled vocabularies as well as the formats specified for those fields. It enables data to be more searchable and findable.
2. Accuracy of information - pertains to the correctness and plausibility of the data. It builds trust in the data.
3. Integrity - highlights the truthfulness and concordance of data.
4. Breadth - pertains to ensuring sufficient curated fields for a user to understand and use the data for analysis. This enables data to be more accessible to the user.
5. Completeness - primarily deals with eliminating missing data points. It lets the user work with all the available data rather than lose information to incompleteness.
At Elucidata, we have devised a data quality assessment approach to ensure that all these aspects are taken care of and that the data is FAIR (Findable, Accessible, Interoperable, and Reusable) before it reaches the consumable stage on our data-centric ML Ops platform, Polly.
The process has two main parts:
1. The Validation Layer - understanding the data issues and creating the rulesets needed for the computational program to highlight errors in the data
2. The Correction Layer - correcting the errors that were found in the validation layer
The validation and correction layers are each further divided into processes that can be handled computationally and those that need manual intervention. The validation layer gives us a typical list of errors, ranging from schema issues to contextual anomalies. Though individually trivial, these errors accumulate and magnify, making it difficult to access, query, and integrate data across the system. To improve data quality, we needed to correct them. Adopting a two-pronged approach, our team separates errors that can be handled systematically from those that need human expertise.
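The computational side of the validation layer can be pictured as a rule engine run over each metadata record. The sketch below is illustrative only: the field names, controlled vocabularies, and rules are hypothetical, not Elucidata's actual rulesets.

```python
# Minimal sketch of a rule-based validation pass over metadata records.
# CONTROLLED_VOCAB and the field names are made-up examples.
CONTROLLED_VOCAB = {
    "organism": {"Homo sapiens", "Mus musculus"},
    "disease": {"breast cancer", "melanoma", "healthy"},
}

def validate_record(record):
    """Return a list of (field, message) errors for one metadata record."""
    errors = []
    # Schema check: required fields must be present.
    for field in CONTROLLED_VOCAB:
        if field not in record:
            errors.append((field, "missing required field"))
    # Vocabulary check: values must come from the controlled vocabulary.
    for field, allowed in CONTROLLED_VOCAB.items():
        value = record.get(field)
        if value is not None and value not in allowed:
            errors.append((field, f"value {value!r} not in controlled vocabulary"))
    return errors

# A casing inconsistency -- exactly the kind of error that creeps in
# when experimenters fill metadata fields by hand.
print(validate_record({"organism": "homo sapiens", "disease": "melanoma"}))
```

Errors flagged this way can then be routed either to an automatic fix (e.g., a known synonym or casing mapping) or to a manual curation queue.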
Guided by the expert-curated rulesets and guidelines, error correction went more smoothly than expected. Each dataset had multiple sample labels, and each sample had multiple descriptive labels. Across Polly, all labels for seven repositories (TCGA, GEO, cBioPortal, DepMap, ImmPort, GDC, CPTAC) were assessed at the dataset level (~9.9 million labels) and sample level (~34.87 million labels); 8% of labels were erroneous. Of those, we corrected 99%. We were able to automate the curation of more than 94% of the errors, with the remainder handled by manual curators.
Given the sheer volume of datasets (~400k) at our disposal, error collection using simple iteration over an initial run of six OmixAtlases took an unreasonable ~7 hours to execute. To address this, we implemented a multiprocessing system on Polly and lowered the execution time to under 4 minutes, demonstrating the computational efficiency of the system we have developed.
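The parallelization pattern is simple: map a per-dataset check over the dataset ids with a worker pool and flatten the results. The sketch below is a hypothetical stand-in, not Polly's actual code; for portability it uses `multiprocessing.dummy` (a thread-backed pool with the same API), whereas a CPU-bound workload like ours would swap in `multiprocessing.Pool`.

```python
# Sketch of parallel error collection across datasets.
# `check_dataset` is a hypothetical placeholder for the real validation.
from multiprocessing.dummy import Pool  # thread-backed; same API as multiprocessing.Pool

def check_dataset(dataset_id):
    # Placeholder rule: flag odd-numbered dataset ids with a label error.
    return [(dataset_id, "bad label")] if dataset_id % 2 else []

def collect_errors(dataset_ids, workers=8):
    with Pool(workers) as pool:
        per_dataset = pool.map(check_dataset, dataset_ids)
    # Flatten per-dataset error lists into one list of errors.
    return [err for errs in per_dataset for err in errs]

errors = collect_errors(range(1000))
print(len(errors))  # 500 odd-numbered datasets flagged
```

Because each dataset is checked independently, the work is embarrassingly parallel, which is why the wall-clock time drops roughly in proportion to the number of workers.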
Additionally, to apply more stringent checks on ingested data covering standardization, accuracy, and integrity, we use the ‘pydantic’ library, which allows us to perform schema checks, field-specific checks, and logical checks across multiple fields of the dataset metadata.
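As an illustration of the three kinds of checks, here is a minimal pydantic model (assuming pydantic v2's validator API). The schema and field names are invented for the example and are not Polly's actual metadata model: the model itself is the schema check, `field_validator` enforces a controlled vocabulary on one field, and `model_validator` applies a logical check across fields.

```python
# Hedged sketch of pydantic-based metadata validation (pydantic v2 API).
from pydantic import BaseModel, ValidationError, field_validator, model_validator

class SampleMetadata(BaseModel):
    organism: str
    disease: str
    sample_type: str  # e.g., "tumor" or "normal"

    @field_validator("organism")
    @classmethod
    def organism_in_vocabulary(cls, v: str) -> str:
        # Field-specific check against a (hypothetical) controlled vocabulary.
        if v not in {"Homo sapiens", "Mus musculus"}:
            raise ValueError("organism not in controlled vocabulary")
        return v

    @model_validator(mode="after")
    def healthy_samples_are_normal(self):
        # Logical check across fields: a healthy sample cannot be a tumor.
        if self.disease == "healthy" and self.sample_type != "normal":
            raise ValueError("healthy samples must have sample_type 'normal'")
        return self

try:
    SampleMetadata(organism="Homo sapiens", disease="healthy", sample_type="tumor")
except ValidationError as e:
    print("caught:", e.errors()[0]["msg"])
```

Validation failures raise `ValidationError` with structured error details, which makes it straightforward to feed them into an error-collection pipeline like the one described above.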
The quality validations described above are embedded into the system by packaging the validation and correction algorithms into our ingestion libraries. Any data that comes into Polly has to first pass the quality assessment mechanism. Additionally, for the rare event of ambiguous data evading the automatic surveillance and finding its way into Polly, we have set up a manual validation layer with expert human auditors continuously inspecting the quality of data in Polly.
In conclusion, we hold the promise of “richly curated biomedical molecular data” close to our hearts and make it richer every passing day. Connect with us to learn more about the 1.5M highly curated, ML-ready biomolecular datasets on our Polly platform!
Polly provides access to a curated repository of RNA-seq datasets that are consistently processed and enriched with metadata. This harmonization allows researchers to efficiently search for datasets with similar transcriptional profiles, facilitating transcriptome profiling and biomarker identification.
Polly utilizes signature reversal and multivariate gene expression signatures to predict potential drug combinations. By analyzing publicly available transcriptomics data and drug signatures, Polly can identify drugs or compounds that may have therapeutic effects by reversing disease signatures.
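One simple way to operationalize signature reversal is a sign-based score: a drug candidate scores highly when its expression changes oppose the disease's. The function and the gene values below are an illustrative toy, not Polly's actual scoring method.

```python
# Toy sign-based reversal score: fraction of shared genes that the drug
# moves in the opposite direction to the disease. Values are made up.
def reversal_score(disease_sig, drug_sig):
    """disease_sig/drug_sig map gene -> log-fold change; higher score = stronger reversal."""
    shared = disease_sig.keys() & drug_sig.keys()
    if not shared:
        return 0.0
    opposed = sum(1 for g in shared if disease_sig[g] * drug_sig[g] < 0)
    return opposed / len(shared)

disease = {"TP53": -1.2, "MYC": 2.1, "EGFR": 1.5}   # changes in disease vs. healthy
drug    = {"TP53": 0.8,  "MYC": -1.0, "EGFR": 0.4}  # changes under drug treatment
print(reversal_score(disease, drug))  # 2 of 3 genes reversed
```

Real connectivity-style methods weight genes by effect size and rank rather than a bare sign, but the intuition is the same: candidate compounds are ranked by how thoroughly they flip the disease signature.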
Polly ranks similar datasets using cosine similarity scores, which measure how closely a dataset's transcriptional profile matches the query signature. This helps researchers quickly find relevant datasets for further analysis and validation.
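Cosine similarity measures the angle between two expression vectors, so datasets are ranked by directional agreement with the query regardless of overall magnitude. A minimal sketch, with toy three-gene profiles and made-up dataset names standing in for real Polly signatures:

```python
# Rank hypothetical datasets against a query signature by cosine similarity.
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

query = [1.0, 0.5, -0.8]  # toy query signature over three genes
datasets = {
    "GSE_A": [0.9, 0.4, -0.7],   # nearly parallel to the query
    "GSE_B": [-1.0, -0.5, 0.8],  # points the opposite way
    "GSE_C": [0.1, 1.2, 0.3],    # mostly unrelated
}
ranked = sorted(datasets, key=lambda d: cosine(query, datasets[d]), reverse=True)
print(ranked)  # GSE_A ranks first
```

A score near +1 means a closely matching transcriptional profile, near 0 means unrelated, and near -1 means an opposing profile, which is exactly the quantity the reversal analysis above looks for.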
Researchers define the biological process of interest, select a dataset, preprocess the data, identify differentially expressed genes, and validate the signature. Polly’s platform streamlines this process with expert support and ML-ready datasets.
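The "identify differentially expressed genes" step can be sketched as ranking genes by fold change between two sample groups. The values and gene names below are toy data, and real pipelines would add statistical testing and multiple-testing correction:

```python
# Toy sketch of the differential-expression step: rank genes by the
# magnitude of mean log2 fold change between control and treated groups.
import math

def log2_fold_changes(control, treated):
    """Per-gene log2(mean treated / mean control); expression values must be > 0."""
    lfc = {}
    for gene in control:
        c = sum(control[gene]) / len(control[gene])
        t = sum(treated[gene]) / len(treated[gene])
        lfc[gene] = math.log2(t / c)
    return lfc

control = {"MYC": [10, 12, 11], "TP53": [20, 19, 21], "GAPDH": [50, 52, 48]}
treated = {"MYC": [40, 44, 42], "TP53": [10, 9, 11], "GAPDH": [51, 49, 50]}

lfc = log2_fold_changes(control, treated)
# The signature is the genes ordered by how strongly they change.
signature = sorted(lfc, key=lambda g: abs(lfc[g]), reverse=True)
print(signature)  # MYC (up) and TP53 (down) lead; GAPDH is unchanged
```

The resulting signed gene list is what gets validated and then compared against other datasets in the steps above.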
Polly's RNA-Seq Atlas addresses the challenge of extracting associated signatures from public databases by providing a curated resource of RNA-seq datasets collected from the Gene Expression Omnibus (GEO). This richly curated resource helps researchers find datasets with transcriptional profiles similar to their gene sets of interest.
Gene signature comparison analyzes gene expression patterns to identify disease-related signatures. It helps researchers find drugs that can reverse disease signatures, aiding in therapeutic discoveries.