The advent of the digital era, advances in data sharing, and high-throughput technologies have broadened the scope of data usage, data sharing, and collaborative research. Different stakeholders (data generators, data custodians, data managers, and data consumers) are involved in the data lifecycle. Data has become an independent asset, and its stakeholders may be spread across different geographies, subject domains, etc. It is paramount to ensure data quality and to preserve data integrity as data moves through the stages of its lifecycle. Hence, at Elucidata, we have set up a system to continuously and critically assess data and metadata quality, so that our data keeps the promise of reliability and interoperability we make to our customers.
Data quality traits can be categorized into two groups based on whether they are inherent to the data (intrinsic) or not (extrinsic). Ensuring intrinsic data quality (e.g., removing bias in the data, capturing an optimal number of data points in an experiment) is the mandate of data generators, and it can mostly only be improved at the source. Extrinsic data quality (e.g., ensuring the correctness of metadata fields and accurate annotations) generally depends on data custodians and data managers and can be improved through data curation.
Data quality is partly subjective and not always absolutely quantifiable. Voluminous data ingestion tends to accumulate errors that creep in through variation in how experimenters fill in metadata details or how each curator processes certain information. Therefore, a continuous monitoring and correction system is needed to ensure the highest level of extrinsic data quality and a streamlined ingestion and curation process.
Extrinsic data quality assurance needs careful consideration in several aspects, such as:
1. Standardization - deals with conformance of field names and values to ontologies and controlled vocabularies as well as the formats specified for those fields. It enables data to be more searchable and findable.
2. Accuracy of information - pertains to the correctness and plausibility of the data. It builds trust in the data.
3. Integrity - highlights the truthfulness and concordance of data.
4. Breadth - pertains to ensuring sufficient curated fields for a user to understand and use the data for analysis. This enables data to be more accessible to the user.
5. Completeness - primarily deals with eliminating missing data points. It lets the user work with all the available data rather than lose information to incompleteness.
At Elucidata, we have devised a data quality assessment approach to ensure that all these aspects are taken care of and that the data is FAIR (Findable, Accessible, Interoperable, and Reusable) before it reaches the consumable stage on our data-centric ML Ops platform, Polly.
The process has two main parts:
1. The Validation Layer - understanding the data issues and creating the rulesets needed for the computational program to highlight errors in the data
2. The Correction Layer - correcting the errors that were found in the validation layer
The validation and correction layers are each further divided into processes that can be handled computationally and those that need manual intervention. The validation layer gives us a typical list of errors, ranging from schema issues to contextual anomalies. Though individually trivial, these errors accumulate and magnify, making it difficult to access, query, and integrate data across the system. To improve data quality, we needed to correct them. Adopting a two-pronged approach, our team separates errors that can be handled systematically from those that need human expertise.
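The computational side of the validation layer can be pictured as a rule engine run over each metadata record. The sketch below is illustrative only: the field names, controlled vocabularies, and rules are hypothetical, not Elucidata's actual rulesets.

```python
# Minimal sketch of a rule-based validation pass over metadata records.
# CONTROLLED_VOCAB and the field names are made-up examples.
CONTROLLED_VOCAB = {
    "organism": {"Homo sapiens", "Mus musculus"},
    "disease": {"breast cancer", "melanoma", "healthy"},
}

def validate_record(record):
    """Return a list of (field, message) errors for one metadata record."""
    errors = []
    # Schema check: required fields must be present.
    for field in CONTROLLED_VOCAB:
        if field not in record:
            errors.append((field, "missing required field"))
    # Vocabulary check: values must come from the controlled vocabulary.
    for field, allowed in CONTROLLED_VOCAB.items():
        value = record.get(field)
        if value is not None and value not in allowed:
            errors.append((field, f"value {value!r} not in controlled vocabulary"))
    return errors

# A casing inconsistency -- exactly the kind of error that creeps in
# when experimenters fill metadata fields by hand.
print(validate_record({"organism": "homo sapiens", "disease": "melanoma"}))
```

Errors flagged this way can then be routed either to an automatic fix (e.g., a known synonym or casing mapping) or to a manual curation queue.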
Guided by the expert-curated rulesets and guidelines, error correction went more smoothly than expected. Each dataset had multiple sample labels, and each sample had multiple descriptive labels. Across Polly, all labels for seven repositories (TCGA, GEO, cBioPortal, DepMap, ImmPort, GDC, CPTAC) were assessed at the dataset level (~9.9 million labels) and sample level (~34.87 million labels); 8% of labels were erroneous. Of those, we corrected 99%. We were able to automate the curation of more than 94% of the errors, with the remainder handled by manual curators.
Given the sheer volume of datasets (~400k) at our disposal, error collection using simple iteration over an initial run of six OmixAtlases took an unreasonable ~7 hours to execute. To address this, we implemented a multiprocessing system on Polly and lowered the execution time to under 4 minutes, demonstrating the computational efficiency of the system we have developed.
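The parallelization pattern is simple: map a per-dataset check over the dataset ids with a worker pool and flatten the results. The sketch below is a hypothetical stand-in, not Polly's actual code; for portability it uses `multiprocessing.dummy` (a thread-backed pool with the same API), whereas a CPU-bound workload like ours would swap in `multiprocessing.Pool`.

```python
# Sketch of parallel error collection across datasets.
# `check_dataset` is a hypothetical placeholder for the real validation.
from multiprocessing.dummy import Pool  # thread-backed; same API as multiprocessing.Pool

def check_dataset(dataset_id):
    # Placeholder rule: flag odd-numbered dataset ids with a label error.
    return [(dataset_id, "bad label")] if dataset_id % 2 else []

def collect_errors(dataset_ids, workers=8):
    with Pool(workers) as pool:
        per_dataset = pool.map(check_dataset, dataset_ids)
    # Flatten per-dataset error lists into one list of errors.
    return [err for errs in per_dataset for err in errs]

errors = collect_errors(range(1000))
print(len(errors))  # 500 odd-numbered datasets flagged
```

Because each dataset is checked independently, the work is embarrassingly parallel, which is why the wall-clock time drops roughly in proportion to the number of workers.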
Additionally, to apply more stringent checks on ingested data covering standardization, accuracy, and integrity, we use the ‘pydantic’ library, which allows us to perform schema checks, field-specific checks, and logical checks across multiple fields of the dataset metadata.
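As an illustration of the three kinds of checks, here is a minimal pydantic model (assuming pydantic v2's validator API). The schema and field names are invented for the example and are not Polly's actual metadata model: the model itself is the schema check, `field_validator` enforces a controlled vocabulary on one field, and `model_validator` applies a logical check across fields.

```python
# Hedged sketch of pydantic-based metadata validation (pydantic v2 API).
from pydantic import BaseModel, ValidationError, field_validator, model_validator

class SampleMetadata(BaseModel):
    organism: str
    disease: str
    sample_type: str  # e.g., "tumor" or "normal"

    @field_validator("organism")
    @classmethod
    def organism_in_vocabulary(cls, v: str) -> str:
        # Field-specific check against a (hypothetical) controlled vocabulary.
        if v not in {"Homo sapiens", "Mus musculus"}:
            raise ValueError("organism not in controlled vocabulary")
        return v

    @model_validator(mode="after")
    def healthy_samples_are_normal(self):
        # Logical check across fields: a healthy sample cannot be a tumor.
        if self.disease == "healthy" and self.sample_type != "normal":
            raise ValueError("healthy samples must have sample_type 'normal'")
        return self

try:
    SampleMetadata(organism="Homo sapiens", disease="healthy", sample_type="tumor")
except ValidationError as e:
    print("caught:", e.errors()[0]["msg"])
```

Validation failures raise `ValidationError` with structured error details, which makes it straightforward to feed them into an error-collection pipeline like the one described above.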
The quality validations described above are embedded into the system by packaging the validation and correction algorithms into our ingestion libraries. Any data that comes into Polly has to first pass the quality assessment mechanism. Additionally, for the rare event of ambiguous data evading the automatic surveillance and finding its way into Polly, we have set up a manual validation layer with expert human auditors continuously inspecting the quality of data in Polly.
In conclusion, we hold the promise of “richly curated biomedical molecular data” close to our hearts and make it richer every passing day. Connect with us to learn more about the 1.5M highly curated, ML-ready biomolecular datasets on our Polly platform!
Polly provides access to a curated repository of RNA-seq datasets that are consistently processed and enriched with metadata. This harmonization allows researchers to efficiently search for datasets with similar transcriptional profiles, facilitating transcriptome profiling and biomarker identification.
Polly utilizes signature reversal and multivariate gene expression signatures to predict potential drug combinations. By analyzing publicly available transcriptomics data and drug signatures, Polly can identify drugs or compounds that may have therapeutic effects by reversing disease signatures.
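One simple way to operationalize signature reversal is a sign-based score: a drug candidate scores highly when its expression changes oppose the disease's. The function and the gene values below are an illustrative toy, not Polly's actual scoring method.

```python
# Toy sign-based reversal score: fraction of shared genes that the drug
# moves in the opposite direction to the disease. Values are made up.
def reversal_score(disease_sig, drug_sig):
    """disease_sig/drug_sig map gene -> log-fold change; higher score = stronger reversal."""
    shared = disease_sig.keys() & drug_sig.keys()
    if not shared:
        return 0.0
    opposed = sum(1 for g in shared if disease_sig[g] * drug_sig[g] < 0)
    return opposed / len(shared)

disease = {"TP53": -1.2, "MYC": 2.1, "EGFR": 1.5}   # changes in disease vs. healthy
drug    = {"TP53": 0.8,  "MYC": -1.0, "EGFR": 0.4}  # changes under drug treatment
print(reversal_score(disease, drug))  # 2 of 3 genes reversed
```

Real connectivity-style methods weight genes by effect size and rank rather than a bare sign, but the intuition is the same: candidate compounds are ranked by how thoroughly they flip the disease signature.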
Polly ranks similar datasets using cosine similarity scores, which measure how closely a dataset's transcriptional profile matches the query signature. This helps researchers quickly find relevant datasets for further analysis and validation.
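Cosine similarity measures the angle between two expression vectors, so datasets are ranked by directional agreement with the query regardless of overall magnitude. A minimal sketch, with toy three-gene profiles and made-up dataset names standing in for real Polly signatures:

```python
# Rank hypothetical datasets against a query signature by cosine similarity.
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

query = [1.0, 0.5, -0.8]  # toy query signature over three genes
datasets = {
    "GSE_A": [0.9, 0.4, -0.7],   # nearly parallel to the query
    "GSE_B": [-1.0, -0.5, 0.8],  # points the opposite way
    "GSE_C": [0.1, 1.2, 0.3],    # mostly unrelated
}
ranked = sorted(datasets, key=lambda d: cosine(query, datasets[d]), reverse=True)
print(ranked)  # GSE_A ranks first
```

A score near +1 means a closely matching transcriptional profile, near 0 means unrelated, and near -1 means an opposing profile, which is exactly the quantity the reversal analysis above looks for.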
Researchers define the biological process of interest, select a dataset, preprocess the data, identify differentially expressed genes, and validate the signature. Polly’s platform streamlines this process with expert support and ML-ready datasets.
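The "identify differentially expressed genes" step can be sketched as ranking genes by fold change between two sample groups. The values and gene names below are toy data, and real pipelines would add statistical testing and multiple-testing correction:

```python
# Toy sketch of the differential-expression step: rank genes by the
# magnitude of mean log2 fold change between control and treated groups.
import math

def log2_fold_changes(control, treated):
    """Per-gene log2(mean treated / mean control); expression values must be > 0."""
    lfc = {}
    for gene in control:
        c = sum(control[gene]) / len(control[gene])
        t = sum(treated[gene]) / len(treated[gene])
        lfc[gene] = math.log2(t / c)
    return lfc

control = {"MYC": [10, 12, 11], "TP53": [20, 19, 21], "GAPDH": [50, 52, 48]}
treated = {"MYC": [40, 44, 42], "TP53": [10, 9, 11], "GAPDH": [51, 49, 50]}

lfc = log2_fold_changes(control, treated)
# The signature is the genes ordered by how strongly they change.
signature = sorted(lfc, key=lambda g: abs(lfc[g]), reverse=True)
print(signature)  # MYC (up) and TP53 (down) lead; GAPDH is unchanged
```

The resulting signed gene list is what gets validated and then compared against other datasets in the steps above.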
Polly's RNA-Seq Atlas addresses the challenge of extracting associated signatures from public databases by providing a curated resource of RNA-seq datasets collected from the Gene Expression Omnibus (GEO). This richly curated resource helps researchers find datasets with transcriptional profiles similar to their gene sets of interest.
Gene signature comparison analyzes gene expression patterns to identify disease-related signatures. It helps researchers find drugs that can reverse disease signatures, aiding in therapeutic discoveries.