“Data that is loved tends to survive.” - Kurt Bollacker
…And give great insights!
Data serves as the cornerstone of bioinformatic breakthroughs. However, messy data can pose significant obstacles on the journey from raw data to meaningful insights. Information sourced from diverse public repositories often lacks consistent formatting and vital metadata annotations. The absence of contextual and structured information diminishes the findability and reusability of relevant data. Read this blog to understand how meticulously curated datasets can significantly impact and accelerate biomedical R&D by improving the findability and reusability of data.
Yes, they do. In the realm of data repositories, 'curated datasets' are akin to a library where books are meticulously organized by various criteria. But, as with a library, this organization may not align with a specific user's query, which limits its utility. The disparity invites a comparison with platforms like Google Scholar, renowned for deep indexing capabilities that surface more relevant findings. Similarly, deep curation that anticipates downstream analyses is imperative to extract value from the data housed in these repositories.
Let’s look at some specific challenges associated with public repositories that necessitate deep curation to unlock the value of their data.
While repositories play a crucial role in making biological data available to the research community, ensuring the usability of these datasets requires addressing various technical, quality, and accessibility challenges. This is where Elucidata steps in, utilizing cutting-edge AI models to address data quality issues, enabling researchers to fully leverage the wealth of public biomedical data for their research objectives.
Polly, Elucidata's data harmonization platform, effortlessly overcomes the significant data quality challenges found in publicly available datasets from diverse sources such as GEO, PRIDE, CPTAC, and various publications. By employing advanced AI algorithms, Polly harmonizes multi-omics and assay data, transforming them into machine learning (ML)-compatible formats. Trained experts utilize Polly's robust harmonization engine to curate diverse data types, annotate metadata, and ensure consistent processing, all while keeping costs affordable. The resulting ML-ready datasets are stored in Polly's Atlas or any preferred platform, facilitating seamless analysis and management.
To demonstrate the benefits of meticulous deep curation, we analyze the effectiveness of data retrieval across data from three distinct sources, each containing the same datasets from CREEDS:
1. Unprocessed data directly from GEO
2. Data manually curated by CREEDS
3. The same datasets but curated through our Polly Harmonization Engine
These sources represent varying levels of data quality, with raw GEO data at the lower end and data curated by the Polly Harmonization Engine at the higher end. The experiment was carried out using state-of-the-art Named Entity Recognition (NER) models capable of processing text-based queries over the data corpus.
The experiment demonstrated a significant improvement in search responses with the Polly-harmonized version of the corpus, in contrast to the other two sources. The NER model-enabled search over the Polly-harmonized corpus accurately retrieved the relevant datasets for most of the tested queries. Conversely, the raw GEO and CREEDS sources showed large variance in metrics across queries and lower scores overall. The Polly-harmonized data yielded precise responses to queries while reducing the likelihood of overlooking relevant datasets.
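To make the intuition behind this result concrete, here is a minimal sketch of entity-aware dataset search. It is not Polly's implementation: a toy dictionary lookup stands in for a trained NER model, and the dataset records and field names are hypothetical. The point it illustrates is that when metadata is harmonized into structured, normalized fields, entities extracted from a query can be matched exactly instead of fuzzily against free text.

```python
# Illustrative sketch of entity-aware search over harmonized metadata.
# A toy vocabulary lookup stands in for a trained NER model; the records
# and field names below are hypothetical, not Polly's actual schema.

# Surface forms mapped to (entity_type, normalized_value) pairs.
ENTITY_VOCAB = {
    "breast cancer": ("disease", "breast carcinoma"),
    "mcf-7": ("cell_line", "MCF7"),
    "rna-seq": ("assay", "RNA-seq"),
}

def extract_entities(query):
    """Return (entity_type, normalized_value) pairs found in the query."""
    q = query.lower()
    return [norm for surface, norm in ENTITY_VOCAB.items() if surface in q]

# Harmonized records carry structured, normalized metadata fields.
harmonized = [
    {"id": "DS1", "disease": "breast carcinoma",
     "cell_line": "MCF7", "assay": "RNA-seq"},
    {"id": "DS2", "disease": "lung adenocarcinoma",
     "cell_line": "A549", "assay": "RNA-seq"},
]

def search(records, query):
    """Keep records whose structured fields match every extracted entity."""
    wanted = extract_entities(query)
    return [r["id"] for r in records
            if all(r.get(etype) == value for etype, value in wanted)]

print(search(harmonized, "RNA-seq profiles of breast cancer in MCF-7 cells"))
# prints ['DS1']
```

With raw, unharmonized descriptions, the same query would depend on brittle string matching ("MCF-7" vs "MCF7", "breast cancer" vs "breast carcinoma"); normalization is what lets every extracted entity resolve to exactly one field value.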
Read this whitepaper for more details on this case study.
This study, conducted on a representative sample of real queries, emphasizes the vital role of data quality in retrieving pertinent information from a data collection. It is not enough for a language-understanding AI to interpret user questions accurately; the underlying knowledge base must also be meticulously curated, annotated, and structured so that relevant data can be found. Both halves of the search process must work together to translate user queries efficiently and return contextually precise responses. The results underscore the importance of high-quality, deeply curated metadata in navigating large-scale biomedical datasets.
Connect with us or reach out to us at info@elucidata.io to learn more.