
Why Do You Need Custom Biomedical Data Curation?

Elucidata
September 21, 2021

Finding a needle in a haystack is not too difficult… if you have a magnet! Likewise, finding relevant data in a stack of resources is not too difficult, provided you have the right tool.

Biomedical data curation serves as a crucial tool for efficiently navigating the extensive volumes of semi-structured or unstructured biomedical data found across repositories and publications. With the continuous influx of data, researchers often struggle to balance the overwhelming amount of information they receive against the limited time available to process and comprehend it. Researchers increasingly agree on the need for better, more comprehensive curation strategies, even if they differ on whether a single curation effort can serve every use case.

Traditionally, curation has been treated as a one-time, single-step process: once curated, the data was assumed to be suitable for all use cases. This approach does not adequately address the evolving needs of researchers, especially as datasets grow larger and more complex.

To address these challenges, data discovery and management solutions play a vital role in providing efficient access to curated biomedical data. These solutions automate much of the curation process, letting researchers quickly find information tailored to their specific research interests. By integrating them into their workflow, researchers can streamline curation, make better use of available data, and accelerate scientific discovery in the biomedical field.

Curation is not “one-size-fits-all”.

Research projects often entail unique data needs, prompting data-driven teams to treat curation as a process tailored to the specific requirements of each use case. Integrated data discovery and data management solutions play a pivotal role here, enabling teams to efficiently identify, access, and organize the data relevant to their research objectives. By embracing tailored curation approaches and leveraging advanced data management tools, teams can optimize their workflows, enhance data quality, and drive scientific discovery in bioinformatics and related fields.

High-quality curation done as needed can be a superpower for data-driven bioinformatics teams. In this post, we talk about 4 reasons why custom curation is here to stay.

1. Custom Metadata Requirements

Typically, the standard metadata provided with datasets (e.g., age, gender, cell type, cell line, drug) is not enough for the question at hand. Standard metadata works for ‘standard’ use cases, but not if your team is looking for relevant datasets in CAR-T research, or if it is profiling blood samples to study women’s reproductive health.

In the latter case, the type/source of the blood sample, i.e., menstrual blood, cervicovaginal blood, or whole blood, is critical. But it’s unlikely that such a metadata field exists ‘out of the box’. Similarly, a biologist interested in establishing the immunogenicity of approved antibodies may need to extract metadata such as the route of administration, neutralizing antibodies, and heavy-chain and light-chain sequences.

These are not isolated cases. Almost all research projects have unique requirements that necessitate specialized treatment of the available data to ensure its findability and reusability. Custom curation can be carried out manually or with specialized ML models. In either case, a sound process and QC are essential to ensure that the fields are attributed consistently.
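To make this concrete, here is a minimal sketch of what consistent attribution with QC could look like. The field names and controlled vocabularies below are hypothetical examples drawn from the scenarios above, not a fixed standard:

```python
# A minimal sketch of QC for custom metadata fields. The field names and
# allowed values are hypothetical examples, not a fixed standard.

ALLOWED_VALUES = {
    "blood_sample_source": {"menstrual blood", "cervicovaginal blood", "whole blood"},
    "route_of_administration": {"intravenous", "subcutaneous", "oral"},
}

def validate_sample_metadata(metadata: dict) -> list[str]:
    """Return a list of QC errors for one sample's curated metadata."""
    errors = []
    for field, allowed in ALLOWED_VALUES.items():
        value = metadata.get(field)
        if value is None:
            errors.append(f"missing required field: {field}")
        elif value.strip().lower() not in allowed:
            errors.append(f"unexpected value for {field}: {value!r}")
    return errors

samples = [
    {"blood_sample_source": "menstrual blood", "route_of_administration": "intravenous"},
    {"blood_sample_source": "serum"},  # fails both checks
]
for i, sample in enumerate(samples):
    for error in validate_sample_metadata(sample):
        print(f"sample {i}: {error}")
```

Whether the fields are filled in by human curators or by an ML model, a check like this is what keeps a custom metadata field usable across thousands of datasets.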

2. Raw Data Consistently Processed Through Custom Pipelines

Researchers utilize data from multiple sources (public, premium, and proprietary), but data processed differently can’t easily be compared. So research groups often want to re-process raw data through their own ‘custom’ pipelines, enabling apples-to-apples comparisons.

Re-processing raw data is not trivial, especially if your team doesn’t have a scalable way to do it. Some teams are well equipped because they have productionized their raw-data processing pipelines; many others aren’t as well prepared. For them, it is useful to productionize processing pipelines, as we do using Nextflow on Polly.
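As a rough illustration of the idea (not Polly’s actual interface), the sketch below shows a single entry point that pins the pipeline version and parameters so every dataset, whatever its source, is processed identically. The pipeline repository name and parameters are hypothetical:

```python
import subprocess

# Illustrative only: one entry point that reprocesses raw data from any
# source with the same pinned pipeline version and parameters, so the
# outputs are directly comparable. Repo name and params are hypothetical.

PIPELINE_VERSION = "1.4.2"
COMMON_PARAMS = ["--aligner", "star", "--genome", "GRCh38"]

def reprocess(raw_data_dir: str, output_dir: str) -> None:
    """Run the same Nextflow pipeline on one dataset's raw files."""
    cmd = [
        "nextflow", "run", "my-org/rnaseq-pipeline",  # hypothetical repo
        "-r", PIPELINE_VERSION,   # pin the pipeline revision
        "--input", raw_data_dir,
        "--outdir", output_dir,
        *COMMON_PARAMS,
    ]
    subprocess.run(cmd, check=True)

# Every dataset, public or proprietary, goes through the identical pipeline.
for source in ["geo_GSE12345", "proprietary_batch_07"]:
    reprocess(f"raw/{source}", f"processed/{source}")
```

The design point is simply that the pipeline version and parameters live in one place, so no dataset can drift away from the common processing recipe.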

3. Better Data Integration

Different data types come in distinct formats, and data from multiple sources is stored in different ways. It is imperative that the data and associated metadata are standardized, normalized, and harmonized in a way that suits downstream analyses. Consider, for example, a research team that wants to integrate patient data with different omics datasets, covering host genetics, clinical information, and microbiome composition.

A customized curation process can ensure that a specific ontology is used for metadata harmonization and data labelling, and that data processing is carried out through the desired pipeline, so that the resulting data is ready for consumption in any integrative analyses performed downstream.
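For illustration, here is a minimal sketch of ontology-based harmonization. The mapping table is a tiny hypothetical fragment of what a real controlled vocabulary (e.g., Cell Ontology, UBERON) would provide:

```python
# A minimal sketch of metadata harmonization against a single ontology.
# The mapping table is a hypothetical fragment; in practice it would be
# backed by a full ontology such as Cell Ontology or UBERON.

TISSUE_ONTOLOGY = {
    "pbmc": ("peripheral blood mononuclear cell", "CL:2000001"),
    "peripheral blood mononuclear cells": ("peripheral blood mononuclear cell", "CL:2000001"),
    "liver tissue": ("liver", "UBERON:0002107"),
    "hepatic": ("liver", "UBERON:0002107"),
}

def harmonize_tissue(raw_label: str) -> tuple[str, str]:
    """Map a free-text tissue label to a canonical term and ontology ID."""
    key = raw_label.strip().lower()
    if key not in TISSUE_ONTOLOGY:
        raise ValueError(f"unmapped tissue label: {raw_label!r} (needs curator review)")
    return TISSUE_ONTOLOGY[key]

for label in ["PBMC", "hepatic"]:
    term, term_id = harmonize_tissue(label)
    print(f"{label!r} -> {term} [{term_id}]")
```

Once every source’s free-text labels resolve to the same ontology terms, joining host genetics, clinical, and microbiome data stops being a string-matching exercise.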

4. Access to the Latest Datasets Relevant to a Biological Question

Many interesting datasets are published every week. Bioinformaticians often want to analyze several datasets per week, and they want to move fast, so curation becomes a bottleneck. For example, discovery teams looking for datasets specific to AML research may have to spend a large chunk of their time finding and curating the latest studies from varied sources such as CCLE, TCGA, and BeatAML.

Without a dedicated curation effort, researchers have to restrict themselves to the one or two datasets they can analyze. Often, this is a choice driven not by scientific reasoning but by resource constraints. If you have access to curated datasets internally or through a platform like Polly, you can analyze multiple datasets and shortlist the most appropriate ones for further exploration. More information usually means better scientific decisions.
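To sketch what that shortlisting step can look like once datasets carry consistent curated fields (the catalog records below are invented for illustration, not real study data):

```python
# Illustrative only: with consistent curated metadata, shortlisting
# becomes a simple query. These catalog entries are invented examples.

catalog = [
    {"id": "DS-001", "disease": "acute myeloid leukemia", "source": "BeatAML", "samples": 562},
    {"id": "DS-002", "disease": "breast carcinoma", "source": "TCGA", "samples": 1098},
    {"id": "DS-003", "disease": "acute myeloid leukemia", "source": "TCGA", "samples": 200},
]

aml_datasets = [
    d for d in catalog
    if d["disease"] == "acute myeloid leukemia" and d["samples"] >= 100
]
for d in sorted(aml_datasets, key=lambda d: d["samples"], reverse=True):
    print(d["id"], d["source"], d["samples"])
```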


In summary, data curation is not a one-time process; it is a continuous, adaptable effort. Empowering your team to curate data frequently and effectively is crucial, because science and data curation both thrive on iteration. Our biomedical data harmonization platform, Polly, offers advanced data discovery and data management solutions fit for downstream analysis. We have developed pipelines and processes that can be readily customized to meet any research need, allowing researchers to access, manage, and analyze data efficiently. With Polly, teams can streamline their data curation efforts and ensure that their analyses are based on reliable, relevant information, ultimately advancing scientific discovery and innovation.

Reach out to us at info@elucidata.io or click here for all your biomedical data curation needs.
