Transformative Trends: Manual and Automated Curation Approaches in Biopharma Research

The advent of big data has ushered in transformative shifts in the ever-evolving landscape of biopharma R&D. As we navigate the intricate web of patient-level insights, data lakes, the rise of data scientists, and the transformative impact of new technologies, a crucial question arises:

How can the life sciences industry curate this vast trove of information to extract meaningful and profitable insights?

This blog delves into the pressing need for effective curation strategies (manual vs. automated curation) in the current biopharma R&D scenario, unraveling the complexities and opportunities intertwined with the big data revolution.

What is Biomedical Data Curation?

Biomedical data curation is a high-value task in which experts carefully examine the relevant scientific literature and extract essential information, such as biological functions and relationships between entities, into structured database records. The expansion of high-throughput technologies has led to a substantial increase in biological data, accompanied by a rise in published research papers. Consequently, there is an escalating demand for high-quality curation that makes use of these resources. However, the same growth makes it challenging for curators to locate and integrate scientific findings from the literature.

What Is the (Traditional) Manual Curation Process?

Manual curation refers to the process of reviewing and refining data manually to ensure accuracy, completeness, and relevance. Curation begins with a thorough review of the data to identify any errors, inconsistencies, or missing information. This assessment process is essential for ensuring the overall quality and reliability of the data.
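
As a rough illustration of the review step, here is a minimal Python sketch of a check that flags samples with missing metadata. The required fields, the markers treated as "missing", and the sample records are all hypothetical and chosen only for this example; a real curation schema would define its own.

```python
# Hypothetical required fields; a real curation schema would define its own.
REQUIRED_FIELDS = ("organism", "tissue", "disease")
# Hypothetical placeholder strings that should count as "no value provided".
MISSING_MARKERS = {"", "na", "n/a", "none", "unknown"}


def missing_fields(record: dict) -> list:
    """Return the required fields that are absent or hold a placeholder value."""
    missing = []
    for field in REQUIRED_FIELDS:
        value = record.get(field)
        if value is None or str(value).strip().lower() in MISSING_MARKERS:
            missing.append(field)
    return missing


# Made-up sample metadata, for illustration only.
records = {
    "sample_1": {"organism": "Homo sapiens", "tissue": "liver", "disease": "NA"},
    "sample_2": {"organism": "Homo sapiens", "tissue": "", "disease": "asthma"},
    "sample_3": {"organism": "Mus musculus", "tissue": "lung", "disease": "normal"},
}

report = {sample_id: missing_fields(rec) for sample_id, rec in records.items()}
flagged = {sample_id: gaps for sample_id, gaps in report.items() if gaps}
print(flagged)
```

In this sketch, only the flagged samples would need a curator's attention.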

Once the relevant datasets are identified, researchers annotate and validate the data manually. This involves adding relevant information, such as gene annotations, experimental conditions, and other metadata, to enhance the context of the data. Heterogeneous data (from different sources or instruments) needs to be standardized (converted to a standard format) and normalized for downstream analysis. Biopharma research often involves working with diverse datasets from various sources.
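
To make the standardization step concrete, here is a small Python sketch that maps source-specific field names and free-text values onto one schema. The alias table, the synonym table, and the sample records are hypothetical; in practice these mappings would typically come from a curated ontology or controlled vocabulary rather than a hand-written dictionary.

```python
# Hypothetical mapping from source-specific column names to one standard schema.
FIELD_ALIASES = {
    "tissue": "tissue",
    "tissue_type": "tissue",
    "organ": "tissue",
    "age": "age_years",
    "age_yrs": "age_years",
}

# Hypothetical synonym table; a real pipeline would draw on a curated ontology.
TISSUE_SYNONYMS = {
    "liver": "liver",
    "hepatic tissue": "liver",
    "lung": "lung",
    "pulmonary tissue": "lung",
}


def standardize_record(raw: dict) -> dict:
    """Rename fields to the standard schema and map values to canonical terms."""
    record = {}
    for key, value in raw.items():
        field = FIELD_ALIASES.get(key.strip().lower())
        if field is None:
            continue  # unknown fields are dropped here; a curator might review them instead
        if field == "tissue":
            cleaned = str(value).strip().lower()
            # Fall back to the cleaned value so unmapped terms stay visible.
            record[field] = TISSUE_SYNONYMS.get(cleaned, cleaned)
        elif field == "age_years":
            record[field] = float(value)
    return record


# Two made-up records describing the same kind of sample in different ways.
samples = [
    {"Tissue_Type": "Hepatic tissue", "age_yrs": "54"},
    {"organ": " Liver ", "Age": 61},
]
standardized = [standardize_record(s) for s in samples]
print(standardized)
```

Both records end up with the same field names and the same tissue term, which is what makes them comparable downstream.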

Harmonizing these datasets to ensure compatibility and coherence in the analyses is therefore the next step. Assessing the biological relevance of the data is also crucial, whether for applications such as patient stratification and biomarker discovery or in the context of specific biological processes, pathways, or disease mechanisms, so that the analyses are meaningful and aligned with the research goals.

What Are the Challenges in the Traditional Manual Curation Approach?

The amount of biological data in GenBank is doubling every two years. This pace is not unique to GenBank. With the advent of high-throughput technologies, the scale of data production is astronomical. Research methodology has evolved from a hypothesis-validation process to a data-driven one that maximizes the use of the data. The manual curation process presents some blockers to this data-driven approach. Here are a few important ones:

Scale and Volume: Dealing with large-scale datasets generated by high-throughput technologies is a monumental task. Manual curation can become time-consuming and impractical when faced with the sheer volume of data, potentially leading to delays in research timelines.
Data Heterogeneity: Researchers lose precious time in harmonizing and integrating heterogeneous data types. The process is labor-intensive and could potentially result in inconsistencies.
Subjectivity and Bias: Manual curation introduces the element of subjectivity, as different curators may interpret and annotate data differently. This can lead to variations and biases in the curated datasets, affecting the reliability of downstream analyses.
Resource Intensiveness: Skilled experts are required for manual curation. The resource-intensive nature of manual curation, both in terms of time and expertise, can be a bottleneck in handling large-scale datasets efficiently.
Continuous Updates: Life sciences knowledge evolves rapidly, necessitating continuous updates to curated datasets. Researchers may struggle to keep pace with the dynamic nature of scientific information, impeding timely decision-making and hindering the pace of scientific discovery.

How Can Automated Curation Help?

In addressing the above-mentioned challenges, the integration of automated curation systems becomes pivotal. Automated curation tools leverage machine learning algorithms and artificial intelligence to process vast datasets efficiently, ensuring accuracy, scalability, and adaptability to the evolving landscape of life sciences research. Automation can be applied at each stage of the curation process: scouring data repositories for keywords of interest, checking metadata for completeness, standardization, harmonization, and so on. Automated curation offers several advantages over the manual curation process, such as:

Scalability and Speed: Automation allows for the efficient processing of large quantities of data, which can be time-consuming and difficult to manage manually. This enables researchers to analyze a vast amount of relevant data within a reasonable time, thereby deriving statistically sound, data-driven insights.
Improved Accuracy and Reduced Bias: In some cases, automated curation processes can surpass human accuracy, particularly when it comes to repetitive tasks or those that require a high level of precision. For example, automated systems can be used to identify anomalies in large datasets with a higher degree of accuracy than humans. Also, because an automated process follows a set of predefined rules, the chance of subjective bias creeping in is very low.
Cost Efficiency: Standardizing and harmonizing biological data is essential due to its heterogeneity, requiring expert attention for accuracy. This repetitive and time-consuming task detracts from researchers' ability to derive insights. Automated curation proves cost-effective, sparing researchers from manual effort by handling repetitive and time-intensive activities.
New Possibilities: Automation opens new avenues by tackling tasks that are impossible to do manually. For instance, auto-curation facilitates the construction of complex knowledge graphs, which are challenging to create manually because of the vast amount of data involved. Leveraging existing data in novel ways becomes possible through effective curation, enabling the extraction of valuable insights.

To manually curate a single dataset of approximately 50-60 samples, the researcher must locate the publication, use code to extract the associated metadata, and manually inspect and document all curation fields, a process that takes a considerable amount of time (~2-3 hours in this case). On the other hand, an efficient automated curation process, even with an expert double-checking each step, can complete the task in just 2-3 minutes per dataset. This acceleration in the curation process is significant when scaling up. Manual dependence not only restricts the number of datasets processed within a reasonable time frame but also hampers scalability. As the number of datasets to be curated increases, accuracy becomes a concern due to the likelihood of oversights and errors when dealing with a large volume of data manually.
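
To see what this difference means at scale, here is a back-of-the-envelope Python sketch. It assumes the midpoints of the ranges quoted above (2.5 hours and 2.5 minutes per dataset) and a hypothetical backlog of 1,000 datasets; both are illustrative assumptions, not measured figures.

```python
# Assumed midpoints of the per-dataset times quoted above (not measured values).
MANUAL_MINUTES_PER_DATASET = 150.0   # midpoint of ~2-3 hours
AUTOMATED_MINUTES_PER_DATASET = 2.5  # midpoint of 2-3 minutes


def total_hours(n_datasets: int, minutes_per_dataset: float) -> float:
    """Total curation effort in hours for a given number of datasets."""
    return n_datasets * minutes_per_dataset / 60.0


n = 1000  # hypothetical backlog size
manual_hours = total_hours(n, MANUAL_MINUTES_PER_DATASET)
automated_hours = total_hours(n, AUTOMATED_MINUTES_PER_DATASET)
per_dataset_ratio = MANUAL_MINUTES_PER_DATASET / AUTOMATED_MINUTES_PER_DATASET

print(f"manual: {manual_hours:.0f} h, automated: {automated_hours:.1f} h, "
      f"ratio: {per_dataset_ratio:.0f}x")
```

Under these assumptions, the manual route takes about 2,500 hours for the backlog, against roughly 42 hours for the automated route with expert review.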

Elucidata’s Automated Curation Model with Humans-in-the-loop for Better Efficiency

While automated curation processes offer efficiency and speed, human involvement in biopharma research is crucial for ensuring accurate, context-aware, and adaptable curation, particularly in dealing with the intricacies of scientific literature. Elucidata has been a pioneer in delivering high-quality data and has been serving top pharma companies for several years. We consistently utilize state-of-the-art technology to scale up the data curation process with an expert in the loop, ensuring that the data provided to our customers is of the highest standard.

Staying at the forefront of the ever-evolving intersection of machine learning and curation, we've pioneered a cutting-edge approach that applies GPT, a large language model, specifically to biomedical data. Our goal was to enhance the automated curation process, ensuring a blend of precision and speed. Despite having a robust biocuration process, we encountered resource constraints and prolonged training times for our fine-tuned BERT models. To overcome these challenges, we ventured into the realm of ChatGPT and prompt engineering, revolutionizing our approach to extracting biomedical entities from publications.

The outcome of this exploration has been remarkable. Leveraging ChatGPT, we've achieved an accuracy and F1 score of close to 83% in sample-level disease extraction. This journey in improving automated curation didn't stop there. Over the last year, our efforts have focused on expanding the capabilities of ChatGPT models within the biomedical domain. Each stage of the process, from data ingestion to harmonization, has been significantly accelerated with the assistance of GPT models, achieving 10x curation speed. Our human-in-the-loop curation model, comprising over 100 experts in curation, NLP, data engineering, and bioinformatics, plays a crucial role in ensuring the delivery of data with an exceptional level of quality, reaching an accuracy of 99.99%.
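
For readers less familiar with the F1 score, here is a minimal Python sketch of how precision, recall, and F1 can be computed for sample-level extraction. It is a generic illustration, not a description of our evaluation pipeline, and it assumes each extraction is represented as a (sample ID, disease) pair. The labels below are made up and unrelated to the figures above.

```python
def precision_recall_f1(gold: set, predicted: set) -> tuple:
    """Compare predicted (sample, disease) pairs against expert-curated ones."""
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Made-up example: expert labels for four samples, model output for three.
gold = {
    ("S1", "breast cancer"),
    ("S2", "breast cancer"),
    ("S3", "asthma"),
    ("S4", "type 2 diabetes"),
}
predicted = {
    ("S1", "breast cancer"),
    ("S2", "breast cancer"),
    ("S3", "copd"),  # wrong disease: counts against precision and recall
}

precision, recall, f1 = precision_recall_f1(gold, predicted)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Because F1 is the harmonic mean of precision and recall, it penalizes a model that extracts few diseases correctly as well as one that extracts many diseases loosely.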

In addition to this, our innovation extends to the development of PollyGPT, a powerful tool designed to dive into vast corpora of data and respond to biological questions posed in natural language. This marks a significant stride forward, bringing a new dimension to the intersection of technology and life sciences.

In the ever-evolving landscape of biopharma research, where data-driven approaches are becoming increasingly crucial, staying abreast of technology is essential. Don't miss out – Connect with us or reach out to us at info@elucidata.io to learn more!
