The advent of big data has reshaped the ever-evolving landscape of biopharma R&D. As we navigate the intricate web of patient-level insights, data lakes, the rise of data scientists, and the impact of new technologies, a crucial question arises:
How can the life sciences industry curate this vast trove of information to extract meaningful and profitable insights?
This blog delves into the pressing need for effective curation strategies (manual vs. automated curation) in the current biopharma R&D scenario, unraveling the complexities and opportunities intertwined with the big data revolution.
Biomedical data curation is a high-value task in which experts carefully examine the relevant scientific literature and extract essential information, such as biological functions and relationships between entities, into structured database records. The expansion of high-throughput technologies has led to a substantial increase in biological data, accompanied by a rise in published research papers. Consequently, there is an escalating demand for high-quality curation that makes use of these resources. However, this growth makes it harder for curators to locate and integrate scientific findings from the literature.
Manual curation refers to the process of reviewing and refining data manually to ensure accuracy, completeness, and relevance. Curation begins with a thorough review of the data to identify any errors, inconsistencies, or missing information. This assessment process is essential for ensuring the overall quality and reliability of the data.
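As a rough illustration of the review step described above, the sketch below flags records with missing information. The field names and the list of placeholder strings are assumptions made for this example, not a published schema.

```python
# Hypothetical required fields; a real schema would come from the curation team.
REQUIRED_FIELDS = ("sample_id", "organism", "tissue", "disease")

# Placeholder strings that often stand in for missing values (assumed list).
MISSING_TOKENS = {"", "na", "n/a", "none", "unknown", "not available"}


def find_missing_fields(record):
    """Return the required fields that are absent or effectively empty."""
    missing = []
    for field in REQUIRED_FIELDS:
        value = record.get(field)
        if value is None or str(value).strip().lower() in MISSING_TOKENS:
            missing.append(field)
    return missing


def audit(records):
    """Map each incomplete record's sample_id (or its index) to its missing fields."""
    report = {}
    for index, record in enumerate(records):
        missing = find_missing_fields(record)
        if missing:
            key = record.get("sample_id") or f"record_{index}"
            report[key] = missing
    return report


records = [
    {"sample_id": "S1", "organism": "Homo sapiens", "tissue": "liver", "disease": "NASH"},
    {"sample_id": "S2", "organism": "Homo sapiens", "tissue": "N/A", "disease": ""},
    {"sample_id": "S3", "organism": "Mus musculus", "disease": "normal"},
]
print(audit(records))
```

A report like this only tells a curator where to look; deciding what the missing value should be still requires reading the source material.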
Once the relevant datasets are identified, researchers annotate and validate the data manually. This involves adding relevant information, such as gene annotations, experimental conditions, and other metadata, to enhance the context of the data. Heterogeneous data (from different sources or instruments) needs to be standardized (converted to a standard format) and normalized for downstream analysis. Biopharma research often involves working with diverse datasets from various sources.
Harmonizing these datasets so that they are compatible and coherent for analysis is therefore the next step. Assessing the biological relevance of the data is also crucial, whether for applications such as patient stratification and biomarker discovery or in the context of specific biological processes, pathways, or disease mechanisms, so that the analyses are meaningful and aligned with the research goals.
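The standardization and harmonization steps described above can be sketched in a few lines. Everything named here (the alias table, the synonym table, the three-field schema) is a hypothetical stand-in; real pipelines typically map values to controlled vocabularies or ontologies instead of a hand-written dictionary.

```python
# Hypothetical mapping from source-specific column names to one shared schema.
FIELD_ALIASES = {
    "organism": "organism", "species": "organism",
    "tissue": "tissue", "tissue_type": "tissue", "source_tissue": "tissue",
    "disease": "disease", "diagnosis": "disease",
}

# Hypothetical synonym table used to normalize values per field.
VALUE_SYNONYMS = {
    "organism": {"human": "Homo sapiens", "h. sapiens": "Homo sapiens",
                 "mouse": "Mus musculus"},
    "disease": {"healthy": "normal", "control": "normal"},
}

SCHEMA = ("organism", "tissue", "disease")


def standardize(record):
    """Rename known fields to the shared schema and normalize their values."""
    clean = {}
    for raw_key, raw_value in record.items():
        key = FIELD_ALIASES.get(raw_key.strip().lower().replace(" ", "_"))
        if key is None:
            continue  # unknown fields are dropped in this sketch
        value = str(raw_value).strip()
        # Fall back to the original value when no synonym is known.
        clean[key] = VALUE_SYNONYMS.get(key, {}).get(value.lower(), value)
    return clean


def harmonize(datasets):
    """Merge records from several sources into one list with identical keys."""
    merged = []
    for source, records in datasets.items():
        for record in records:
            clean = standardize(record)
            row = {field: clean.get(field) for field in SCHEMA}
            row["source"] = source  # keep provenance for later review
            merged.append(row)
    return merged


datasets = {
    "lab_a": [{"Species": "human", "Tissue Type": "liver", "Diagnosis": "control"}],
    "lab_b": [{"organism": "Mus musculus", "source_tissue": "liver ", "batch": 3}],
}
for row in harmonize(datasets):
    print(row)
```

Fields that a source does not provide come out as `None` instead of being silently omitted, so gaps stay visible to whoever reviews the merged table.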
The amount of biological data in GenBank is doubling every two years. This pace is not unique to GenBank. With the advent of high-throughput technologies, the scale of data production is astronomical. Research methodology has evolved from a hypothesis-validation process to a data-driven one that maximizes the use of the data. The manual curation process presents some blockers to the data-driven approach. Here are a few important ones:
In addressing the above-mentioned challenges, the integration of automated curation systems becomes pivotal. Automated curation tools leverage machine learning algorithms and artificial intelligence to process vast datasets efficiently, ensuring accuracy, scalability, and adaptability to the evolving landscape of life sciences research. Automation can be applied at each stage of the curation process: scouring data repositories for keywords of interest, checking metadata for completeness, standardization, harmonization, and so on. Automated curation presents several advantages over the manual curation process, such as:
While automated curation processes offer efficiency and speed, human involvement in biopharma research is crucial for ensuring accurate, context-aware, and adaptable curation, particularly in dealing with the intricacies of scientific literature. Elucidata has been a pioneer in delivering high-quality data and has been serving top pharma companies for several years. We consistently utilize state-of-the-art technology to scale up the data curation process with an expert in the loop, ensuring that the data provided to our customers is of the highest standard.
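One common way to keep an expert in the loop is to accept automated annotations only above a confidence cutoff and send everything else to a reviewer. The sketch below illustrates that general pattern; the `Prediction` type, the `route` helper, and the 0.9 threshold are assumptions made for this example and do not describe Elucidata's actual pipeline.

```python
from dataclasses import dataclass


@dataclass
class Prediction:
    sample_id: str
    label: str
    confidence: float  # model-reported score in [0, 1]


REVIEW_THRESHOLD = 0.9  # hypothetical cutoff; would be tuned on validated data


def route(predictions, threshold=REVIEW_THRESHOLD):
    """Split predictions into auto-accepted ones and ones needing expert review."""
    accepted, needs_review = [], []
    for prediction in predictions:
        if prediction.confidence >= threshold:
            accepted.append(prediction)
        else:
            needs_review.append(prediction)
    return accepted, needs_review


predictions = [
    Prediction("S1", "asthma", 0.97),
    Prediction("S2", "COPD", 0.62),
    Prediction("S3", "normal", 0.91),
]
accepted, needs_review = route(predictions)
print("auto-accepted:", [p.sample_id for p in accepted])
print("expert review:", [p.sample_id for p in needs_review])
```

Lowering the threshold saves expert time at the cost of letting more uncertain labels through, so the cutoff is a quality decision as much as a technical one.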
Staying at the forefront of the ever-evolving landscape of machine learning and curation, we've pioneered a cutting-edge approach that applies GPT, a large language model, specifically to biomedical data. Our goal was to enhance the automated curation process, ensuring a blend of precision and speed. Despite having a robust biocuration process, we encountered resource constraints and prolonged training times for our fine-tuned BERT models. To overcome these challenges, we ventured into the realm of ChatGPT and prompt engineering, revolutionizing our approach to extracting biomedical entities from publications.
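To make the idea of prompt-based entity extraction concrete, here is a minimal sketch of the two pieces that surround a language model call: building the prompt and parsing the reply defensively. The prompt wording, the JSON reply format, and the helper names are all hypothetical, and the model call itself is deliberately left out; none of this is taken from Elucidata's implementation.

```python
import json

# Hypothetical prompt; doubled braces are literal braces in str.format.
PROMPT_TEMPLATE = (
    "Extract the disease studied in the sample described below.\n"
    'Answer with JSON only, e.g. {{"disease": "<name or null>"}}.\n\n'
    "Sample description:\n{description}"
)


def build_prompt(description):
    """Fill the template with one sample description."""
    return PROMPT_TEMPLATE.format(description=description.strip())


def parse_response(text):
    """Return the extracted disease, or None if the reply is unusable."""
    try:
        payload = json.loads(text)
    except json.JSONDecodeError:
        return None  # the model did not return valid JSON
    if not isinstance(payload, dict):
        return None
    disease = payload.get("disease")
    if isinstance(disease, str) and disease.strip():
        return disease.strip()
    return None


print(build_prompt("  Lung biopsy from a patient with idiopathic pulmonary fibrosis "))
print(parse_response('{"disease": "idiopathic pulmonary fibrosis"}'))
print(parse_response("Sorry, I cannot tell."))
```

Treating an unparseable reply as "no answer" keeps malformed model output from entering the curated record, where it can instead be picked up by a human reviewer.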
The outcome of this exploration has been remarkable. Leveraging ChatGPT, we've achieved accuracy and F1 scores close to 83% in sample-level disease extraction. This journey in improving automated curation didn't stop there. Over the last year, our efforts have focused on expanding the capabilities of ChatGPT models within the biomedical domain. Each stage of the process, from data ingestion to harmonization, has been significantly accelerated with the assistance of GPT models, achieving 10x curation speed. Our human-in-the-loop curation model, comprising over 100 experts in curation, NLP, data engineering, and bioinformatics, plays a crucial role in ensuring the delivery of data with an exceptional level of quality, reaching an accuracy of 99.99%.
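For readers less familiar with the metrics quoted above, the sketch below shows one way precision, recall, and F1 could be computed for sample-level labels against expert annotations. The scoring convention (a wrong label counts as both a false positive and a false negative) and the toy data are assumptions for illustration; they are not the evaluation setup behind the reported figures.

```python
def precision_recall_f1(predicted, expected):
    """Score per-sample labels; a missing sample means no label was given."""
    true_pos = false_pos = false_neg = 0
    for sample_id in set(predicted) | set(expected):
        pred = predicted.get(sample_id)
        gold = expected.get(sample_id)
        if pred is not None and pred == gold:
            true_pos += 1
        else:
            if pred is not None:
                false_pos += 1  # model gave a label that is wrong or unwanted
            if gold is not None:
                false_neg += 1  # expert label was missed
    precision = true_pos / (true_pos + false_pos) if true_pos + false_pos else 0.0
    recall = true_pos / (true_pos + false_neg) if true_pos + false_neg else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Toy data: expert annotations versus model output.
expected = {"S1": "asthma", "S2": "COPD", "S3": "normal", "S4": "asthma"}
predicted = {"S1": "asthma", "S2": "asthma", "S3": "normal"}

precision, recall, f1 = precision_recall_f1(predicted, expected)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

F1 is the harmonic mean of precision and recall, so it only comes out high when the model both avoids wrong labels and misses few of the expert ones.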
In addition to this, our innovation extends to the development of PollyGPT, a powerful tool designed to dive into vast corpora of data and respond to biological questions posed in natural language. This marks a significant stride forward, bringing a new dimension to the intersection of technology and life sciences.
In the ever-evolving landscape of biopharma research, where data-driven approaches are becoming increasingly crucial, staying abreast of technology is essential. Don't miss out: connect with us or reach out at info@elucidata.io to learn more!