ChatGPT in Drug Discovery

Vishal Samal, Shrushti Joshi

July 5, 2023

In today's digital age, where information is abundant and constantly evolving, curation has become more crucial than ever. Curators are the gatekeepers of the vast sea of knowledge, sifting through the overwhelming volume of data to deliver relevant, valuable, and engaging content to their audience. But with the exponential growth of information, curators face the challenge of efficiently and effectively navigating this vast landscape.

Enter ChatGPT, an AI-powered language model developed by OpenAI and launched in November 2022.

With its advanced natural language processing capabilities, ChatGPT empowers curators to streamline their processes, make content easier to find, evaluate and organize information, and ultimately provide an elevated experience for their readers. ChatGPT has captivated the public's imagination like few other innovations. Its progress has surprised the machine learning community, surpassing the previous benchmark set by the BERT models released in 2018.

“It won’t be a surprise to see, in the next 24 months, multiple billion-dollar companies built on top of OpenAI’s foundational models. The startups that will be the most successful won’t be the best at prompt engineering, which is the focus today; instead, success will be found in what novel data and use cases they incorporate into OpenAI’s models. This anonymous data and application will be the moat that establishes the next set of AI unicorns.” ~David Shim

The Life Sciences community is particularly interested in understanding the implications of ChatGPT for their work. In this blog, we dive into the world of curation and explore how ChatGPT can revolutionize this practice.

Why is Data Curation Important?

As public data repositories accept data in flexible arrangements, significant variations arise in how data is submitted. Consequently, intelligent systems are necessary to extract and categorize pertinent information from the provided metadata on these repositories.

Elucidata specializes in ingesting and providing omics datasets from diverse sources in standardized machine learning (ML)-ready formats to expedite drug development.

To address this challenge, Elucidata employs Biological Natural Language Processing (BioNLP) systems to curate its platform’s vast array of metadata, thereby automating the process. This system comprises two primary components:

One is responsible for extracting relevant information from public data.
The other harmonizes the extracted data to a standardized vocabulary.

By leveraging these BioNLP systems, Elucidata establishes a consistent format across all its diverse data sources, significantly reducing the effort required to render public data usable. This standardized approach not only enhances the efficiency of data curation but also contributes to accelerating research and analysis in the field of drug development.

Data Curation Using BioNLP Systems (Before ChatGPT)

To streamline and standardize the curation process, we use BioNLP systems to extract relevant entities from metadata, abstracts, and publications, which automates much of the work.

Figure: Workflow of the curation process using BioNLP systems

The high-level process of training a model involves several key steps.

First comes the field definition and manual curation phase, where the field to be extracted is defined and guidelines are written for generating training data. This data undergoes a double-blinded review to ensure it is reliable enough for training task-specific models.
In the training phase, a large corpus of texts, such as publications, is preprocessed by dividing them into smaller paragraphs. These paragraphs are then used to train BERT models specific to the task.

BERT (Bidirectional Encoder Representations from Transformers) is a natural language processing (NLP) model introduced by researchers at Google AI in 2018. It was state of the art at the time and significantly advanced the field's understanding of contextual language representations.

After the models are trained, they are tested using a separate dataset known as test data.
The accuracy of the models is evaluated, and if they meet the desired performance standards, they can be used for curation in production. However, it is often necessary to repeat the training process or obtain more training data to improve the models. Typically, around 5 to 15 iterations are required before the models are "production-ready."
In production, the task-specific model trained earlier extracts relevant entities from metadata, publications, and abstracts.
In the final stage, the extracted entities are standardized against ontologies and vocabularies such as MeSH, PubChem, BTO, and others. To ensure consistency, this mapping is done by a dedicated model trained for the purpose, called the "normalization" model. The result is a set of "normalized entities," terms selected from a controlled dictionary.
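The preprocessing and normalization steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not our production pipeline: the function names, the 200-word limit, and the toy vocabulary (including its placeholder identifiers) are hypothetical, and a simple dictionary lookup stands in for the trained normalization model.

```python
import re

# Hypothetical toy vocabulary: synonym -> (identifier, preferred term).
# The identifiers are placeholders, not real ontology IDs.
TOY_VOCABULARY = {
    "breast cancer": ("TOY:0001", "Breast Neoplasms"),
    "breast carcinoma": ("TOY:0001", "Breast Neoplasms"),
    "t2d": ("TOY:0002", "Diabetes Mellitus, Type 2"),
    "type 2 diabetes": ("TOY:0002", "Diabetes Mellitus, Type 2"),
}


def split_into_paragraphs(text, max_words=200):
    """Split text on blank lines, then cap each chunk at max_words words."""
    chunks = []
    for paragraph in re.split(r"\n\s*\n", text):
        words = paragraph.split()
        for start in range(0, len(words), max_words):
            chunks.append(" ".join(words[start:start + max_words]))
    return chunks


def normalize_entity(mention, vocabulary=TOY_VOCABULARY):
    """Map an extracted mention to a controlled term, or None if unknown."""
    key = " ".join(mention.lower().split())  # lowercase, collapse whitespace
    return vocabulary.get(key)
```

In this sketch, different surface forms such as "breast cancer" and "Breast  Carcinoma" resolve to the same controlled term, which is the point of normalization. A trained model is needed in practice because real mentions vary far more than any fixed lookup table can cover.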

There are two limitations to using the current process for developing models.

First, each model can curate only one field, so development time limits how many new fields we can add over time.
Second, BERT's architecture and size impose limitations, making it relatively less effective at understanding context and extracting information.

Enhancing Curation with ChatGPT

ChatGPT has two advantages over BERT.

ChatGPT is an LLM designed to follow instructions, making it flexible regarding the tasks it can perform.
Being a much larger language model, it tends to be more accurate at those tasks.

One of the initial applications we explored with ChatGPT is its use in curating various fields using prompts. Our experimentation has revealed that ChatGPT performs better than BERT-based models while requiring significantly less development time.

Experiment 1

To evaluate this, we conducted an experiment on information extraction, specifically extracting disease information from samples within an omics dataset. We selected datasets from GEO (Gene Expression Omnibus) to create a test set for comparison.

Both BERT and ChatGPT were employed to extract disease labels. For BERT, we utilized a custom pipeline designed explicitly for disease extraction. On the other hand, with ChatGPT, we used a prompt that provided instructions on the process of extracting disease from the metadata.
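To make the prompt-based approach concrete, here is a minimal sketch of how sample metadata might be turned into an extraction prompt and how a reply might be parsed. The prompt wording, the JSON reply format, and both function names are assumptions made for illustration; they are not the prompt used in the experiment, and the call to the model itself is deliberately left out.

```python
import json


def build_disease_prompt(sample_metadata):
    """Render sample metadata into an instruction prompt (hypothetical wording)."""
    metadata_block = "\n".join(
        f"{key}: {value}" for key, value in sorted(sample_metadata.items())
    )
    return (
        "You are curating omics sample metadata.\n"
        "Identify the disease studied in the sample described below.\n"
        'Reply with JSON only, in the form {"disease": "<name>"}; '
        'use "none" if no disease is mentioned.\n\n'
        f"Sample metadata:\n{metadata_block}"
    )


def parse_disease_reply(reply_text):
    """Return the lowercased disease label from a JSON reply, or None if unparseable."""
    try:
        payload = json.loads(reply_text)
    except json.JSONDecodeError:
        return None
    if not isinstance(payload, dict):
        return None
    disease = payload.get("disease")
    return disease.strip().lower() if isinstance(disease, str) else None
```

Asking for a fixed reply format keeps the parsing step simple, and returning None for anything unparseable lets such samples be routed to manual review instead of silently receiving a wrong label.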

The BERT pipeline performed significantly worse than ChatGPT with the developed prompt, and creating and testing the prompt took notably less development time than building the BERT pipeline.
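A comparison like this comes down to scoring each system's labels against a manually curated test set. The sketch below shows one simple way to do that with exact-match accuracy. The metric choice, the function name, and the idea of matching case-insensitively are assumptions for illustration, not a description of the scoring used in the experiment.

```python
def accuracy(predictions, gold_labels):
    """Fraction of samples whose predicted label matches the gold label.

    Matching ignores case and extra whitespace.
    """
    if len(predictions) != len(gold_labels):
        raise ValueError("predictions and gold_labels must have the same length")
    if not gold_labels:
        raise ValueError("the test set is empty")

    def clean(label):
        return " ".join(label.lower().split())

    matches = sum(
        clean(predicted) == clean(gold)
        for predicted, gold in zip(predictions, gold_labels)
    )
    return matches / len(gold_labels)
```

Running the same function over the outputs of both systems on the same test set gives directly comparable numbers. Exact matching is strict, so in practice labels would usually be normalized to a controlled vocabulary before scoring.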

Experiment 2

In the second experiment, the objective was a classification task: identifying the presence or absence of a donor in a given experiment. The experimental setup remained the same, with a fine-tuned BERT model alongside ChatGPT with a prompt, allowing a direct comparison between the two. BERT proved highly effective at this task and yielded excellent results.

Even so, ChatGPT outperformed BERT in this scenario, although developing the prompt took relatively long. ChatGPT also offered the added advantage that the prompt does not depend on the data source used for testing the model, which reduces development time in the long run.
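For a yes/no task such as donor presence, a single accuracy figure can hide how a model fails, so precision and recall are often reported as well. The sketch below computes them for boolean predictions. The function name and the choice of metrics are assumptions for illustration; the source does not state which metrics were used in this experiment.

```python
def binary_metrics(predictions, gold_labels):
    """Precision, recall, and F1 for boolean predictions (True = donor present)."""
    if len(predictions) != len(gold_labels):
        raise ValueError("predictions and gold_labels must have the same length")
    pairs = list(zip(predictions, gold_labels))
    true_pos = sum(1 for predicted, gold in pairs if predicted and gold)
    false_pos = sum(1 for predicted, gold in pairs if predicted and not gold)
    false_neg = sum(1 for predicted, gold in pairs if not predicted and gold)

    # Guard against division by zero when a class never occurs.
    precision = true_pos / (true_pos + false_pos) if true_pos + false_pos else 0.0
    recall = true_pos / (true_pos + false_neg) if true_pos + false_neg else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if precision + recall
        else 0.0
    )
    return {"precision": precision, "recall": recall, "f1": f1}
```

Precision shows how often a predicted donor is really there, and recall shows how many true donors were found. Reporting both makes the comparison between a fine-tuned model and a prompt more informative than accuracy alone.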

The potential impact of ChatGPT is substantial, with the prospect of significant time and resource savings on the horizon. While the technology is still in its early stages, the future looks promising as ChatGPT holds the key to unlocking novel possibilities and enhancing efficiency within the curation workflow at Elucidata. By harnessing the power of this advanced language model, the company stands to experience transformative changes in its information extraction endeavors.

Understand more about our ML-ready omics datasets and discover how our innovative solutions can optimize your research workflows. Together, let's unlock new frontiers in data-driven drug R&D.

Book a demo to learn more!
