ChatGPT in Drug Discovery

Vishal Samal, Shrushti Joshi
July 5, 2023
In today's digital age, where information is abundant and constantly evolving, curation has become more crucial than ever. Curators are the gatekeepers of the vast sea of knowledge, sifting through the overwhelming volume of data to deliver relevant, valuable, and engaging content to their audience. But with the exponential growth of information, curators face the challenge of efficiently and effectively navigating this vast landscape.

Enter ChatGPT, an AI-powered language model developed by OpenAI and launched in November 2022.

With its advanced natural language processing capabilities, ChatGPT helps curators streamline their processes, find, evaluate, and organize information more easily, and ultimately provide a better experience for their readers. ChatGPT has captivated the public's imagination like few other innovations. Its progress has surprised the machine learning community, surpassing the previous benchmark set by the BERT models released in 2018.

“It won’t be a surprise to see, in the next 24 months, multiple billion-dollar companies built on top of OpenAI’s foundational models. The startups that will be the most successful won’t be the best at prompt engineering, which is the focus today; instead, success will be found in what novel data and use cases they incorporate into OpenAI’s models. This anonymous data and application will be the moat that establishes the next set of AI unicorns.” ~David Shim

The Life Sciences community is particularly interested in understanding the implications of ChatGPT for their work. In this blog, we dive into the world of curation and explore how ChatGPT can revolutionize this practice.

Why is Data Curation Important?

As public data repositories accept data in flexible arrangements, significant variations arise in how data is submitted. Consequently, intelligent systems are necessary to extract and categorize pertinent information from the provided metadata on these repositories.

Elucidata specializes in ingesting and providing omics datasets from diverse sources in standardized machine learning (ML)-ready formats to expedite drug development.

To address this challenge, Elucidata employs Biological Natural Language Processing (BioNLP) systems to curate its platform’s vast array of metadata, thereby automating the process. This system comprises two primary components:

  • One is responsible for extracting relevant information from public data.
  • The other harmonizes the extracted data to a standardized vocabulary.

By leveraging these BioNLP systems, Elucidata establishes a consistent format across all its diverse data sources, significantly reducing the effort required to render public data usable. This standardized approach not only enhances the efficiency of data curation but also contributes to accelerating research and analysis in the field of drug development.

Data Curation Using BioNLP Systems (Before ChatGPT)

To streamline and standardize the curation process, we employ BioNLP systems that automatically extract relevant entities from metadata, abstracts, and publications.

Workflow of Curation Process using BioNLP Systems

The high-level process of training a model involves several key steps.

  • First comes the field definition and manual curation phase, where the field to be extracted is defined and guidelines are created to generate training data. This data undergoes a double-blinded review process to ensure its reliability for training task-specific models.
  • In the training phase, a large corpus of texts, such as publications, is preprocessed by dividing them into smaller paragraphs. These paragraphs are then used to train BERT models specific to the task.
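The paragraph-splitting step in the training phase can be pictured with a minimal sketch. The function name `split_into_paragraphs`, the blank-line delimiter, and the 100-word cap are assumptions chosen for illustration; the post does not describe the actual preprocessing rules.

```python
def split_into_paragraphs(text: str, max_words: int = 100) -> list[str]:
    """Split a document on blank lines, then cap each chunk at max_words words.

    Hypothetical helper: the delimiter and the word cap are illustrative choices.
    """
    chunks = []
    for paragraph in text.split("\n\n"):
        words = paragraph.split()
        # Break long paragraphs into consecutive windows of at most max_words words.
        # Empty paragraphs produce no chunks because the range is empty.
        for start in range(0, len(words), max_words):
            chunks.append(" ".join(words[start:start + max_words]))
    return chunks
```

Each returned chunk would then serve as one training example for a task-specific model.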

BERT (Bidirectional Encoder Representations from Transformers) is a state-of-the-art natural language processing (NLP) model introduced by researchers at Google AI in 2018. It revolutionized the field of NLP by significantly advancing the understanding of contextual language representations.

Workflow of Normalization Process
  • After the models are trained, they are tested using a separate dataset known as test data.
  • The accuracy of the models is evaluated, and if they meet the desired performance standards, they can be used for curation in production. However, it is often necessary to iterate on the training process or obtain more training data to improve the models. Typically, around 5 to 15 iterations are required to develop "production-ready" models that can be deployed.
  • In the final stage, the extracted information is standardized using specific ontologies such as MeSH, PubChem, BTO, and others. This process yields "normalized entities," terms selected from a regulated dictionary.
  • In production, relevant entities are extracted from metadata, publications, and abstracts using the task-specific model that was trained earlier. To ensure consistency, the extracted entities are then mapped to standard ontologies by a dedicated "normalization" model trained explicitly for that purpose.
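The idea of mapping an extracted mention to a "normalized entity" can be illustrated with a much simpler stand-in than the trained normalization model described above: a dictionary lookup with fuzzy matching from Python's standard `difflib` module. The vocabulary, the `ONT:` identifiers, and the function name `normalize_entity` are made up for illustration; they are not real ontology entries or Elucidata's implementation.

```python
import difflib

# Hypothetical controlled vocabulary: lowercase synonym -> (placeholder ID, preferred term).
VOCABULARY = {
    "breast cancer": ("ONT:0001", "Breast Neoplasms"),
    "breast carcinoma": ("ONT:0001", "Breast Neoplasms"),
    "type 2 diabetes": ("ONT:0002", "Diabetes Mellitus, Type 2"),
    "t2d": ("ONT:0002", "Diabetes Mellitus, Type 2"),
}


def normalize_entity(mention, cutoff=0.85):
    """Return (ID, preferred term) for a mention, or None if nothing is close enough."""
    # Lowercase and collapse whitespace so trivial variants match exactly.
    key = " ".join(mention.lower().split())
    if key in VOCABULARY:
        return VOCABULARY[key]
    # Fall back to the closest synonym by string similarity (handles small typos).
    close = difflib.get_close_matches(key, list(VOCABULARY), n=1, cutoff=cutoff)
    return VOCABULARY[close[0]] if close else None
```

A lookup like this only covers spellings close to a known synonym, which is one reason a trained model is used for this step in practice.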

There are two limitations to using the current process for developing models.

  • First, each model can curate only one field, so development time limits how many new fields we can add.
  • Second, BERT's architecture and size impose limitations, making it relatively less effective at understanding context and extracting information.

Enhancing Curation with ChatGPT

ChatGPT has two advantages over BERT.

  1. ChatGPT is a large language model (LLM) designed to follow instructions, making it flexible in the tasks it can perform.
  2. Being an LLM, it achieves higher accuracy on those tasks.

One of the initial applications we explored with ChatGPT is its use in curating various fields using prompts. Our experimentation has revealed that ChatGPT performs better than BERT-based models while requiring significantly less development time.

Experiment 1

To evaluate this, we conducted an experiment on information extraction, specifically extracting disease information from samples within an omics dataset. We selected datasets from GEO (Gene Expression Omnibus) to create a test set for comparison.

Both BERT and ChatGPT were employed to extract disease labels. For BERT, we utilized a custom pipeline designed explicitly for disease extraction. For ChatGPT, we used a prompt that provided instructions on how to extract the disease from the metadata.
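To make the prompt-based approach concrete, here is a minimal sketch of how a sample's metadata might be turned into an extraction prompt. The template wording, the function name `build_disease_prompt`, and the example field names are assumptions; the prompt actually used in the experiment is not given in this post, and the call that sends the prompt to the model is deliberately left out.

```python
# Hypothetical prompt template; the real prompt used in the experiment is not published here.
PROMPT_TEMPLATE = (
    "You are curating omics sample metadata.\n"
    "Extract the disease studied in the sample described below.\n"
    "Answer with the disease name only, or 'none' if no disease is mentioned.\n\n"
    "Sample metadata:\n{metadata}\n\nDisease:"
)


def build_disease_prompt(sample_metadata: dict) -> str:
    """Render sample metadata as 'field: value' lines inside the template."""
    # Sort the fields so the same sample always yields the same prompt.
    lines = [f"{field}: {value}" for field, value in sorted(sample_metadata.items())]
    return PROMPT_TEMPLATE.format(metadata="\n".join(lines))
```

Changing the field to curate then means editing the instructions in the template rather than training a new model.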

Accuracy of Sample Level Disease Labels
Development Time for Sample Level Disease Labels

The BERT pipeline performed significantly worse than the prompt-based ChatGPT approach, and creating and testing the prompt took notably less development time than building the BERT pipeline.
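A comparison like this rests on scoring each system's labels against the manually curated test set. The sketch below shows one simple way to do that, assuming exact matching after lowercasing and trimming whitespace; the scoring rule used in the experiment is not stated in this post, and the function name `accuracy` is illustrative.

```python
def accuracy(predicted, gold):
    """Fraction of samples whose predicted label matches the curated label.

    Assumes a case-insensitive exact match; the real evaluation may be more lenient.
    """
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    if not gold:
        raise ValueError("cannot compute accuracy on an empty test set")
    matches = sum(p.strip().lower() == g.strip().lower() for p, g in zip(predicted, gold))
    return matches / len(gold)
```

Running the same function on the BERT outputs and on the ChatGPT outputs for the same test set gives two directly comparable numbers.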

Experiment 2

In the second experiment, the objective was a classification task involving identifying the presence or absence of a donor in a given experiment. BERT proved to be highly effective in this task, yielding excellent results. The experimental setup remained consistent, utilizing a fine-tuned BERT model alongside ChatGPT with a prompt, facilitating a direct comparison between the two.

Despite the relatively long development time required, ChatGPT emerged as the superior choice in this scenario, outperforming BERT. ChatGPT also offered the advantage of being independent of the data source used for testing the model, thereby reducing development time in the long run.
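For a presence-or-absence task like this one, the model's free-text reply has to be turned into a binary label before it can be scored. The sketch below assumes the prompt asks for an answer beginning with "yes" or "no"; that convention and the function name `parse_donor_answer` are illustrative assumptions, not details from the experiment.

```python
from typing import Optional


def parse_donor_answer(reply: str) -> Optional[bool]:
    """Map a free-text reply to True (donor present), False (absent), or None (unclear)."""
    # Replace common punctuation with spaces so "Yes." and "no," reduce to a bare word.
    words = reply.strip().lower().replace(".", " ").replace(",", " ").split()
    first = words[0] if words else ""
    if first == "yes":
        return True
    if first == "no":
        return False
    # Anything else is treated as unparseable and can be flagged for manual review.
    return None
```

Returning `None` instead of guessing keeps ambiguous replies out of the accuracy figure until a curator has looked at them.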

Accuracy of Donor Labels
Development Time for Donor Labels

The potential impact of ChatGPT is substantial, with the prospect of significant time and resource savings on the horizon. While the technology is still in its early stages, the future looks promising as ChatGPT holds the key to unlocking novel possibilities and enhancing efficiency within the curation workflow at Elucidata. By harnessing the power of this advanced language model, the company stands to experience transformative changes in its information extraction endeavors.

Understand more about our ML-ready omics datasets and discover how our innovative solutions can optimize your research workflows. Together, let's unlock new frontiers in data-driven drug R&D.

Book a demo to learn more!


FAQs

What are the key benefits of using Polly for gene target prioritization in patient stratification?

  • Data-Driven Target Selection: Polly integrates multi-omics data to identify key genes relevant to patient subgroups.
  • Accelerated Drug Discovery: The platform prioritizes targets based on disease associations and biomarker relevance, expediting the discovery and validation process.
  • Improved Reproducibility: Harmonized datasets ensure reliable and reproducible findings for target validation.

How does Polly help in training classifier models for patient stratification?

Polly provides pre-processed, harmonized datasets that enable AI/ML model training for patient classification. It supports feature selection, dimensionality reduction, and validation workflows to build robust predictive models for precision medicine applications.

How does Polly assist in defining genetic signatures for different stages of cell differentiation?

Polly analyzes both single-cell and bulk multi-omics data to identify stage-specific genetic markers. By applying machine learning algorithms to detect patterns in gene expression, Polly helps researchers map lineage differentiation and gain insights into disease progression.

What is the process of creating a disease-specific atlas using Polly’s harmonization engine?

Polly builds disease-specific atlases by:

  1. Aggregating multi-omics datasets from curated sources.
  2. Harmonizing data using standardized ontologies.
  3. Annotating datasets with clinical metadata.
  4. Structuring the information into disease-specific cohorts for targeted biomarker and therapeutic research.

How does Polly integrate multiple data types for more reliable patient stratification?

Polly integrates genomics, transcriptomics, proteomics, and clinical data into a unified, multi-dimensional view of patient populations. This helps researchers uncover complex biological relationships and enhances predictive modeling for patient subgroups.

Can Polly handle data quality issues and unstructured data from public repositories?

Yes, Polly automatically processes raw, unstructured data from public sources, addressing missing values, batch effects, and inconsistencies. Its machine learning–driven pipelines filter out noise and standardize data, ensuring higher-quality datasets for seamless analysis.

How does Polly harmonize multi-omic datasets to improve the quality of patient stratification?

Polly's harmonization engine normalizes, processes, and integrates diverse datasets using standard ontologies and metadata frameworks. This ensures consistency, removes batch effects, and enhances the reliability of downstream analyses for precise patient classification.

How does Elucidata's Polly help in overcoming the challenges of patient stratification?

Polly streamlines patient stratification by:

  • Harmonizing and Integrating Multi-omics Data: Polly standardizes data across different sources, making it analysis-ready.
  • Curating High-quality Datasets: The platform ensures datasets are clean, structured, and well-annotated, thereby improving the reliability of downstream analyses.
  • Enabling AI-driven Insights: Polly applies machine learning models to uncover patterns and classify patients effectively.
  • Ensuring Reproducibility and Scalability: Automated pipelines and version-controlled workflows allow for efficient scaling to large datasets while maintaining detailed records of each analysis step, making it easier to reproduce or modify results.

What challenges do researchers face when performing patient stratification using multi-omics data?

Researchers encounter several challenges, including:

  • Data Heterogeneity: Multi-omics data come from different platforms, making integration complex.
  • Data Quality Issues: Public datasets often contain missing values, noise, or inconsistencies.
  • Computational Complexity: Large-scale multi-omics data require significant computational power and expertise to process.
  • Interpretability: Even with powerful analytical methods, extracting clear and meaningful biological insights from high-dimensional data remains a significant challenge.

What is patient stratification, and why is it important for precision medicine?

Patient stratification is the process of categorizing patients into subgroups based on genetic, molecular, or clinical characteristics. This approach is crucial for precision medicine because it identifies which patient populations are most likely to respond to specific treatments, thereby improving therapeutic outcomes and reducing the risk of adverse effects.

What are the key advantages of using Polly for transcriptome profiling and biomarker identification?

Polly provides access to a curated repository of RNA-seq datasets that are consistently processed and enriched with metadata. This harmonization allows researchers to efficiently search for datasets with similar transcriptional profiles, facilitating transcriptome profiling and biomarker identification.

What methodologies does Polly use to identify synergistic drug combinations?

Polly utilizes signature reversal and multivariate gene expression signatures to predict potential drug combinations. By analyzing publicly available transcriptomics data and drug signatures, Polly can identify drugs or compounds that may have therapeutic effects by reversing disease signatures.

How does Polly rank datasets similar to a gene signature query?

Polly ranks similar datasets using cosine similarity scores, which measure how closely a dataset's transcriptional profile matches the query signature. This helps researchers quickly find relevant datasets for further analysis and validation.
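As a rough illustration of ranking by cosine similarity, the sketch below scores a few toy expression profiles against a query signature. The vectors, dataset names, and function names are invented for the example, and it assumes each profile is a numeric vector over the same ordered gene list as the query; it is not Polly's implementation.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    # Define the similarity with an all-zero vector as 0 to avoid dividing by zero.
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_datasets(query, profiles):
    """Return (dataset name, score) pairs sorted from most to least similar."""
    scores = {name: cosine_similarity(query, vector) for name, vector in profiles.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)
```

A score near 1 means the dataset's profile points in the same direction as the query signature, a score near 0 means the two are unrelated, and a score near -1 means the profile is reversed.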

What steps are involved in creating a query gene signature on Polly?

Researchers define the biological process of interest, select a dataset, preprocess the data, identify differentially expressed genes, and validate the signature. Polly’s platform streamlines this process with expert support and ML-ready datasets.

How does Polly's RNA-Seq Atlas simplify gene signature analysis?

Polly's RNA-Seq Atlas addresses challenges in extracting associated signatures from public databases by providing a curated resource of RNA-seq datasets collected from the Gene Expression Omnibus (GEO). This richly curated resource helps researchers to find datasets with similar transcriptional profiles to their gene sets of interest.

What is gene signature comparison, and why is it important in drug discovery?

Gene signature comparison analyzes gene expression patterns to identify disease-related signatures. It helps researchers find drugs that can reverse disease signatures, aiding in therapeutic discoveries.