
GPT & Drug Discovery: Rise of Generative Models

Jayashree
December 19, 2023

In the quest for innovative solutions to some of the world's most pressing health challenges, the convergence of cutting-edge technology and scientific research has given rise to a revolutionary approach in pharmaceuticals and biotechnology: generative models, with GPT (Generative Pre-trained Transformer) leading the charge.

At our annual event, DataFAIR 2023, we had the privilege of hosting an insightful panel discussion, "GPT & Drug Discovery: Rise of Generative Models", focused on the burgeoning impact of generative models in this critical domain.

Our panelists explored the diverse applications of these models across various facets of drug discovery, including genomics, structural biology, and protein design.

One key theme that emerged from the discussion was the central role of data quality and its profound influence on the successful utilization of generative models. Here we have summarized some key concepts that came up during the discussion.

Q: In light of the powerful tools such as AlphaFold, Llama models, and GPT models that often seem like magic, is there still a necessity to verify and curate data?

[Panel]: In a world where cutting-edge tools like AlphaFold, Llama models, and GPT appear to work like magic, it's tempting to believe they can handle any input or data we throw at them flawlessly. However, the reality is more nuanced. When we feed these models unclean or poorly curated data, they learn from that data and produce outputs that mirror its quality. In simple terms: garbage in, garbage out.

To mitigate this issue, some opt to train these models with vast amounts of data, assuming that while some of it may be unclean or inaccurate, the majority will be reliable. But this assumption doesn't always hold true. Instead of focusing on quantity, a smarter approach is to prioritize data quality. It's about ensuring data is findable, accessible, interoperable, and reusable (FAIR), making it consistent and clean for both humans and AI models.

The same principle applies to training the models. By using clean and consistent data, even in smaller quantities, we can enhance their learning. In summary, verifying data before training or fine-tuning models leads to more accurate and relevant results. This, in turn, builds trust in the models, saving researchers valuable time and ultimately making humans more efficient with these powerful tools.

While these AI models may seem like magic, they are only as good as the data they're trained on. Data quality remains a crucial factor in harnessing their true potential.
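To make "verify before you train" concrete, here is a minimal curation sketch in Python with pandas. The table and its column names (smiles, target_id, activity_nM) are hypothetical placeholders; a real FAIR pipeline would also attach persistent identifiers, ontology mappings, and provenance metadata.

```python
import pandas as pd

def curate(df: pd.DataFrame) -> pd.DataFrame:
    """Basic quality checks applied before a table is used for
    training or fine-tuning. Column names are hypothetical."""
    df = df.drop_duplicates()                                      # duplicates silently over-weight examples
    df = df.dropna(subset=["smiles", "target_id", "activity_nM"])  # drop rows missing learnable fields
    df["activity_nM"] = pd.to_numeric(df["activity_nM"], errors="coerce")
    df = df[df["activity_nM"].between(0, 1e6)]                     # discard implausible assay values
    df["target_id"] = df["target_id"].str.strip().str.upper()      # one spelling per target
    return df.reset_index(drop=True)

# Toy assay table standing in for a real export.
raw = pd.DataFrame({
    "smiles": ["CCO", "CCO", "c1ccccc1", None],
    "target_id": ["egfr ", "egfr ", "KRAS", "TP53"],
    "activity_nM": [120, 120, -5, 300],
})
clean = curate(raw)
print(f"kept {len(clean)} of {len(raw)} rows")  # kept 1 of 4
```

Even a handful of checks like these keeps duplicated or implausible records from being learned as if they were signal.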

Q: What strategies can be devised to integrate generative models, foundational models, or large language models into existing R&D systems or workflows?

[Panel]: The current focus largely centers on data summarization: building conversational agents that engage with vast datasets, whether text, regulatory documents, images, or other modalities. However, the true potential of generative AI lies in understanding the language of biology.
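As a minimal sketch of the retrieval step behind such a conversational agent, the snippet below ranks a toy document set against a question using TF-IDF from scikit-learn. The corpus is purely illustrative; a production system would pair a retriever like this with a generative model that summarizes or answers over the returned documents.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy corpus standing in for regulatory documents, papers, reports, etc.
docs = [
    "Phase II trial results for compound X in oncology.",
    "Regulatory submission guidelines for biologics.",
    "Protein expression data from liver tissue samples.",
]

vectorizer = TfidfVectorizer(stop_words="english")
doc_vectors = vectorizer.fit_transform(docs)

def retrieve(question: str, k: int = 2) -> list[str]:
    """Return the k documents most relevant to the question;
    a generative model would then answer over these."""
    q_vec = vectorizer.transform([question])
    scores = cosine_similarity(q_vec, doc_vectors)[0]
    top = scores.argsort()[::-1][:k]
    return [docs[i] for i in top]

print(retrieve("What were the oncology trial outcomes?"))
```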

Recent innovations, such as AlphaFold and BioBERT, have empowered us to generate new protein structures and small molecules. One fascinating avenue that's gaining traction is the development of foundational models for target discovery. These models leverage multimodal data to identify potential therapeutic targets across various domains.

In the realm of generative AI, the most exciting prospects lie in drug discovery and research and development (R&D). This field is poised to witness a surge in generative models, foundational models, and large language models in the coming years.

Yet, it's essential to remember that not all problems necessitate a massive model with a trillion parameters. Smaller models can often suffice, and it's crucial to identify the right model for the specific task. Once you've pinpointed the model that fits, incremental improvements can be made throughout the R&D process, from simplifying data annotation for scientists to automating systematic reviews of entire research areas.
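As an illustration of this "right-sized model" point, the sketch below pre-annotates an abstract with a compact named-entity tagger via the Hugging Face pipeline API. The checkpoint shown is a general-purpose NER model chosen only because it is small and public; a real deployment would swap in a biomedical model validated for the task.

```python
from transformers import pipeline

# Load a compact, task-specific tagger. "dslim/bert-base-NER" is a
# general-purpose checkpoint used here only for illustration; substitute
# a biomedical NER model validated for your domain.
ner = pipeline(
    "token-classification",
    model="dslim/bert-base-NER",
    aggregation_strategy="simple",
)

abstract = "Imatinib inhibits BCR-ABL kinase activity in chronic myeloid leukemia."

# Pre-annotate entities so scientists review suggestions instead of
# labeling every document from scratch.
for entity in ner(abstract):
    print(entity["word"], entity["entity_group"], round(float(entity["score"]), 3))
```

A model this size runs on a laptop, which is often all the "generative AI integration" a given annotation workflow actually needs.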

The key takeaway here is to match the task with the appropriate model and gradually enhance the process to yield meaningful results. This approach ensures efficiency and precision in tackling research challenges.

It bears repeating that not every problem requires an extensive model. Simpler, smaller models can often provide accurate solutions, particularly when coupled with high-quality data and smart task alignment. For instance, pharmaceutical companies are inclined to adopt specialized models tailored to their specific needs. While task-specific models can be advantageous, it's vital to maintain data transparency and interoperability to foster knowledge-sharing within the scientific community. This ensures that the benefits of generative AI are not constrained by task-specific silos.

Q: Would adopting a federated learning approach, where models are trained collaboratively across organizations, offer any advantages over a closed, company-internal training environment? Could this method indirectly leverage the accumulated data of other companies for your benefit?

[Panel]: Indeed, the concept of federating generative AI models across organizations, companies, and institutions is gaining momentum and is a compelling avenue to explore. There is a growing number of companies, vendors, and institutions actively engaged in this domain, working towards collaborative training environments. This approach signifies a promising way forward, enabling the consolidation of knowledge from diverse sources.

Taking it a step further, it's worth considering whether federated learning could extend to the edge, particularly for pocket-sized large language models. These models, designed to operate on smartphones, are already a reality today, and the potential applications are virtually limitless.

However, it's essential to navigate this territory carefully, as it raises complex questions about data ownership, access, and control. Who can utilize specific data, and who holds ownership rights? The landscape is intricate, and resolving these challenges will be an ongoing process.

One potential approach within federated learning is to allow models to travel to the data, rather than the data leaving its original location. This way, sensitive and confidential data can remain secure within its premises while enabling models to glean specific insights from the data and return with the knowledge gained. This approach is not ideal, but it offers a compromise that respects data privacy and security.
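Here is a minimal FedAvg-style sketch of that idea in Python with NumPy, assuming a toy linear model and synthetic per-site data: each site trains locally on data that never leaves its premises, and only weight updates are aggregated. A real deployment would add secure aggregation, differential privacy, and contractual controls on top.

```python
import numpy as np

def local_update(weights: np.ndarray, X: np.ndarray, y: np.ndarray,
                 lr: float = 0.1, epochs: int = 5) -> np.ndarray:
    """One site's training step: the model visits the data; the data never
    leaves. Here, a linear model trained by gradient descent locally."""
    w = weights.copy()
    for _ in range(epochs):
        grad = X.T @ (X @ w - y) / len(y)  # squared-error gradient
        w -= lr * grad
    return w

def federated_round(global_w, site_datasets):
    """Send global weights to every site, then average the returned
    updates weighted by local dataset size (FedAvg-style)."""
    updates, sizes = [], []
    for X, y in site_datasets:
        updates.append(local_update(global_w, X, y))
        sizes.append(len(y))
    return np.average(updates, axis=0, weights=np.array(sizes, dtype=float))

# Toy setup: three "organizations", each holding private samples of the same task.
rng = np.random.default_rng(0)
true_w = np.array([2.0, -1.0])
sites = []
for n in (40, 60, 25):
    X = rng.normal(size=(n, 2))
    sites.append((X, X @ true_w + rng.normal(scale=0.1, size=n)))

w = np.zeros(2)
for _ in range(20):
    w = federated_round(w, sites)
print("recovered weights:", np.round(w, 2))  # close to [2.0, -1.0]
```

The raw arrays stay inside each site's loop; only the learned weights cross organizational boundaries, which is the compromise described above.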

The core principle that underlines this discussion is the importance of keeping data grounded where it originates while allowing secondary uses, such as AI, to have a means of access. Simply requesting organizations to relinquish their confidential data and placing trust in external entities is no longer a viable solution. The focus should be on fostering trust through a distributed and secure approach.

Furthermore, we've witnessed unfortunate incidents where sensitive data has been mishandled. The emphasis should be on safeguarding scientific data, medical records, and other critical information to prevent lapses in security and confidentiality.

In summary, the prospect of federated learning for generative AI models presents intriguing possibilities, but it necessitates a thoughtful, ethical, and secure approach to data management and collaboration.

Q: How should we think about applying large language models and foundational models to extract knowledge from data?

[Panel]: The assumption that large language models will extract knowledge from the data doesn't account for their inability to recognize gaps in their knowledge. Instead, they always attempt to fill those gaps, which is counterproductive for us. Those gaps in knowledge are precisely what we need to identify, as they highlight the areas where we lack a clear understanding.

When explaining the precise mechanism of action of a drug, for example, some components may remain disconnected from the others. These gaps are of paramount importance, as they signify where further experimentation is required. To bridge them, one must plan and execute experiments in the lab, connecting elements that were previously disjointed.
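One simple, illustrative way to surface such gaps is to treat the known evidence as a graph and look for disconnected components, as in the toy sketch below (Python with networkx); the entities and edges are hypothetical.

```python
import networkx as nx

# Toy mechanism-of-action graph: nodes are entities (drug, proteins,
# pathway, phenotype); edges are experimentally supported links.
G = nx.Graph()
G.add_edges_from([
    ("drug", "kinase_A"),
    ("kinase_A", "pathway_P"),
    ("pathway_P", "phenotype"),
    ("protein_B", "protein_C"),  # evidence exists, but nothing ties it to the drug
])

# Disconnected components flag exactly the gaps described above:
# clusters of evidence not yet linked to the main mechanism.
components = list(nx.connected_components(G))
if len(components) > 1:
    main = max(components, key=len)
    for comp in components:
        if comp is not main:
            print("Gap: no experimental link connecting", sorted(comp),
                  "to the main mechanism", sorted(main))
```

Each isolated component is a candidate for the next lab experiment, not something a language model should be allowed to paper over.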

I see an analogy here. Many individuals use chatbots or AI models to generate content, such as essays or poems. It's relatively straightforward to verify these outputs. If an essay has gaps or inaccuracies, a human can review and rectify them. However, in the domain of biomedical research, verifying the insights provided by these models is far more complex. It may take weeks or even months to determine their accuracy and identify the nature of any gaps. Therefore, human expertise remains indispensable for assessing, filling, and validating these gaps.

In reality, we should never aim to completely eliminate human involvement. Our expertise and oversight are essential for validating and understanding the output of generative AI models. There is a symbiotic relationship between human expertise and AI, and they complement each other in the pursuit of scientific advancement.

Q: Is it necessary to develop a distinct model specifically designed to identify gaps within the existing body of knowledge?

[Panel]: It's important to recognize that no model is perfect, and there will always be limitations and challenges in developing such systems. While these models can certainly help improve the quality of AI-generated content, they may not be able to capture all the nuances or gaps. The need for human expertise and intervention is still crucial for verifying, correcting, and enhancing AI-generated outputs.

So, while AI models can be valuable enablers and catalysts in various domains, including scientific research, they are not a replacement for human expertise and critical thinking. Instead, they should be viewed as tools that work alongside humans to enhance decision-making, provide insights, and assist in filling knowledge gaps.

During our annual event, DataFAIR 2023, we delved deeper into the realm of Generative AI. Watch the event here to tap into the insights of leading experts in the Biopharma/Biotech industry as we explore the groundbreaking possibilities of Generative AI in the field of drug discovery. Don't miss out on the opportunity to learn from the best minds in the business!

Reach out to us at info@elucidata.io for more information or watch the full panel discussion here.
