
Creating AI-Ready Datasets for Foundation Models in Life Sciences R&D

Introduction

When an apple fell on Sir Isaac Newton’s head, it sparked the discovery of fundamental physical laws and reshaped our understanding of the physical world. Since then, biologists have sought to uncover universal principles that explain the complexities of life. However, the immense diversity of biological systems makes it difficult to establish overarching rules that apply across all contexts. Over time, biologists have come to accept that the Newtonian apple may never fall.

Instead, biological research has followed a winding path with its own triumphs and pitfalls. Within this evolving landscape, Artificial Intelligence (AI) has emerged as a powerful tool for addressing complex biological questions. AI not only automates time-consuming tasks but also has the potential to change the scientific method itself. Early AI models were specialists, trained to perform specific tasks with high accuracy. In contrast, foundation models, the latest generation of AI models, are generalists. Trained on vast datasets, they can perform multiple tasks, significantly enhancing their utility. These models can identify hidden patterns within data, accelerating scientific discovery by complementing traditional hypothesis-driven research with data-driven hypothesis generation and in silico validation, ultimately saving time and effort.

Two features give foundation models this transformative power: 1) they are pretrained on large and diverse datasets, and 2) they can be fine-tuned for various downstream applications. Their exposure to vast amounts of information allows them to develop a broad understanding of biological data, while their scale and transfer learning capabilities make them highly adaptable to diverse tasks.[1]

In this blog, we will explore the role of foundation models in life sciences research, the need for AI-ready datasets, the challenges in creating them, and how Elucidata is addressing these challenges to support both biologists and data scientists.

Properties of Foundation Models

Foundation models were originally developed within the field of Natural Language Processing, driven by advancements in model architectures and a shift in the machine learning paradigm.[1] These models have since found applications in life sciences, particularly driving insights in drug and biomarker discovery, personalized medicine, clinical trials, and research in spatial biology.

Model Architecture

AI models have evolved significantly over the years, from traditional machine learning algorithms built on hand-crafted rules and features to deep learning models such as convolutional neural networks (CNNs) and recurrent neural networks (RNNs). While CNNs excelled at processing spatial data like images, and RNNs were designed to handle sequential data such as time series and text, both had limitations in capturing long-range dependencies and contextual relationships in complex datasets.

The advent of transformer models, built on self-attention mechanisms, allowed models to weigh the importance of different input elements relative to one another. This capability enabled transformers to process entire sequences in parallel, capturing relationships across long sequences more effectively than previous architectures. Key transformer-based models such as Bidirectional Encoder Representations from Transformers (BERT) and Generative Pre-trained Transformers (GPT) provided the framework for the development of foundation models, which generalize across various domains, including life sciences.
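
To make the idea concrete, here is a minimal sketch of scaled dot-product self-attention, the core operation described above. The dimensions and random inputs are purely illustrative; real models stack many attention heads and layers.

```python
import numpy as np

def self_attention(X, W_q, W_k, W_v):
    """Scaled dot-product self-attention over a sequence of embeddings X."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    scores = Q @ K.T / np.sqrt(K.shape[-1])          # pairwise relevance of every element to every other
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax: attention weights sum to 1 per position
    return weights @ V                               # each output mixes information from the whole sequence

rng = np.random.default_rng(0)
seq_len, d_model = 6, 8                              # e.g. 6 tokens (nucleotides, genes, words), 8-dim embeddings
X = rng.normal(size=(seq_len, d_model))
W_q, W_k, W_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))
print(self_attention(X, W_q, W_k, W_v).shape)        # (6, 8): one context-aware vector per input element
```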

Deep Learning Paradigms

Another key aspect of foundation models is their reliance on transfer learning and self-supervised learning. Transfer learning enables a model trained on one task (or dataset) to transfer what it has learned to a different but related task with minimal additional training. For example, a model pretrained on genomic data can be fine-tuned to identify specific biomarkers for diseases, requiring only a limited amount of new, labeled data and accelerating progress.
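
A minimal PyTorch sketch of this fine-tuning pattern is shown below. The pretrained encoder here is a randomly initialized stand-in (in practice, real pretrained weights would be loaded from a checkpoint), and only a small task-specific head is trained on the limited labeled data.

```python
import torch
import torch.nn as nn

# Stand-in for a pretrained genomic encoder (hypothetical; a real one would be loaded from a checkpoint).
pretrained_encoder = nn.Sequential(nn.Linear(2000, 256), nn.ReLU(), nn.Linear(256, 128))
for p in pretrained_encoder.parameters():
    p.requires_grad = False                      # freeze pretrained weights; only the new head is trained

classifier_head = nn.Linear(128, 2)              # small task-specific head, e.g. biomarker present / absent
optimizer = torch.optim.Adam(classifier_head.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

# A handful of labeled examples can suffice because the encoder already captures general structure.
x = torch.randn(32, 2000)                        # toy batch of expression profiles
y = torch.randint(0, 2, (32,))
loss = loss_fn(classifier_head(pretrained_encoder(x)), y)
loss.backward()
optimizer.step()
```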

Self-supervised learning (SSL) allows a model to learn patterns from raw, unlabeled data by creating its own supervision signals. Rather than relying on manual annotations, SSL identifies structures within the data itself using pretext tasks, such as predicting missing elements or reconstructing inputs. Common SSL approaches include contrastive learning (learning by distinguishing between similar and dissimilar data), masked prediction (for example, BERT, which learns by predicting missing words in text), and denoising autoencoders (learning to reconstruct corrupted inputs). In life sciences, SSL enables models to learn from millions of protein sequences, capturing meaningful biological relationships without requiring extensive manual labeling.
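
The sketch below illustrates the masked-prediction flavor of SSL on a synthetic expression matrix: a fraction of values is hidden, and the model is trained to reconstruct them from the surrounding context. This is a toy illustration of the idea, not the training recipe of any particular published model.

```python
import torch
import torch.nn as nn

# Toy masked-prediction setup: hide a fraction of gene-expression values and train a model
# to reconstruct them from the unmasked context (no labels required).
batch = torch.rand(16, 500)                       # 16 cells x 500 genes (synthetic values)
mask = torch.rand_like(batch) < 0.15              # hide ~15% of entries
corrupted = batch.masked_fill(mask, 0.0)          # masked entries replaced by a placeholder value

model = nn.Sequential(nn.Linear(500, 256), nn.ReLU(), nn.Linear(256, 500))
pred = model(corrupted)
loss = ((pred - batch) ** 2)[mask].mean()         # loss only on the hidden entries: the self-supervision signal
loss.backward()
```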

Pre-training on Multi-Modal Data

A key feature of recent foundation models is their ability to learn from diverse data modalities, including text, video, images, and biological sequences. This capability is particularly important in spatial biology, where multi-modal data integration is fundamental. For instance, information about cellular features and spatial context is provided in the form of images, while multi-omics profiling provides information about molecular features in the form of data matrices. Foundation models like Starfysh and SpatialGlue use this rich multi-modal data to generate deeper insights into spatial biology research, with specific applications in precision medicine.
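
As a highly simplified picture of what multi-modal integration involves, the sketch below fuses per-spot image-derived features with matched expression profiles into one joint embedding. Models such as Starfysh and SpatialGlue use far more sophisticated fusion strategies; this only illustrates the shape of the problem.

```python
import numpy as np

rng = np.random.default_rng(1)
n_spots = 100
image_features = rng.normal(size=(n_spots, 64))       # e.g. CNN features of the tissue image around each spot
expression = rng.poisson(2.0, size=(n_spots, 2000))   # matched gene counts for the same spots

# Project each modality to a common dimension, then fuse (here: simple concatenation).
proj_img = rng.normal(size=(64, 32))
proj_expr = rng.normal(size=(2000, 32))
fused = np.concatenate([image_features @ proj_img,
                        np.log1p(expression) @ proj_expr], axis=1)
print(fused.shape)                                    # (100, 64): one joint embedding per spatial spot
```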

Role of Foundation Models in Life Science R&D

Foundation models have enabled groundbreaking applications in life sciences[2], including, but not limited to:

  • Protein Structure Prediction: Models like AlphaFold2 have largely solved protein structure prediction, accurately determining 3D structures from amino acid sequences, which is crucial for understanding protein function and drug design.
  • DNA and RNA Sequence Analysis: Transformer-based models like DNABERT and Uni-RNA are used to predict genomic elements and understand transcriptional regulation, aiding in gene annotation and variant effect prediction.
  • Single-Cell Analysis: Foundation models like scBERT and scFoundation process single-cell RNA sequencing data to generate embeddings that capture gene-gene and cell-cell relationships, facilitating cell type identification and functional annotation.
  • Virus Discovery: Models like ViraLM analyze viral genomes to identify novel viruses and understand viral evolution, enhancing pathogen detection and surveillance.
  • Applications in Medicine and Healthcare: Foundation models are significantly advancing healthcare by improving diagnostics and treatment predictions. Generalist medical AI (GMAI) models[3] are envisioned to analyze medical images in combination with patient data to detect diseases at early stages. Med-PaLM is designed to interpret complex medical literature and provide evidence-based recommendations, aiding healthcare professionals in making informed decisions.
  • Microbiome Research: Foundation models apply transfer learning techniques to microbiome data, improving the understanding of microbial communities and their impact on human health.

The Role of Data in Foundation Model Pretraining

Foundation models intrinsically learn to perform tasks by ingesting large and diverse datasets. For example, GPT-3 developed in-context learning, i.e. learning from text explanations given in prompts, almost by accident. This is the exciting, yet concerning, aspect of foundation models: one cannot predict what capabilities they may develop. The possibilities are endless, and while that can help solve unseen problems, AI, law, and ethics experts are equally worried that such models may create and propagate new problems.

This brings us to the core issue: if foundation models are trained on biased, incomplete, and inconsistent datasets, they will internalize these flaws, leading to inaccurate predictions and unreliable analyses. The flexibility of these models further amplifies the problem, as their application across diverse fields like healthcare and law risks perpetuating existing inequities rather than addressing them.

Within life sciences, incomplete and biased datasets can significantly impact drug discovery and medical healthcare pipelines. For instance, medical datasets have historically been skewed, with women being severely underrepresented. Similarly, genomic data is often biased toward certain racial groups, leading to an incomplete and inaccurate understanding of the human genome. This can hinder drug screening pipelines from the outset. 

On the flip side, richer and more diverse training datasets add to the capabilities of foundation models. For example, DNABERT, a DNA language model trained only on the human genome, works very well for predicting promoter regions and transcription factor binding sites from DNA sequences, but performs poorly at detecting genetic variants. The Nucleotide Transformer, trained on the genomes of 850 species, is by contrast excellent at identifying genetic variants and predicting mutation effects.[4]

One way AI developers can circumvent this problem is by regulating the quality and type of data fed to their models. Data quality and readiness for AI have to be monitored across multiple steps of data delivery, and this is where Elucidata can help. By delivering high-quality, AI-ready datasets through easily usable modules, Elucidata has the expertise to fast-track life science R&D pipelines.

What are AI-Ready Datasets?

AI-ready datasets are structured, cleaned, and harmonized data that meet the specific requirements of AI models. Unlike raw data, which is often fragmented, inconsistent, and unstructured, AI-ready datasets are meticulously curated to ensure their quality and usability for diverse AI-driven applications.

Key characteristics of AI-ready datasets include:

  • Consistency: Uniform formatting, standardized terminologies, and removal of redundancies.
  • High Quality: Accurate, complete, and relevant data minimizing noise and errors.
  • Harmonized Metadata: Comprehensive metadata annotation allows models to contextualize and interpret data effectively.
  • Multimodal Integration: Combining data from different sources, such as genomics, proteomics, imaging, and clinical records, to improve model training.

Steps to Build AI-Ready Datasets

1. Data Retrieval

As mentioned previously, the richer the data, the better the AI model. Thus, the first step is sourcing high-quality datasets from both public and in-house repositories. Public repositories like GenBank, PubMed, and TCGA provide valuable open-access data on genomics, proteomics, and clinical research. In-house datasets, often generated from proprietary experiments or clinical trials, bring unique value by adding specificity to the model's fine-tuning for downstream applications.
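
As a small illustration of programmatic retrieval from a public repository, the sketch below queries NCBI's GEO DataSets index via Biopython's Entrez interface. The search term and email address are placeholders, and the returned identifiers would still need to be reviewed, downloaded, and harmonized.

```python
from Bio import Entrez  # Biopython

# Placeholder contact email (NCBI requires one) and an illustrative search term.
Entrez.email = "researcher@example.org"
handle = Entrez.esearch(db="gds", term="single-cell RNA-seq AND pancreatic cancer", retmax=20)
record = Entrez.read(handle)
handle.close()

# Candidate GEO DataSet identifiers to review before downloading and harmonizing.
print(record["IdList"])
```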

2. Data Harmonization

Harmonization ensures that data from heterogeneous sources is standardized and ready for integration. Biomedical data exists in various formats (images, sequences, text, or tabular data), each with its own metadata structure. Harmonization involves 1) converting data into uniform formats (e.g., JSON, CSV), 2) standardizing terminologies using ontologies like UMLS or GO, and 3) integrating multi-modal data, such as combining omics data with imaging and clinical records, to provide a comprehensive training dataset.
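
A deliberately small sketch of steps 1) and 2) is shown below: two toy metadata tables with inconsistent column names and free-text tissue labels are mapped onto a shared schema and a controlled vocabulary. The tiny lookup table stands in for full ontology-backed, LLM-assisted curation.

```python
import pandas as pd

# Toy metadata from two sources, with inconsistent column names and free-text tissue labels.
src_a = pd.DataFrame({"sample": ["S1", "S2"], "Tissue": ["Liver", "hepatic tissue"]})
src_b = pd.DataFrame({"sample_id": ["S3"], "organ": ["LIVER"]})

# Hypothetical lookup; in practice the mapping comes from ontologies such as UBERON or UMLS.
ONTOLOGY_MAP = {"liver": "UBERON:0002107", "hepatic tissue": "UBERON:0002107"}

def harmonize(df, sample_col, tissue_col):
    """Rename columns to a shared schema and map tissue labels to ontology IDs."""
    out = df.rename(columns={sample_col: "sample_id", tissue_col: "tissue"})
    out["tissue"] = out["tissue"].str.lower()
    out["tissue_ontology_id"] = out["tissue"].map(ONTOLOGY_MAP)
    return out

harmonized = pd.concat([harmonize(src_a, "sample", "Tissue"),
                        harmonize(src_b, "sample_id", "organ")], ignore_index=True)
print(harmonized)
```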

3. Metadata Curation

Metadata serves as the blueprint of AI-ready datasets, allowing models to interpret data meaningfully. Domain-specific metadata curation involves multiple aspects. Datasets are labeled with detailed descriptions of experimental conditions, sample types, and study outcomes, and controlled vocabularies and standardized annotations are used to reduce ambiguity. Further, lineage information is incorporated to track data provenance and ensure transparency in downstream applications.
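
What a curated record might look like is sketched below. All field names and values are hypothetical and purely illustrative; a production schema would be defined by the curation team.

```python
# A hypothetical curated metadata record: descriptive fields, controlled-vocabulary
# annotations, and lineage information for provenance. All fields are illustrative.
curated_record = {
    "dataset_id": "DS-0001",
    "description": "Bulk RNA-seq of treated vs. untreated tumor biopsies",
    "organism": {"label": "Homo sapiens", "ontology_id": "NCBITaxon:9606"},
    "tissue": {"label": "liver", "ontology_id": "UBERON:0002107"},
    "assay": {"label": "RNA-seq", "ontology_id": "OBI:0001271"},
    "provenance": {
        "source": "in-house clinical study",
        "raw_files": ["run_42/fastq/"],
        "processing_pipeline": "alignment -> quantification -> QC",
        "curated_by": "domain-expert review",
    },
}
```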

4. Human-in-the-Loop Validation

While automation speeds up data processing, human-in-the-loop validation ensures accuracy and reliability. Essentially, human experts review annotated data to correct errors and refine context. They also validate model-ready datasets through pilot runs and identify gaps and inconsistencies. Moreover, iterative feedback loops between domain experts and data scientists optimize dataset quality in real time.
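
The sketch below shows one simple shape this can take: automated checks flag records that fail basic rules, and only those records are routed to a domain expert, whose corrections then feed back into the rules. The fields and rules are illustrative.

```python
REQUIRED_FIELDS = ["sample_id", "tissue"]

def needs_expert_review(record: dict) -> list[str]:
    """Return a list of issues; an empty list means the record passes automated checks."""
    issues = [f"missing {field}" for field in REQUIRED_FIELDS if not record.get(field)]
    if record.get("tissue") and not record.get("tissue_ontology_id"):
        issues.append("tissue label could not be mapped to an ontology term")
    return issues

records = [
    {"sample_id": "S1", "tissue": "liver", "tissue_ontology_id": "UBERON:0002107"},
    {"sample_id": "S2", "tissue": "hepatic tissue", "tissue_ontology_id": None},
]
# Only flagged records reach a human expert; their corrections refine the automated rules.
review_queue = [(r["sample_id"], issues) for r in records if (issues := needs_expert_review(r))]
print(review_queue)
```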

By following these steps, organizations can build robust AI-ready datasets that meet the high standards required for foundation models. This ensures foundation models are trained on comprehensive and accurate data.

Challenges in Creating AI-Ready Datasets

While the richness and diversity in data modalities contribute to model performance, their lack of consistent structure and fragmented nature are significant challenges. In many ways, while we have a lot to teach the AI models, we must be prepared to do so in the format they best understand.

1. Fragmented Public and Proprietary Data Sources

Biomedical research data is scattered across numerous public repositories, such as GenBank, PubMed, and clinical trial databases, as well as proprietary in-house datasets. These sources vary widely in terms of data quality, accessibility, and formatting. Integrating these disparate sources into a cohesive, AI-ready dataset is a time-consuming task that requires advanced tools and methodologies. Furthermore, proprietary data often has restricted access, adding a layer of complexity in ensuring compliance with data sharing and privacy regulations.

2. Inconsistent Data Processing Across Platforms

Different platforms and research groups often employ different methods for data collection, preprocessing, and storage. This lack of uniformity leads to inconsistencies that can result in ineffective model training. For example, gene expression data from 10x Genomics is typically stored as sparse matrices in formats like matrix.mtx, features.tsv, and barcodes.tsv, whereas Smart-seq data is presented as dense matrices in formats such as CSV or TSV, with full-length transcript coverage. These structural differences, along with varying normalization methods across platforms, make direct comparisons challenging. Managing these inconsistencies requires extensive preprocessing and harmonization efforts.
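
For instance, simply loading the two formats already requires different readers before any comparison is possible, as in the sketch below (the paths are placeholders, and the Smart-seq table is assumed to be stored as genes x cells).

```python
import scanpy as sc

# Two platforms, two on-disk representations of "a gene-expression matrix" (paths are placeholders).
tenx = sc.read_10x_mtx("data/10x_run/filtered_feature_bc_matrix/")   # sparse matrix.mtx + features.tsv + barcodes.tsv
smartseq = sc.read_csv("data/smartseq_counts.csv").T                 # dense genes x cells table, transposed to cells x genes

# Before the two can be compared or pooled for training, they still need matching gene identifiers,
# a shared normalization, and harmonized sample metadata.
print(tenx.shape, smartseq.shape)
```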

3. Lack of Standardized Metadata, Annotations, and Data Heterogeneity

Biomedical datasets often suffer from a lack of standardized metadata and annotations, which leads to ambiguities in critical information such as experimental conditions, sample types, and methodologies. The issue is compounded by the heterogeneity of data types, ranging from genomic sequences and proteomics profiles to imaging data and clinical records, across tissues and experimental conditions. Furthermore, genomic data from multiple species enriches training data but also increases heterogeneity, adding to the associated challenges. These inconsistencies not only complicate data integration but also impact the interpretability, reliability, and biological relevance of AI-ready datasets, making the harmonization of complex, multi-modal data a significant challenge.

The Elucidata Advantage

Given all these challenges, how can we optimize the creation of AI-ready datasets? Fortunately, Elucidata has the expertise and tools to overcome them.

Scalable Pipelines for Data Transformation

Our proprietary platform Polly is built around scalable data pipelines that automate the ingestion, preprocessing, and harmonization of vast and diverse datasets. These pipelines have the following features that equip them to handle multi-modal data from various sources, including clinical records, omics data, and imaging datasets:

  • Standardization Engines: Automated tools to normalize data formats, terminologies, and units.
  • Metadata Enrichment: Contextual annotation of datasets with domain-specific metadata.
  • Integration of Multi-Modal Data: Seamless harmonization of heterogeneous datasets, enabling foundation models to uncover complex relationships across modalities.

Harmonization Engines for Superior Dataset Quality

At the core of our offerings is Polly's robust harmonization engine, designed to manage the challenges of data fragmentation and inconsistency. This advanced technology integrates ETL (Extract, Transform, Load) pipelines with LLM-powered curation, processing approximately 5,000 samples and 10 TB of data per week. Key capabilities include:

  • Data Audit: Evaluates data quality and relevance from diverse sources, including public databases and proprietary datasets.
  • LLM-Based Harmonization: Ensures consistency by mapping metadata to standardized ontologies such as UMLS and GO, providing a unified data schema with 99.99% accuracy through automated and human-in-the-loop validation.
  • Metadata Curation: Organizes and annotates datasets with up to 30 customizable metadata fields, ensuring they are AI-ready.
  • Quality Assurance: Applies more than fifty quality assurance checks, including schema compliance, ontology alignment, and error detection, to maintain data integrity.
  • Atlas Repository: Stores harmonized data in a structured, AI-ready format, allowing seamless retrieval, visualization, and analysis.

Real-World Impact

Elucidata’s solutions have demonstrated measurable success in real-world applications. For example:

  • Improved Target Identification: Accelerated identification of drug targets through harmonized multi-omics datasets.
  • Enhanced Dataset Scalability: Harmonized millions of data points across modalities, ensuring models are trained on high-quality inputs.
  • Faster Insights: Reduced data preparation timelines significantly and helped research teams to meet critical project deadlines.

Conclusion

Technology has always found a way to counter the problems facing humankind. Just as the internet, initially developed for military and research communication, went on to reshape 21st-century life, foundation models, initially developed for language processing, stand poised to transform life science research. Poor-quality data stands in the way of life science research embracing that transformative potential. We, at Elucidata, offer the biomedical expertise and advanced technology necessary to convert unstructured data into AI-ready datasets.

Book a demo to understand how Elucidata can enable your organization to drive innovation and efficiency in R&D workflows.
