From Fragmentation to Integration: The Role of Structured Repositories in Biomedical Research

Atlas: Guide to Complex Landscapes

The word Atlas has a fascinating etymology. In Greek mythology, a Titan called Atlas was condemned to hold up the heavens above the Earth for eternity as punishment for his role in the war against Zeus and the Olympian Gods. Over time, this myth evolved into the image of Atlas carrying the Earth on his shoulders. In the late 16th century, cartographers began referring to collections of maps as atlases because, like the Titan, they carried representations of the entire world. Today, an atlas refers to a bound collection of maps, charts, and tables that detail various aspects of the physical world, such as political boundaries, topographical features, and climatic zones.

Similarly, within the realm of biomedical research, platforms like Elucidata's Atlas on Polly function as structured repositories that consolidate vast amounts of multi-modal biomedical data. These repositories integrate diverse datasets, including genomic, transcriptomic, proteomic, and clinical information, into a unified framework. By harmonizing and organizing this data, biomedical atlases offer researchers a comprehensive resource to explore complex biological systems, identify patterns, and derive insights that can drive advancements in healthcare and medicine.

In both geography and biomedical research, an "atlas" is a comprehensive collection of information that guides exploration and understanding within a specific domain. Atlases serve as essential navigation tools in their respective fields, transforming fragmented pieces of data into coherent, accessible, and usable knowledge.

Data Challenges in Biomedical Research

Fragmented Data

Biomedical research generates vast amounts of data across genomics, proteomics, clinical trials, and imaging studies. Additionally, healthcare is undergoing a revolution with the push for open and personalized medicine, resulting in massive electronic health records (EHRs). However, these datasets are often stored in isolated repositories with incompatible formats, varying standards, and different access protocols. This fragmentation makes data integration difficult and time-consuming, forcing researchers to spend excessive time cleaning, formatting, and harmonizing information before meaningful analysis can begin. In fact, one of our clients told us that, before our intervention, this process often took up a whole day. The lack of standardization limits collaboration, slows discoveries, and increases errors due to inconsistent metadata and manual curation.

Heterogeneous and Unstructured Data

Biomedical data comes in multiple formats: structured (e.g., patient records, genomic sequences), semi-structured (e.g., XML-based clinical trial reports), and unstructured (e.g., physician notes, imaging scans, and research publications). Many valuable datasets, such as handwritten clinical notes or histopathology images, lack standardized structures, making automated processing and integration challenging. AI/ML models, which depend on clean and well-annotated data, struggle with unstructured inputs, leading to incomplete or biased outcomes. Without a framework to standardize and integrate these diverse data types, a significant portion of biomedical information remains underutilized.
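
As a rough illustration of what standardizing heterogeneous inputs involves, the sketch below maps records from two hypothetical sources, each with its own field names and units, onto one common schema. The source names, field names, and the `to_common_schema` helper are all assumptions made for this example and do not describe any particular platform.

```python
# Hypothetical field mappings: source-specific field name -> common schema name.
FIELD_MAPS = {
    "hospital_a": {"pt_id": "patient_id", "age_yrs": "age_years", "dx": "diagnosis"},
    "lab_b": {"PatientID": "patient_id", "AgeMonths": "age_months", "Diagnosis": "diagnosis"},
}


def to_common_schema(record: dict, source: str) -> dict:
    """Rename source-specific fields and convert ages to years."""
    mapping = FIELD_MAPS[source]
    out = {mapping[key]: value for key, value in record.items() if key in mapping}
    # Normalize units so every record reports age in years.
    if "age_months" in out:
        out["age_years"] = out.pop("age_months") / 12
    out["source"] = source  # keep provenance alongside the harmonized fields
    return out


records = [
    to_common_schema({"pt_id": "A-17", "age_yrs": 54, "dx": "melanoma"}, "hospital_a"),
    to_common_schema({"PatientID": "B-03", "AgeMonths": 648, "Diagnosis": "Melanoma"}, "lab_b"),
]
```

After this step both records share the same keys, so they can be compared or loaded into one table. Free-text values such as the diagnosis still differ in spelling and case, which is a separate harmonization problem.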

High Volume and Complexity of Data

With advancements in high-throughput technologies, biomedical data generation has reached petabyte-scale volumes. Multi-omics experiments, high-resolution imaging, and longitudinal patient records contribute to the growing complexity of research data. A robust infrastructure for managing, processing, and retrieving such vast datasets is imperative to maximize opportunities for discovery.

The Solution: Structured Repositories

How can we resolve the challenges of fragmented and heterogeneous data, locked away in specialized silos?

Structured repositories are unified data platforms for storing, curating, and integrating biomedical data. They make data findable, accessible, interoperable, and reusable; that is, they adhere to the FAIR principles.^[1] These repositories break down data silos, enhance interoperability across institutions and disciplines, and enable seamless data retrieval and analysis.

This can be best understood with an example. Consider clinical trial data, which are useful for biomedical research as well as healthcare. The knowledge of clinical trial results will inform both future biomedical research pipelines and personalized treatment plans. Yet, if these data are unavailable to either group, then scientific progress and real-world applications are hindered. This is where having a data repository specific to that particular domain would help, by ensuring that data is stored and curated to the standards required for both applications. If all that is known about a specific domain is collected within a unified repository, the benefits of scientific research and discovery are maximized.

Enabling Interdisciplinary Research and Collaboration

Modern life sciences rely on the integration of computational biology, AI, and experimental research. Structured repositories centralize diverse datasets, making it convenient for biologists, data scientists, and engineers to collaborate, analyze, and generate insights without struggling with fragmented data.

Increasing Scientific Reproducibility

Reproducibility in research depends on well-documented, shareable, and standardized data. Structured repositories ensure that datasets remain consistently annotated and accessible, allowing researchers to validate findings across different studies and institutions.

Ensuring Compliance with Data Security and Privacy Standards

Biomedical data are highly susceptible to breaches, and regulations such as HIPAA and GDPR, along with other ethical guidelines, require stringent data security measures. If data systems are disorganized, compliance with these requirements can become difficult. Structured repositories, however, incorporate robust security measures such as role-based access controls, metadata tracking, and audit logs to facilitate compliance while enabling secure data sharing at scale.
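
To make role-based access control and audit logging concrete, here is a minimal sketch. The role names, the permission table, and the `access` function are hypothetical, and a production system would also need authentication, tamper-resistant log storage, and much more.

```python
from datetime import datetime, timezone

# Hypothetical roles and the actions each one may perform.
ROLE_PERMISSIONS = {
    "curator": {"read", "write"},
    "analyst": {"read"},
}

# In-memory audit trail; a real system would persist this securely.
audit_log: list[dict] = []


def access(user: str, role: str, action: str, dataset_id: str) -> bool:
    """Check a role-based permission and record the attempt in the audit log."""
    allowed = action in ROLE_PERMISSIONS.get(role, set())
    audit_log.append({
        "time": datetime.now(timezone.utc).isoformat(),
        "user": user,
        "role": role,
        "action": action,
        "dataset": dataset_id,
        "allowed": allowed,
    })
    return allowed
```

Every attempt is logged, including denied ones, because a record of refused requests is often as useful to an auditor as a record of granted ones.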

Powering AI/ML-Based Scientific Discovery

AI-driven drug discovery, biomarker identification, and disease modeling rely on clean, structured, and interoperable data. Structured repositories curate and harmonize datasets, ensuring that AI models are trained on high-quality inputs, leading to more accurate and reliable predictions.

Scalability for Large-Scale Data Management

With the increasing volume and complexity of biomedical data, structured repositories provide cloud-based, scalable infrastructure. This allows researchers to efficiently manage, retrieve, and analyze vast datasets, reducing time spent on data wrangling and maximizing the potential for discoveries.

Types of Structured Repositories in Biomedical Research

In principle, structured repositories can be categorized into four distinct types^[2], each serving unique purposes and catering to specific research needs:

Domain-Specific Repositories: These repositories focus on particular data types or research disciplines. For instance, they may specialize in storing protein structures, nucleotide sequences, or clinical data related to specific fields like oncology or neurology. By concentrating on specialized data, domain-specific repositories become central hubs for researchers interested in these areas.
Generalist Repositories: These repositories accept data across various types, formats, and disciplines. They are inclusive, accommodating data regardless of its content or the researcher's institutional affiliation. The National Institutes of Health (NIH) has established agreements with several generalist repositories under the Generalist Repository Ecosystem Initiative (GREI).
Project-Specific Repositories: These are tailored to store data generated from specific projects or collaborations. An example is the NIH's "All of Us" initiative. Such repositories enable data sharing and reuse by making project-specific data accessible to other researchers.
Institutional Repositories: These repositories are designed to store data primarily created by members of a specific institution or a consortium of institutions, including principal investigators, postdoctoral researchers, and students. Depending on the institution's mission, these repositories may function similarly to domain-specific or generalist repositories, collecting data from various projects within the institution.

Each type of repository offers distinct advantages and services:

Community Engagement: Domain-specific repositories often engage closely with their research communities, relying on advisory boards of subject matter experts to ensure data alignment with field standards. In contrast, generalist repositories serve a broader audience and may have less intensive user engagement at the content level.
Curation: All repositories employ data curation processes, but the extent varies. Domain-specific repositories may implement rigorous curation to adhere to field-specific standards, enhancing data interoperability and reusability. Generalist repositories focus on metadata standardization to improve data findability and accessibility.
Preservation: Ensuring long-term data preservation is a shared responsibility among all repository types. This includes adapting to evolving user needs, technological advancements, and format changes to maintain data usability over time.
User Diversity: Generalist repositories are designed to accommodate a diverse user base across multiple disciplines and skill levels, providing resources for both introductory learning and advanced research. Domain-specific and project-specific repositories cater to specialized communities seeking detailed information within their fields.
Services: While all repositories offer core services such as data ingestion, management, and access, domain-specific repositories may provide additional tools like analysis and visualization resources tailored to their specific research communities. Generalist repositories may offer broader services to support a wide range of data types and user needs.

Elucidata's Atlas: A Modality-Specific Approach

At Elucidata, we strive to develop solutions that power AI-driven biomedical discoveries. Our specialized and generalist data repositories, called Atlas, are designed to address challenges in storing, harmonizing, and analyzing large-scale biomedical data. Each Atlas functions as a structured repository that integrates clinical and molecular datasets while maintaining data integrity, enabling both human exploration and AI-driven analysis.

Atlas is built on Polly’s robust data infrastructure and leverages automated processes to ensure continuous updates and seamless interoperability. The Ingestion Engine scans new research publications and integrates emerging datasets, keeping repositories up to date. The Harmonization Engine maps diverse datasets to standardized ontologies, allowing cross-study comparisons and ensuring consistent metadata annotation and data quality. Polly Insights further enhances utility by providing analytical tools for extracting meaningful patterns from large datasets.
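
The idea of mapping metadata to standardized ontologies can be sketched in a few lines. This is not how the Harmonization Engine is implemented. It is a toy lookup under stated assumptions: the synonym table, the `harmonize_tissue` function, and the term IDs are all invented placeholders.

```python
# Hypothetical synonym table: free-text label -> (term ID, preferred label).
# The IDs below are placeholders, not real ontology identifiers.
TISSUE_ONTOLOGY = {
    "liver": ("TISSUE:0001", "liver"),
    "hepatic tissue": ("TISSUE:0001", "liver"),
    "lung": ("TISSUE:0002", "lung"),
    "pulmonary tissue": ("TISSUE:0002", "lung"),
}


def harmonize_tissue(raw_label: str) -> dict:
    """Map a free-text tissue label to a standardized term, or flag it for curation."""
    key = " ".join(raw_label.lower().split())  # lowercase and collapse whitespace
    if key in TISSUE_ONTOLOGY:
        term_id, label = TISSUE_ONTOLOGY[key]
        return {"raw": raw_label, "term_id": term_id, "label": label, "needs_review": False}
    # Unknown labels are kept and flagged instead of being guessed at.
    return {"raw": raw_label, "term_id": None, "label": None, "needs_review": True}
```

Once two studies that say "hepatic tissue" and "Liver" resolve to the same term ID, their samples can be compared directly, which is what makes cross-study analysis possible.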

An Atlas can be either unimodal (focusing on a single data type) or multi-modal (integrating diverse data types such as genomics, proteomics, imaging, and EHRs). Each Atlas is further categorized as domain-specific, storing data relevant to a particular field (e.g., oncology, neurology), or project-specific, tailored to a particular study or collaboration.

For instance, a leading precision oncology company leveraged Elucidata’s Drug Atlas to streamline its high-throughput drug screening process. Previously, the company faced challenges with fragmented data storage, inconsistent nomenclature, and inefficient manual workflows. By implementing Elucidata’s Drug Atlas, they automated data ingestion, harmonized metadata, and significantly improved data findability across experiments. This reduced time spent on data wrangling by approximately 1,000 hours, made comparative analyses seven times faster, and enabled researchers to extract insights across multiple drug-cell line combinations with just a few queries.

Similarly, a California-based genomics-driven pharma company utilized a Public Atlas to enhance target identification for immunological diseases and cancer. The company faced hurdles in leveraging publicly available transcriptomics data due to incomplete metadata and a lack of standardized ontologies. By integrating and harmonizing large-scale transcriptomic datasets from sources like Gene Expression Omnibus (GEO) and The Cancer Genome Atlas (TCGA), Elucidata built a Pan-Cancer Immune Atlas, which facilitated the identification of a novel immunology target in just six months, a process that traditionally takes 2-3 years. Additionally, this structured repository enabled a $3 million cost reduction and freed up 2,000 hours annually for R&D and bioinformatics personnel.

These case studies show the tangible benefits of Atlas in accelerating biomedical research and drug discovery. By improving data accessibility and standardization, structured repositories lead to faster, more cost-effective research outcomes.

The Atlas Creation Process

Data Ingestion – Collects and standardizes genomic, transcriptomic, proteomic, and clinical data from diverse sources.
Data Harmonization – Curates and annotates data, aligning with FAIR principles to ensure accessibility and interoperability.
Data Organization – Structures the harmonized data into a flattened model, integrating metadata, treatments, and outcomes for efficient analysis.
Quality Assurance – Implements rigorous validation steps to ensure accuracy, consistency, and reliability.
Deployment & Access – Publishes the validated Atlas on Polly’s cloud-based platform, making it easily queryable and shareable.
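
The organization and quality assurance steps above can be sketched as follows. This is a simplified illustration, not Elucidata's implementation: the nested input layout, the `flatten` and `quality_check` helpers, and the list of required fields are assumptions chosen for the example.

```python
# Hypothetical set of fields every flattened row must carry.
REQUIRED_FIELDS = ("dataset_id", "sample_id", "treatment", "outcome")


def flatten(dataset: dict) -> list[dict]:
    """Turn one nested dataset into flat rows, one per sample."""
    rows = []
    for sample in dataset["samples"]:
        rows.append({
            "dataset_id": dataset["dataset_id"],  # dataset-level metadata is repeated per row
            "disease": dataset.get("disease"),
            "sample_id": sample["sample_id"],
            "treatment": sample.get("treatment"),
            "outcome": sample.get("outcome"),
        })
    return rows


def quality_check(rows: list[dict]) -> list[str]:
    """Return a message for every required field that is missing or empty."""
    problems = []
    for index, row in enumerate(rows):
        for field in REQUIRED_FIELDS:
            if row.get(field) in (None, ""):
                problems.append(f"row {index}: missing {field}")
    return problems


example = {
    "dataset_id": "DS-1",
    "disease": "melanoma",
    "samples": [
        {"sample_id": "S1", "treatment": "drug_x", "outcome": "responder"},
        {"sample_id": "S2", "treatment": "drug_x"},  # outcome not recorded
    ],
}
rows = flatten(example)
problems = quality_check(rows)  # reports the missing outcome for sample S2
```

Flattening repeats dataset-level metadata on every row, which costs some storage but lets a single filter answer questions that span metadata, treatments, and outcomes.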

What sets Elucidata’s Atlas apart?

Building an Atlas presents unique challenges, particularly in maintaining consistent data standards across diverse datasets. Depending on the end-user application, we have to ensure that the data is of the highest quality and that no required information is missing. For example, in an EHR-focused Atlas built to evaluate triage-specific data, missing triage-level information could limit clinical insights. Addressing this requires either improving data collection and curation or standardizing the dataset by filling in the missing data. Similarly, in multi-modal datasets, ensuring uniform quality metrics across different data types is critical.

One of the defining features of Elucidata’s Atlas is its dynamic architecture. New data can be continuously added without altering existing records. Features can also be incrementally integrated, similar to adding new layers of information to a map (roads, traffic, landmarks, etc.). If necessary, global changes can be applied across datasets while preserving integrity.
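
One way to picture this kind of architecture is an append-only store with separate feature layers, much like overlays on a map. The toy class below is only an analogy in code. The `AppendOnlyAtlas` name and its methods are hypothetical and say nothing about how Atlas is actually built.

```python
class AppendOnlyAtlas:
    """Toy store: base records are never overwritten; features are added as layers."""

    def __init__(self) -> None:
        self._records: dict[str, dict] = {}
        self._layers: dict[str, dict[str, object]] = {}

    def add_record(self, record_id: str, data: dict) -> None:
        """Add a new record; refuse to overwrite an existing one."""
        if record_id in self._records:
            raise ValueError(f"{record_id} already exists; records are immutable")
        self._records[record_id] = dict(data)  # copy so later caller edits do not leak in

    def add_layer(self, name: str, values: dict[str, object]) -> None:
        """Attach a feature layer keyed by record ID without touching base records."""
        unknown = set(values) - set(self._records)
        if unknown:
            raise KeyError(f"unknown record IDs: {sorted(unknown)}")
        self._layers[name] = dict(values)

    def view(self, record_id: str) -> dict:
        """Return the base record merged with every layer that covers it."""
        merged = dict(self._records[record_id])
        for name, values in self._layers.items():
            if record_id in values:
                merged[name] = values[record_id]
        return merged
```

For example, after `add_record("S1", {"tissue": "liver"})` and `add_layer("qc_score", {"S1": 0.9})`, calling `view("S1")` returns both fields while the stored base record stays unchanged.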

Atlas is also highly user-friendly: users can ask their questions in natural language and receive answers in the form of detailed tables, graphs, and summaries. We are also hard at work on another exciting feature, which will enable users to create their own Atlas.

Conclusion

The journey from fragmentation to integration in biomedical research is facilitated by the adoption of structured repositories. At Elucidata, we integrate siloed datasets into structured, AI-ready repositories. Atlas streamlines cohort building, facilitates multi-omics research, and enhances predictive modeling for biomedical applications. It serves as a foundation for data-driven discovery, empowering researchers to extract novel insights and drive precision medicine forward. By centralizing and harmonizing diverse datasets, structured repositories overcome the limitations of traditional data silos, enabling the comprehensive analyses that are essential for advancing our understanding of complex biological systems. Contact us today to learn more about how we can build an Atlas for your next biomedical research project.