Challenges and Opportunities of Heterogeneous Biomedical Data

“In today’s Big Data world, companies rely on data scientists to extract insights from their vast, ever-expanding and diversified data sets… Many people think of data science as a job, but it’s more accurate to think of it as a way of thinking, a means of extracting insights through the scientific method.” - Thilo Huellmann

Biological Big Data: What It Means for Drug Discovery

In recent years, multi-omics studies spanning genomics, transcriptomics, proteomics, and related fields have helped scientists derive insights into the molecular mechanisms of disease progression. But each of these fields produces data that differs in format and structure.

When scientists try to answer a research question, they invariably have to deal with heterogeneous data types. Even a relatively straightforward transcriptomics analysis involves several kinds of molecules and entities, such as transcripts/mRNAs, pathways, and diseases. The problem compounds rapidly with multi-omics data, where an analysis spans entities like genes, mRNAs, proteins, metabolites, pathways, and diseases. It becomes vital to ensure that these heterogeneous but interconnected entities are linked properly to unlock the full potential of the data.

Importance of Biological Networks in Drug Discovery

According to Nilanjan Banerjee, an ML researcher at Elucidata, the industry has collectively moved toward a systems biology-focused approach, and with it, the fundamental aim of drug discovery has changed. “It now aims to first understand the complex biological systems within our cells and how their dysregulation leads to disease, and finally develop methods to selectively target these systems,” says Nilanjan.

Biological networks often contain multi-scale elements, ranging from molecular components to tissues, and both physical and abstract entities, ranging from proteins to phenotypic outcomes. Networks also contain diverse types of interactions between entities, such as inhibitions, activations, associations, and causal interactions.
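To make the idea of typed entities and typed interactions concrete, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: the class names (Entity, Interaction, BioNetwork), the interaction labels, and the placeholder entities (PROTEIN_A, PHENOTYPE_P) do not come from any particular database or library.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Entity:
    name: str
    kind: str  # e.g. "protein", "pathway", "tissue", "phenotype"


@dataclass(frozen=True)
class Interaction:
    source: Entity
    target: Entity
    kind: str  # e.g. "inhibits", "activates", "associated_with", "causes"


class BioNetwork:
    """A small directed network whose nodes and edges both carry a type."""

    def __init__(self):
        self._outgoing = defaultdict(list)

    def add(self, source, target, kind):
        edge = Interaction(source, target, kind)
        self._outgoing[source].append(edge)
        return edge

    def interactions(self, source, kind=None):
        # Return outgoing edges, optionally filtered by interaction type.
        edges = self._outgoing.get(source, [])
        return [e for e in edges if kind is None or e.kind == kind]


# Hypothetical placeholder entities, not real biology.
protein_a = Entity("PROTEIN_A", "protein")
protein_b = Entity("PROTEIN_B", "protein")
phenotype = Entity("PHENOTYPE_P", "phenotype")

net = BioNetwork()
net.add(protein_a, protein_b, "inhibits")
net.add(protein_a, phenotype, "associated_with")

inhibited = [e.target.name for e in net.interactions(protein_a, "inhibits")]
print(inhibited)
```

Keeping the type on both the node and the edge is what lets a physical entity (a protein) and an abstract one (a phenotypic outcome) sit in the same structure.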

A conventional way of gathering insights in the above scenario would require the following steps:

A gene knockdown problem (using a compound to inhibit a gene or a protein) would first involve finding relevant datasets or generating them in-house. These would likely be linked to one (or a small subset of) cell line models and diseases.
Using the disease of interest, scientists would find up-regulated or down-regulated pathways. These pathways would guide interpretation of disease genesis and which potential compounds can palliate the disease.
With a similar exercise on drugs, scientists can find the pathways a drug can up-regulate or down-regulate. Matching the drug’s pathways against the disease’s pathways can indicate which drug is best suited to work against the disease.
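The last step, fitting drug pathways against disease pathways, can be sketched as a simple scoring exercise. The sketch below assumes each pathway is reduced to a direction (+1 for up-regulated, -1 for down-regulated) and rewards a drug for pushing a pathway in the opposite direction to the disease. The function name reversal_score, the scoring rule, and all pathway and drug names are hypothetical; real analyses typically use enrichment statistics rather than a plain count.

```python
def reversal_score(disease, drug):
    """Score how well a drug opposes a disease's pathway changes.

    Both arguments map pathway name -> +1 (up-regulated) or -1 (down-regulated).
    Each shared pathway adds +1 if the drug reverses the disease direction
    and -1 if it pushes the same way.
    """
    shared = disease.keys() & drug.keys()
    return sum(-disease[p] * drug[p] for p in shared)


# Hypothetical placeholder data, not real measurements.
disease_pathways = {"PATHWAY_1": +1, "PATHWAY_2": -1, "PATHWAY_3": +1}
drug_effects = {
    "DRUG_X": {"PATHWAY_1": -1, "PATHWAY_2": +1},
    "DRUG_Y": {"PATHWAY_1": +1, "PATHWAY_3": -1},
}

scores = {
    name: reversal_score(disease_pathways, effects)
    for name, effects in drug_effects.items()
}
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked, scores)
```

Here DRUG_X reverses both pathways it shares with the disease, while DRUG_Y reverses one and reinforces another, so DRUG_X ranks first.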

This sounds simple in theory. In practice, making the deductive leaps listed above is a laborious and iterative process. Imagine tabulating millions of rows and trying to capture the relationships among them. This problem only gets worse at scale. Rather than working through the exercise above one case at a time, many bioinformatics scientists are turning to knowledge graphs (KGs). KGs help reduce the search space and provide a system-wide view of the underlying biology.
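As a rough illustration of how a KG narrows the search space, the sketch below stores facts as (subject, relation, object) triples and looks for chains that connect a drug to a disease. The triples, relation names, and helper functions (build_index, find_paths) are all invented for this example; production systems would normally rely on a graph database and its query language instead.

```python
from collections import defaultdict

# Hypothetical triples; the entities and relations are placeholders.
triples = [
    ("DRUG_X", "inhibits", "GENE_A"),
    ("DRUG_Y", "activates", "GENE_B"),
    ("GENE_A", "participates_in", "PATHWAY_1"),
    ("GENE_B", "participates_in", "PATHWAY_2"),
    ("PATHWAY_1", "dysregulated_in", "DISEASE_D"),
]


def build_index(triples):
    """Index triples by subject so neighbors can be looked up directly."""
    index = defaultdict(list)
    for subject, relation, obj in triples:
        index[subject].append((relation, obj))
    return index


def find_paths(index, start, goal, max_hops=3):
    """Depth-first search for chains of triples from start to goal."""
    paths = []
    stack = [(start, [])]
    while stack:
        node, path = stack.pop()
        if node == goal and path:
            paths.append(path)
            continue
        if len(path) >= max_hops:
            continue  # the hop limit also guarantees termination on cycles
        for relation, neighbor in index.get(node, []):
            stack.append((neighbor, path + [(node, relation, neighbor)]))
    return paths


index = build_index(triples)
candidates = {
    drug: find_paths(index, drug, "DISEASE_D") for drug in ("DRUG_X", "DRUG_Y")
}
for drug, paths in candidates.items():
    print(drug, paths)
```

Only DRUG_X reaches DISEASE_D, through GENE_A and PATHWAY_1, so a single traversal replaces the manual cross-referencing of separate drug, gene, and pathway tables.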

Channeling KGs for Drug Discovery

KGs containing drug, disease, and gene entities (Source: paper by Sosa et al.)

“Knowledge graphs help in integrating information from across the literature of pharmacology, genetics, and pathology.”

KGs have been around for more than three decades, but suitable areas of application have emerged only recently. They grew out of the need for a framework to connect metadata across the World Wide Web. Graph theorists believed that representing data as a graph makes it logically explicit and intuitive. Today, KGs have become ubiquitous, extending their reach from e-commerce giants to life sciences startups. In the life sciences especially, where heterogeneity is a major challenge, KGs are a promising way to extract insights from information.

A few demonstrated benefits of knowledge graphs are:

Early-stage drug discovery and repurposing, identification of pathogenic genes and drug targets.
Addressing data heterogeneity in multimodal data spanning transcriptomics, proteomics, genomics, phenomics, drug pharmacology, chemistry, and ontological information.
Borrowing analytical approaches from graph theory and machine learning to analyze biomedical data.

It is important to remember that knowledge graphs can be used not only for querying but also for knowledge augmentation. Research in life sciences can be siloed and over-specialized, so researchers might miss valuable connections between concepts that aren’t obviously related. Knowledge graphs have shown promise in identifying such connections between heterogeneous data points.
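One simple way to picture knowledge augmentation is to propose links that are not yet in the graph but are suggested by shared neighbors. The sketch below scores each unlinked drug-disease pair by the Jaccard overlap between the drug's target genes and the disease's associated genes. This is a deliberately naive stand-in for the link-prediction methods used in practice; the function names, the threshold of 0.3, and all entities are assumptions made for illustration.

```python
from itertools import product

# Hypothetical placeholder data, not real associations.
drug_targets = {
    "DRUG_X": {"GENE_A", "GENE_B"},
    "DRUG_Y": {"GENE_C"},
}
disease_genes = {
    "DISEASE_D": {"GENE_A", "GENE_B", "GENE_D"},
    "DISEASE_E": {"GENE_C", "GENE_D"},
}
known_links = {("DRUG_X", "DISEASE_D")}


def jaccard(a, b):
    """Size of the intersection divided by size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def suggest_links(drug_targets, disease_genes, known_links, threshold=0.3):
    """Propose drug-disease pairs that share genes but are not yet linked."""
    suggestions = []
    for drug, disease in product(drug_targets, disease_genes):
        if (drug, disease) in known_links:
            continue
        score = jaccard(drug_targets[drug], disease_genes[disease])
        if score >= threshold:
            suggestions.append((drug, disease, score))
    return sorted(suggestions, key=lambda s: s[2], reverse=True)


suggestions = suggest_links(drug_targets, disease_genes, known_links)
print(suggestions)
```

In this toy data the only suggestion is DRUG_Y with DISEASE_E, a pair that shares GENE_C but has no recorded link. Any such suggestion is a hypothesis to test, not a finding.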