If you’re working towards answering empirical questions in the disease area of your interest, you’re like most of us – up in knots tackling a wave of large-scale biomedical data! High-throughput sequencing environments constantly generate structured and semi-structured data that existing database management systems struggle to handle – data that you then encounter on an everyday basis.
Yes, data is powerful. But why is it so hard to get the ball rolling on the analysis you’ve had in mind for a while? Who do you call? A Ghostbusters equivalent of the omics world doesn’t exist, yet. Let’s admit it, your life was hard BEFORE you set eyes on that data.
Let’s take an example. Your lab has started a project on COVID-19. While the team comprises highly trained biologists and bioinformaticians, no one has worked on this particular problem before. Your team decides to take a public data-centric approach. Here are the first questions you ask:
What are the factors that influence COVID-19 progression? What are some datasets that have studied COVID-19?
Going on a Wild Goose Chase: Finding Valuable Data that Drives Research
The first in a line of hurdles is already here: finding a dataset is not a matter of effort but rather of serendipity. Let’s say you find an interesting dataset the way many a scientist has – a colleague who knew of your interest in the disease forwarded you a paper! Or you go to a public data repository like Gene Expression Omnibus (GEO) and find that getting this information could take anywhere from 2 days to a week. Let’s not forget that this dataset is only the first of many needed to arrive at a reasonable set of hypotheses that will guide the IP for your organization.
The Problem With Repositories like GEO:
- Often, datasets on GEO do not have a counts file. This problem extends across databases like TCGA, where usable, clean, normalized data matrices are predominantly unavailable. Making sure that you have 10 normalized, scaled datasets can take anywhere from 2-7 days.
- A second important requirement is to make sure that the rows (the molecular features measured – genes, proteins, metabolites) are harmonized with a standardized ontology. The estimated time to plan this ontology ranges from 2-7 days depending on the skill level of the bioinformatician.
- Now that you have the data matrix and row identifiers, you need accurate metadata about each of the samples: the tissue of origin, the treatment condition, the name of the cell line, and many other pieces of information. Assembling the metadata depends on much-needed but hard-to-train skills in SQL and a priori knowledge of relevant packages, and extends the process by another 2-7 days (a rough sketch of these first three steps follows this list).
- Processing the public data using a standardized pipeline can take anywhere from 10-15 days with a significant investment of computational resources.
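To make the first three steps concrete, here is a minimal sketch of the manual wrangling they involve, using the GEOparse and mygene Python packages. The accession, file names, and mapping logic are illustrative assumptions, not a prescribed pipeline.

```python
# A rough sketch of the manual wrangling above (assumes
# `pip install GEOparse mygene pandas numpy`). GSE147507 is just an
# example accession; your counts file and columns will differ.
import GEOparse
import mygene
import numpy as np
import pandas as pd

# 1. Fetch the series and its sample-level metadata
#    (tissue, treatment, cell line, ...).
gse = GEOparse.get_GEO(geo="GSE147507", destdir="./geo_cache")
metadata = gse.phenotype_data  # one row per sample (GSM)

# 2. Counts usually live in a supplementary file that you must locate
#    and parse yourself; assume it was saved as counts.tsv with genes
#    as rows and samples as columns.
counts = pd.read_csv("counts.tsv", sep="\t", index_col=0)

# 3. Normalize and scale: counts-per-million, then log2-transform.
cpm = counts.div(counts.sum(axis=0), axis=1) * 1e6
log_cpm = np.log2(cpm + 1)

# 4. Harmonize row identifiers: map gene symbols to Ensembl gene IDs.
mg = mygene.MyGeneInfo()
hits = mg.querymany(log_cpm.index.tolist(), scopes="symbol",
                    fields="ensembl.gene", species="human")
symbol_to_ensembl = {h["query"]: h["ensembl"]["gene"]
                     for h in hits
                     if isinstance(h.get("ensembl"), dict)}
log_cpm = log_cpm.rename(index=symbol_to_ensembl)
```

Every one of these steps fails in dataset-specific ways, which is why the per-step estimates above compound quickly across 10 datasets.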
Let’s say you and your team try to analyze 10 datasets from GEO. Just to access those 10 datasets, you have now spent approximately a month – and that’s assuming every dataset you used is truly valuable to your question. Factor in the challenge of finding the right datasets in the first place, and you and your team are betting your time on a problem you just shouldn’t be solving.
But What About My Proprietary Raw Data?
If you have received millions of reads back from your latest NGS experiment, chances are you’re overwhelmed by the substantial amount of time and compute power that goes into preprocessing and normalizing raw omics data. If you plan to conduct multi-omics analysis with your data, another mammoth task presents itself – standardizing your dataset’s identifiers to match those of other datasets, without which clustering, visualization, and functional characterization become a challenge. Uncurated data, whether proprietary or public, equally hampers efficient data analysis.
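As a toy illustration of why identifier standardization is the linchpin: until two matrices share an identifier space, there is nothing to join on. The values and accessions below are made up for the sketch.

```python
import pandas as pd

# Toy stand-ins for a proprietary and a public expression matrix
# (made-up values; rows are features, columns are samples).
proprietary = pd.DataFrame({"s1": [5.1, 2.3], "s2": [4.8, 2.9]},
                           index=["ENSG00000141510",    # TP53
                                  "ENSG00000012048"])   # BRCA1
public = pd.DataFrame({"gsm1": [6.0, 1.9], "gsm2": [5.5, 2.2]},
                      index=["TP53", "BRCA1"])          # gene symbols

# With mismatched identifiers the intersection is empty -- no joint
# clustering or visualization is possible across the two datasets:
print(len(proprietary.index.intersection(public.index)))  # 0

# After harmonizing to one identifier space (e.g., via the mygene
# mapping sketched earlier), the join is trivial:
public = public.rename(index={"TP53": "ENSG00000141510",
                              "BRCA1": "ENSG00000012048"})
combined = pd.concat([proprietary, public], axis=1)
print(combined.shape)  # (2, 4) -- ready for downstream analysis
```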
Now, in most cases, to identify closely related datasets, you need to understand the key concepts covered by a paper. Let’s say, after poring through the paper, you identify that it studies COVID-19 in lung tissue from humans. A quick Google search is likely to retrieve thousands of results comprising studies dealing with COVID-19 or lung tissue with no clear relevance to you. This is where ontologies step in as your best friends.
Why Are Ontologies Important, Again?
Let’s step back and ask – what are ontologies? Say you are looking at the image of a dog. An academic would call it Canis lupus familiaris. A dog trainer would simply call it a dog. If you like looking at dog pictures on the internet, you probably call them puppers. How does an algorithm return the right picture to you? The simple answer is metadata. Irrespective of who took the picture (the academic, the dog trainer, the puppy enthusiast), every picture containing a dog must carry the same standard metadata identifier. Such a standardized terminology for describing dog pictures is called an ontology.
A similar problem exists in biological data, except at a grander scale. Let’s say we have two datasets studying COVID-19 in lung tissue from humans. Author A might describe their dataset as “SARS-CoV-2 infection in bronchoalveolar cell lines” while Author B might describe theirs as “COVID-19 infection in normal lung tissue”. Unless you know the exact search terms, you might never find and group these two papers without significant effort.
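Here’s a toy sketch of what ontology tagging buys you. The synonym lists and matching logic are assumptions for illustration; MONDO:0100096 and UBERON:0002048 are the actual ontology IDs for COVID-19 and lung, respectively.

```python
# Map free-text dataset descriptions to shared ontology IDs so that
# differently worded datasets become findable with one query.
SYNONYMS = {
    "MONDO:0100096": {"covid-19", "covid-19 infection",
                      "sars-cov-2 infection"},          # COVID-19
    "UBERON:0002048": {"lung", "normal lung tissue",
                       "bronchoalveolar"},              # lung
}

def tag(description: str) -> set[str]:
    """Return the ontology IDs whose synonyms appear in free text."""
    text = description.lower()
    return {oid for oid, syns in SYNONYMS.items()
            if any(s in text for s in syns)}

a = tag("SARS-CoV-2 infection in bronchoalveolar cell lines")
b = tag("COVID-19 infection in normal lung tissue")
# Both descriptions resolve to the same two tags, so one ontology
# query retrieves both datasets:
print(a & b)  # {'MONDO:0100096', 'UBERON:0002048'}
```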
The Plot Thickens
Now let’s say you have found an interesting set of results that shows certain similarities between COVID-19 infection and a rare autoimmune disease. For most researchers, especially during current times, gaining access to samples is an uphill battle, with multiple hoops around sample procurement, data processing, and analysis. However, there is a good chance that many legacy public datasets have already studied this condition and contain relevant patient samples. How cool would it be if you could find those exact samples? Once again, different researchers denote samples in arbitrary ways – Author A might label normal samples as “Ctrl/Lung” while Author B might label them as “Normal lung tissue”. We run into the same old ontology problem.
The Big Solve
Access to curated data. On Polly, over 50,000 datasets from GEO are available in ready-to-use formats that can be imported into an analysis environment – on Polly or on the client’s infrastructure – within minutes. What’s more, we are regularly adding new datasets released on GEO, continuously increasing the number of available and usable datasets. More importantly, we empower you and your team to filter relevant datasets from this massive store within minutes. (Read about how users filter datasets of relevance to them here.)
Polly identifies biological terminology in natural language with accurate context, even when datasets carrying the same information are labeled with different identifiers. Further, Polly attaches a standard ontology to each biological entity that our model recognizes. This means that the datasets published by Author A and Author B will share multiple ontologies that allow them to be grouped together on Polly. Consequently, you will be able to filter these datasets based on the disease and tissue ontologies of interest. What’s more, irrespective of which database you use on Polly – LINCS, GTEx, or GEO – disease, tissue, and other ontologies remain truly consistent, making relevant datasets immediately searchable.
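Conceptually, consistent ontology tags reduce cross-database dataset discovery to a simple filter. The table below is a hypothetical stand-in for a curated metadata index – the accessions, columns, and second disease ID are illustrative, not Polly’s actual schema or API.

```python
import pandas as pd

# Hypothetical curated index of datasets pooled from several sources.
datasets = pd.DataFrame({
    "source":    ["GEO", "GEO", "LINCS"],
    "accession": ["GSE_A", "GSE_B", "LINCS_X"],
    "disease":   ["MONDO:0100096",   # COVID-19
                  "MONDO:0005015",   # diabetes mellitus (illustrative)
                  "MONDO:0100096"],
    "tissue":    ["UBERON:0002048",  # lung
                  "UBERON:0002048",
                  "UBERON:0000955"], # brain
})

# One filter finds every COVID-19 lung dataset, regardless of which
# repository it originally came from:
hits = datasets[(datasets["disease"] == "MONDO:0100096") &
                (datasets["tissue"] == "UBERON:0002048")]
print(hits["accession"].tolist())  # ['GSE_A']
```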
Curation for Proprietary Data
Polly normalizes proprietary data schemas for you, so you can focus on analysis instead of spending weeks cleaning up raw omics data yourself. Get access to harmonized gene and molecule names in a standard format compatible with the other datasets you use. You may also request bespoke study- and sample-level tags with consistent ontologies that allow you to analyze your data with confidence.
Polly’s curation model goes one step further. Every sample in a dataset is identified with multiple ontologies that can be used to search for those samples. This means that every Normal, Lung, NHBE-cell-line sample ever studied in any dataset on Polly would be searchable with all three ontologies. You could potentially build a global normal-lung dataset from multiple uniformly processed studies to use as a control across analyses. You could then compare it to lung cell lines treated with rapamycin across multiple datasets on Polly, once again searchable using sample-level ontologies. You can find samples with the relevant disease ontology using a simple sample-level search on the GEO data lake (read more about querying here).
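In the same spirit, sample-level tags turn cohort-building into a query. A minimal sketch follows; the table, accessions, and tag values are assumptions for illustration.

```python
import pandas as pd

# Hypothetical sample-level metadata spanning two studies.
samples = pd.DataFrame({
    "dataset":   ["GSE_A", "GSE_A", "GSE_B", "GSE_B"],
    "sample":    ["gsm1", "gsm2", "gsm3", "gsm4"],
    "condition": ["normal", "COVID-19", "normal", "rapamycin"],
    "tissue":    ["UBERON:0002048"] * 4,  # lung
})

# Assemble a "global normal lung" control cohort across studies; the
# matching expression columns can then be pulled and combined.
controls = samples[(samples["condition"] == "normal") &
                   (samples["tissue"] == "UBERON:0002048")]
print(controls[["dataset", "sample"]].to_string(index=False))
```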
The last two decades have seen unprecedented growth in the number of public and proprietary multi-omics datasets. Additionally, large-scale repositories of public data, e.g., LINCS, GTEx, and DepMap, are now available to a wide variety of users. Despite significant advances in data availability, integrative analysis of public and proprietary multi-omics data has remained elusive. In short, even though the data is available, it remains highly inaccessible for ML-based workflows in molecular biology. Polly’s curation efforts are a first step towards helping the scientific community take a data-first approach to AI/ML-driven drug discovery.