(Genome → Transcriptome → Proteome → Metabolome)
Cell biology involves many layers of regulation and feedback loops that work together to give rise to the phenotype. When researchers capture the genome, transcriptome, proteome, or metabolome of a cell, what they are really capturing is a snapshot of the cell, much like a photograph. In practice this snapshot is an average over many cells; single-cell genomics and transcriptomics are the exception.
A photograph is very easy to read, or is it? Think about looking at a family photograph from your childhood: it brings out a lot of emotion and carries a lot of context. The same photograph will evoke little or no emotion in a stranger.
The data that a researcher collects about a cell is a lot worse than a photograph. Most of the time it contains only part of the picture: a genome, a transcriptome, or one of the other layers. Even if a researcher manages to get two parts of the picture, say the transcriptome and the metabolome, it is very hard to join them and draw inferences. There is one more complication: a cell’s biology is more complicated than even a video (if you manage to get one). Everything is literally interacting with everything else, yet within that chaos there is a perfect order. It is magnificent, beautiful, treacherous, and horrifying at the same time. It is music and dance, hugely dynamic.
So, with part of the picture in hand, a researcher begins to look for familiar components: perhaps a handful of genes, metabolites, or proteins, or a few pathways that have been implicated in recent studies. This means that researchers have to be aware of developments in their own field and in related fields. By definition this is the job of a researcher, but given the numbers we discussed in the previous part, it is impossible for any human being to read that many papers.
On top of this, there are reproducibility concerns (reproducibility project concerns). This forces researchers, who are human after all, to focus on the results that hold the most promise; natural selection is at play even here. The rest of the data is set aside, although it still contributes to the statistical power of the results. But if a cell is a dynamic entity, changes in one part of its machinery must be reflected in another part, where no one is looking.
This is both good and bad news. On one hand, a researcher is using only part of what is already “part of the data”, which yields fewer hypotheses, and those hypotheses stay very close to what is already known in the field. On the other hand, it means there is a great deal left to discover: links that nobody has dared to look at. It also means that existing data can be reused to generate new hypotheses and design new experiments.
To counter this, researchers should be aware of which biological contexts change significantly in their data, and they should also follow developments beyond their own field that might affect their research. Efforts to close this knowledge gap have already started. They began with algorithms that can handle high-throughput data.
A chain of algorithms such as GSEA (gene set enrichment analysis), GO enrichment, and ORA (over-representation analysis) started to give meaning to groups of genes that change in response to a perturbation (reference). A critical limitation of these algorithms is that their output is hard to interpret and they almost always give similar results, because the results largely depend on what has already been studied and established. Using them also requires familiarity with the software, since many of these algorithms do not come with a friendly user interface. Even so, these algorithms pushed a lot of research forward and helped researchers understand their data better. Newer algorithms such as GSVA and VIPER, some of them built on older ones, helped in seeing the data in a whole new light.
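To make the idea of over-representation analysis concrete, here is a minimal sketch in Python using a one-sided hypergeometric test, which is the usual statistical core of ORA. It is not the code of any particular tool. The gene names, the two "toy" gene sets, and the function names are all made up for illustration.

```python
from math import comb


def ora_p_value(universe_size, set_size, changed_size, overlap):
    """One-sided hypergeometric p-value: the probability of seeing at least
    `overlap` gene-set members among `changed_size` genes drawn at random
    from a universe of `universe_size` genes."""
    max_overlap = min(set_size, changed_size)
    total = comb(universe_size, changed_size)
    tail = sum(
        comb(set_size, k) * comb(universe_size - set_size, changed_size - k)
        for k in range(overlap, max_overlap + 1)
    )
    return tail / total


def over_representation(changed_genes, gene_sets, universe):
    """Return {gene_set_name: p_value} for every gene set."""
    universe = set(universe)
    changed = set(changed_genes) & universe
    results = {}
    for name, members in gene_sets.items():
        members = set(members) & universe
        overlap = len(changed & members)
        results[name] = ora_p_value(
            len(universe), len(members), len(changed), overlap
        )
    return results


# Hypothetical toy data: 20 genes, two gene sets, 5 changed genes.
universe = [f"g{i}" for i in range(20)]
gene_sets = {
    "toy_pathway_A": ["g0", "g1", "g2", "g3", "g4"],
    "toy_pathway_B": ["g10", "g11", "g12", "g13", "g14"],
}
changed_genes = ["g0", "g1", "g2", "g3", "g10"]

p_values = over_representation(changed_genes, gene_sets, universe)
for name, p in sorted(p_values.items(), key=lambda item: item[1]):
    print(f"{name}: p = {p:.4f}")
```

Four of the five changed genes fall in toy_pathway_A, so it gets a much smaller p-value than toy_pathway_B. The sketch also shows the limitation described above: the result can only ever point to gene sets that someone has already defined. A real analysis would also correct for testing many gene sets at once.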
Algorithms that can use and reuse older data also came into the picture. One example is CIBERSORT, which deconvolutes bulk transcriptomic data to infer the proportions of individual cell populations. Since one type of omics is only part of the full picture, integrative omics started to become popular. Algorithms such as CombiT, which can integrate transcriptomics and metabolomics, brought insights that were not possible before. Every year TCGA publishes integrative analyses of its own cancer data.
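The idea behind deconvolution can be shown with a deliberately tiny sketch. It assumes a bulk sample that is a mixture of only two cell types with known signature profiles, and it estimates the mixing fraction by ordinary least squares. This is a simplification chosen for illustration: CIBERSORT itself uses a support vector regression approach and handles many cell types. The signature values and names below are hypothetical.

```python
def estimate_fraction(bulk, signature_a, signature_b):
    """Least-squares estimate of the fraction f of cell type A, assuming
    bulk ~= f * signature_a + (1 - f) * signature_b for every gene.
    The estimate is clipped to the valid range [0, 1]."""
    diff = [a - b for a, b in zip(signature_a, signature_b)]
    denom = sum(d * d for d in diff)
    if denom == 0:
        raise ValueError("signatures are identical; fraction is not identifiable")
    num = sum((x - b) * d for x, b, d in zip(bulk, signature_b, diff))
    return min(1.0, max(0.0, num / denom))


# Hypothetical expression signatures for two cell types over four genes.
signature_t_cell = [10.0, 1.0, 5.0, 0.5]
signature_b_cell = [1.0, 8.0, 5.0, 4.0]

# Simulate a noise-free bulk sample containing 30% of the first cell type.
true_fraction = 0.3
bulk = [
    true_fraction * a + (1 - true_fraction) * b
    for a, b in zip(signature_t_cell, signature_b_cell)
]

estimated = estimate_fraction(bulk, signature_t_cell, signature_b_cell)
print(f"estimated fraction of first cell type: {estimated:.2f}")
```

On noise-free simulated data the estimate recovers the mixing fraction. Real bulk data is noisy and contains many cell types, which is why dedicated methods exist.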
It seems that as the data has grown, so has the number of algorithms that can provide insight into it. Each algorithm looks at the data from a different perspective and provides a different insight. Which insight to test further is up to the individual researcher, but to get to that stage they must know these algorithms and know which one is likely to lead them to a useful next experiment.
This is a perfect catch-22, and it requires an elegant solution. Let us think about it a little more. To learn more about cell biology, we are collecting more and more data, and we are devising more and more algorithms that can extract insights from that data, but we still need a human being to run these algorithms and interpret their results. If we believe this lab, there are 4900+ tools for various types of datasets. If we allow even one day for a person to learn one algorithm, and assume about 1000 relevant algorithms, then it will take one person roughly 3 years to fully analyze one dataset. Subsequent datasets will take much less time, since that person will already know the algorithms. But once the data size doubles or triples, it will become almost impossible for one person to do this job. And we have not even started to look at the full picture of the cell yet.
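The back-of-the-envelope estimate above is easy to reproduce. The sketch below takes the one-day-per-algorithm figure from the text. The choice between counting every calendar day and counting only about 250 working days per year is an added assumption, and the function name is made up for illustration.

```python
def years_to_learn(n_algorithms, days_per_algorithm=1.0, days_per_year=365):
    """Years needed to learn `n_algorithms` one after another."""
    return n_algorithms * days_per_algorithm / days_per_year


# About 1000 relevant algorithms, learning every single calendar day.
relevant_calendar = years_to_learn(1000)

# The same 1000 algorithms, assuming roughly 250 working days per year.
relevant_working = years_to_learn(1000, days_per_year=250)

# All 4900 tools mentioned in the text, learning every calendar day.
all_tools_calendar = years_to_learn(4900)

print(f"1000 algorithms, calendar days: {relevant_calendar:.1f} years")
print(f"1000 algorithms, working days:  {relevant_working:.1f} years")
print(f"4900 tools, calendar days:      {all_tools_calendar:.1f} years")
```

Without any days off, 1000 algorithms come to just under 3 years; with a normal working year the figure is closer to 4. Either way the conclusion in the text holds.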
The volume of data will almost certainly grow faster than the number of researchers in the world. It therefore makes sense to automate the process of interpretation.