
Data as a Product: Unveiling Evolution and Impact in Life Sciences R&D

Swetabh Pathak
January 18, 2024

Data is the product now. It has been for a while. After all, data is the new oil. At the recent COP conference, all the oil-producing countries agreed that by extension data is causing climate change. Oil isn’t the culprit. After all, oil reserves are constant. Data is exploding. 

Jokes aside, access to data is imperative for machine learning. High-quality data powers useful ML applications. One has to be creative about getting data. If you try too hard, though, you get sued by the NYT. Well, who reads that liberal rag anyway!

The world of biology is no exception. Some of the biggest initiatives in life sciences R&D today are around creating ML-ready data corpora. The hope is that scientists will use these corpora to do data-driven science rather than hypothesis-driven science. The increasing adoption of ML makes this more plausible than before.

The UK Biobank is the largest such data corpus. Over more than a decade, this initiative has built the largest longitudinal repository of genomic data available to mankind. Many other countries are now following its example.

A different kind of initiative is the Human Cell Atlas, backed by the Chan Zuckerberg Initiative (CZI). With the stated aim of managing all human diseases by the end of the 21st century, CZI is creating the most detailed data corpus of life at the cellular level. By its own account, CZI wants to 'map' the 37.2 trillion cells in the human body.

CZI and the UK Biobank are just the start. They are to biology what the open-source software revolution was to software. Open-source software gave individuals and companies a (more) level playing field on which to innovate. A publicly available, large data corpus will lower the barrier to entry for finding novel biomedical insights.

Sure enough, scientists are using these datasets to create ChatGPT-like models. The hope is that ML models can synthesize the information better than humans can. The latest one - called scGPT - is trained on more than 33 million cells - so-called single-cell data. While that seems like a large number - and it surely is - it is, unfortunately, in many ways too small. Remember the 37.2 trillion cells in the human body? This is smaller by six orders of magnitude. That is even without accounting for the fact that cells behave differently in different conditions, in different bodies, and over time. So in reality we are probably off by many more orders of magnitude.
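As a quick back-of-the-envelope check on that gap (a rough sketch using only the two figures quoted above):

```python
# Back-of-the-envelope: how far is scGPT's ~33M-cell corpus from "all human cells"?
import math

cells_in_human_body = 37.2e12    # ~37.2 trillion cells (the Human Cell Atlas figure above)
cells_in_training_corpus = 33e6  # ~33 million single cells (the scGPT figure above)

ratio = cells_in_human_body / cells_in_training_corpus
print(f"Gap: ~{ratio:,.0f}x, i.e. about {math.log10(ratio):.1f} orders of magnitude")
# Gap: ~1,127,273x, i.e. about 6.1 orders of magnitude
```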

Not surprisingly, it isn't clear yet whether these large 'foundational models' in biology will be able to generate novel insights. There are positive early signs, but nothing path-breaking just yet.

There are two ways to move forward - and it's not an either/or. One is improved model architectures; the other is a larger data corpus. The first, fundamentally different models, is harder to anticipate. The second, a larger data corpus, is a bit easier to speculate about. My reasonable guess is that we will have 100 (maybe 1,000?) times the amount of single-cell data available in the next 5 years. This doesn't account for the fact that one could also combine other kinds of data - papers, bulk RNA-seq, images, etc. If one were to add all that up, we are looking at maybe 1,000 times the amount of information that is currently available to a model like scGPT. We know that these models get better with increasing amounts of information - not just incrementally, but often with a step change.
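To put that guess in concrete numbers (a minimal sketch of the speculative scenario above, not a forecast):

```python
# Sketch of the speculative growth scenario above - not a forecast
current_single_cell_corpus = 33e6  # ~33 million cells available to a model like scGPT today

for factor in (100, 1000):
    projected = current_single_cell_corpus * factor
    print(f"{factor}x more data -> ~{projected / 1e9:.1f} billion cells")
# 100x more data -> ~3.3 billion cells
# 1000x more data -> ~33.0 billion cells
```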

So, in my view, it is safe to say that a few years from now we will start seeing some 'fundamental models of biology'. Would their predictions then be much better than the opinion of an informed scientist who has studied those cells for years? Most likely not in all situations. But these models would have some advantages - information retrieval at scale, ingesting more knowledge faster, and so on.

Coming back to where we started: is data a product in life sciences R&D too? Biopharma, for one, has always thought about its data as the product - data that is closely guarded, data about trials or a particular molecule. It's the 'data' from trials that brings much cheer to scientists and Wall Street alike. But this is evolving. There is a recognition that molecular and clinical data in and of itself could be a competitive advantage - just as, for companies like OpenAI, a big differentiator is the years spent creating clean datasets from the internet. Hence initiatives like this from new-age biopharma companies like Owkin.

A few years ago, access to samples and patients was often the biggest competitive differentiator in life sciences R&D. Combined with technology (like single-cell) or a biological hypothesis, this could be a multi-billion dollar moat. 

Now it's not just the samples but the data generated from them. Data has to be nurtured and loved if questions are to be answered quickly. It's a resource to be maintained and tended to. Only then will the ChatGPTs of the biological world speak the language of life.

P.S. - This blog was originally published on Polly Bits - our newsletter on LinkedIn for all things data. Subscribe here!
