Over the last two decades, advances in engineering and sequencing technology have greatly increased the generation and availability of biological data. However, extracting useful insights from these data will require a major upgrade in our methods of data analysis. Biological systems are so complex that drawing actionable insights means integrating biological and chemical data of different types and from different sources.
The biomolecular data within the scope of our discussion typically include the following:
These biomolecular data are routinely gathered and used alongside other types of data such as patient clinical records and imaging throughout life sciences, pre-clinical and clinical research.
Conventionally, these data have been analysed using statistical methods, which are largely limited to fitting functions of a known form to the data. Machine Learning (ML) extends this idea by fitting functions whose form is not specified in advance, with the aim of generalising the results to new data that the model has never seen before.
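The contrast between fitting a known function and fitting an unknown one can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the data are synthetic (a noisy sine curve standing in for an unknown biological response), the "known function" is a straight line, and k-nearest-neighbour regression is used as one simple example of a method that assumes no functional form. The helper names `fit_line`, `fit_knn` and `mse` are hypothetical.

```python
import math
import random


def fit_line(xs, ys):
    """Ordinary least squares for the assumed form y = a*x + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = sxy / sxx
    b = mean_y - a * mean_x
    return lambda x: a * x + b


def fit_knn(xs, ys, k=3):
    """k-nearest-neighbour regression: no functional form is assumed."""
    pairs = list(zip(xs, ys))

    def predict(x):
        nearest = sorted(pairs, key=lambda p: abs(p[0] - x))[:k]
        return sum(y for _, y in nearest) / k

    return predict


def mse(model, xs, ys):
    """Mean squared error of a model on the given points."""
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Synthetic data (an assumption): a noisy sine curve over one period.
rng = random.Random(0)
xs = [2 * math.pi * i / 119 for i in range(120)]
ys = [math.sin(x) + rng.gauss(0, 0.1) for x in xs]

# Hold out every second point so both models are scored on unseen data.
train_x, train_y = xs[0::2], ys[0::2]
test_x, test_y = xs[1::2], ys[1::2]

line_mse = mse(fit_line(train_x, train_y), test_x, test_y)
knn_mse = mse(fit_knn(train_x, train_y, k=3), test_x, test_y)

print(f"straight line, held-out MSE: {line_mse:.3f}")
print(f"3-nearest neighbours, held-out MSE: {knn_mse:.3f}")
```

On this toy curve the straight line cannot follow the shape of the data, while the nearest-neighbour model can, so its held-out error is lower. The flexibility comes at a price: a model that assumes nothing about the function depends entirely on the data it is given.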
The use of Machine Learning in the pharmaceutical and biotech industries is growing, as evidenced by the increasing number of collaborations between large pharmaceutical companies and artificial intelligence companies across their pre-clinical and clinical programs (see the figure below). The number of these partnerships has been increasing exponentially over the last 5 years, indicating increased trust in and adoption of AI tools in the industry, as well as recognition of the need for collaboration in the space.
Just last year, a biotech company, Evotec, announced that it is taking an anti-cancer drug developed with Exscientia, a company that uses Artificial Intelligence (AI) for the discovery of small-molecule drugs, to phase I clinical trials. Several diagnostic tests that have been tested in trials or are under investigation use ML techniques to identify novel biomarkers. At Elucidata, we have used Machine Learning on multi-omics data to find multiple drug targets for Acute Myeloid Leukemia that have been validated experimentally.
For analysing biomolecular data, ML models prove to be very handy in four ways:
For the above reasons, Machine Learning is fast becoming an indispensable tool for the analysis of biomolecular data. However, it also comes with a cautionary tale. A recent review of Machine Learning studies using COVID imaging data found that most had low-quality input data, biases in the data, or problems with the ML methodology, so that many of the resulting models were not reliable enough to apply in clinical settings. Another excellent review paper by Google explains that in many real-world applications outside of internet companies, where the available datasets are often small, the quality of data is more important than the quantity. Not paying attention to the data can lead not just to lost time and sunk cost, but to adverse real-world impact as well. ML should therefore not be thought of as a magic wand: running a successful ML project requires you to rigorously assess both your data and your methodology. The new paradigm of data-centric ML¹ is especially relevant here, because it aims to enable accurate predictions with relatively small amounts of training data, as is often the case in applications using biomolecular data.
If you decide to start using Machine Learning in your research, we recommend a data-centric approach. It is generally most useful when the amount of available data is relatively small and model performance can be improved considerably through systematic improvements in data quality that incorporate domain knowledge.
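As a rough illustration of what "improving the data rather than the model" can look like, the sketch below keeps the model fixed (a 1-nearest-neighbour classifier) and changes only the training data. The dataset is a toy, hand-built assumption: one integer feature, two classes, and three deliberately flipped labels. The filter `drop_suspect_labels` is a hypothetical helper that drops samples whose label disagrees with the majority of their neighbours. It is one simple stand-in for data improvement, not a prescribed method. In real projects, domain knowledge would usually guide which samples to re-inspect or relabel rather than discard.

```python
from collections import Counter


def nearest_labels(x, data, k):
    """Labels of the k samples in `data` whose feature is closest to x."""
    ranked = sorted(data, key=lambda item: abs(item[0] - x))
    return [label for _, label in ranked[:k]]


def predict_1nn(x, train):
    """The fixed model: copy the label of the single nearest sample."""
    return nearest_labels(x, train, 1)[0]


def accuracy(train, test):
    hits = sum(predict_1nn(x, train) == y for x, y in test)
    return hits / len(test)


def drop_suspect_labels(train, k=5):
    """Keep only samples whose label matches the majority of their k neighbours."""
    kept = []
    for i, (x, y) in enumerate(train):
        others = train[:i] + train[i + 1:]
        majority = Counter(nearest_labels(x, others, k)).most_common(1)[0][0]
        if majority == y:
            kept.append((x, y))
    return kept


# Toy data (an assumption): class 0 at 0..190, class 1 at 300..490.
clean = [(10 * i, 0) for i in range(20)] + [(300 + 10 * i, 1) for i in range(20)]

# Simulate labelling errors by flipping three training labels.
flipped = {50, 120, 370}
noisy = [(x, 1 - y) if x in flipped else (x, y) for x, y in clean]

# Held-out points sit slightly off the training grid, with true labels.
test = [(x + 4, y) for x, y in clean]

before = accuracy(noisy, test)
cleaned = drop_suspect_labels(noisy)
after = accuracy(cleaned, test)

print(f"samples kept: {len(cleaned)} of {len(noisy)}")
print(f"accuracy with noisy labels:   {before:.3f}")
print(f"accuracy after data cleaning: {after:.3f}")
```

In this toy setup the filter removes exactly the three flipped samples, and held-out accuracy rises from 0.925 to 1.000 without any change to the model. Real biomolecular data are rarely this clean-cut, so the thresholds and the decision to drop versus relabel would need to be checked against domain knowledge.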
In order to execute a data-centric ML project, we recommend the following steps:
Mathematical modelling of biology, and predicting the behaviour of biological systems, has been hard because of the extremely high level of complexity and the large variability in system behaviour. This is unlike non-living physical systems, where at least the equations governing the system behaviour are usually well known. This is where Machine Learning can make a huge difference. Machine Learning is not a magic wand, but it has a lot of predictive power if used judiciously with a data-centric approach. ML algorithms can learn patterns in the data that may be too complex for the human mind to decipher. However, collecting and manually labelling the large amounts of biomolecular data usually required to train ML algorithms is both expensive and time-consuming. For most practical problems, scientists therefore have to find ways to work with a relatively small amount of data, roughly a few hundred to a thousand data points. Even in other ML domains such as language and image processing, a data-centric approach has proved necessary despite the availability of large pre-trained models that enable predictions with relatively small amounts of task-specific training data. We therefore expect research groups that follow a data-centric approach for ML applications in biology to clearly outperform groups that do not. This will also usher in a new data-driven approach to science.
¹ Data-centric techniques focus on systematically improving the data to boost the performance of established prediction models. In addition to general statistical methods of improving data, domain knowledge integration is also an essential part of data improvement when working with biomolecular data.