“Modern-day language models have already breached the trillion parameters mark, and they are only going to get bigger and more intelligent.”
The availability of genomics and other biomedical data is steadily rising. As a result, many machine learning (ML) models have been proposed for a wide range of biomedical tasks. Within ML, breakthroughs in natural language processing (NLP) have been extended to bioinformatics and genomics. Large language models can perform a variety of NLP tasks with accuracy approaching (and in some cases exceeding) that of humans, including speech recognition, named entity recognition, relation extraction, question answering, and sentiment analysis. The biomedical domain is now reaping the rewards of these large language models: bioinformaticians and computer scientists rely on them to make the best use of biomedical text, electronic health records, and protein and DNA sequences. In this article, we look at the current state of biomedical NLP and how state-of-the-art models are helping scientists navigate vast and complex biomedical data.
“Before 2018, technologies that could answer questions tended to be brittle, hard to build, and even harder to scale. The advent of neural language models changed that.” – Khattab et al.
Traditional NLP techniques rely on hand-crafted rules and heuristics. These rules require extensive domain knowledge to create and maintain, and they often fail to generalize. Recent deep learning approaches to named entity recognition (NER) and relation detection instead learn an optimal set of features automatically from a large corpus, without human intervention. Given a set of documents, current language models can recognize drug, gene, and disease terms and detect drug-gene or drug-disease relations.
The recent surge in NLP success can be largely attributed to transformer-based pre-trained language models. Depending on which components of the Transformer architecture they use, these models can be classified as:
- autoregressive language models (e.g. GPT-2, GPT-3)
- masked language models (e.g. BERT) and
- encoder-decoder models (e.g. T5).
Autoregressive language models are unidirectional: they are trained to predict the next word given all the previous words. Masked language models such as Google’s BERT predict a “masked” word conditioned on all the other words in the sequence; BERT uses only the encoder portion of the Transformer architecture. Encoder-decoder models such as T5 generate an output sequence of tokens conditioned on an input sequence.
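To make the distinction concrete, here is a minimal sketch of the autoregressive and masked objectives using the Hugging Face `transformers` library. The `gpt2` and `bert-base-uncased` checkpoints are illustrative choices, not models discussed in this article.

```python
from transformers import pipeline

# Autoregressive: GPT-2 continues the prompt left-to-right, one token at a time.
generator = pipeline("text-generation", model="gpt2")
print(generator("The gene BRCA1 is associated with", max_new_tokens=10)[0]["generated_text"])

# Masked: BERT fills in the [MASK] token using both left and right context.
fill_mask = pipeline("fill-mask", model="bert-base-uncased")
print(fill_mask("The patient was prescribed [MASK] to lower blood pressure.")[0]["sequence"])
```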
These models are good at learning highly effective contextual representations, which is critical to understanding biomedical literature. For instance, BERT can tell whether the word ‘bank’ refers to a riverbank or a financial institution based on the context in which it appears. Unlike unidirectional models, BERT’s bidirectional nature allows it to interpret a word based not just on the words that come before it, but also on the ones that come after it.
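A short sketch of this disambiguation, assuming the general-purpose `bert-base-uncased` checkpoint: the contextual vector for ‘bank’ should sit closer to another financial usage than to the river sense.

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def bank_vector(sentence):
    """Return the contextual embedding of the token 'bank' in the sentence."""
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]  # (num_tokens, 768)
    bank_id = tokenizer.convert_tokens_to_ids("bank")
    return hidden[inputs["input_ids"][0].tolist().index(bank_id)]

river = bank_vector("We walked along the bank of the river.")
money = bank_vector("She deposited the check at the bank.")
loan = bank_vector("The bank approved her loan application.")

cos = torch.nn.functional.cosine_similarity
print("finance vs finance:", cos(money, loan, dim=0).item())   # higher
print("finance vs river:  ", cos(money, river, dim=0).item())  # lower
```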
LMs at the Forefront of the Biomedical NLP Revolution
NLP’s foray into the biomedical domain has been driven by large language models such as BERT, which are being tweaked and deployed to grasp biomedical data in two broad settings:
- Clinical text biomarker mining: Predicting genomic biomarker status given a patient’s clinical notes.
- Biomedical literature mining: Recognizing entities in the literature and extracting relations among these entities (a minimal fine-tuning sketch follows this list).
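Both applications typically reduce to token or sequence classification on top of a pre-trained encoder. Below is a minimal, illustrative sketch of setting up a BERT checkpoint for biomedical NER with `transformers`; the label set is an assumption, and a real system would start from a biomedical checkpoint such as BioBERT.

```python
from transformers import AutoModelForTokenClassification, AutoTokenizer

# Hypothetical BIO label scheme for drug / gene / disease mentions.
labels = ["O", "B-DRUG", "I-DRUG", "B-GENE", "I-GENE", "B-DISEASE", "I-DISEASE"]

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
model = AutoModelForTokenClassification.from_pretrained(
    "bert-base-cased",
    num_labels=len(labels),
    id2label=dict(enumerate(labels)),
    label2id={label: i for i, label in enumerate(labels)},
)
# From here, the model is fine-tuned on an annotated biomedical corpus
# (e.g. with transformers.Trainer) so its classification head learns to
# tag entity mentions token by token.
```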
Text and literature mining is something conventional language models are already good at. ML researchers have used this to their advantage by adapting BERT-like models for biomedical applications, and there has been an upsurge in such models of late. Here are a few state-of-the-art models that are helping scientists navigate vast and complex biomedical data:
Ever since it was open-sourced, Google’s BERT has been widely adopted across NLP benchmarks, which has helped it spread to many NLP tasks. At Elucidata, such language models help scan biomedical literature and extract information that is later used to enhance search. Elucidata’s PollyBERT, built on top of BERT, enriches the way we access metadata from various data sources. A central pillar of PollyBERT (Polly’s curation infrastructure) is the use of ontologies and controlled vocabularies to annotate metadata fields such as disease, organism, cell line, tissue, cell type, drug, genotypic perturbation, and chemical perturbation. These annotations give users powerful mechanisms to query the data.
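As a deliberately simplified illustration of ontology-backed annotation (the dictionary and IDs below are invented placeholders, not Elucidata’s actual pipeline), normalization can be thought of as mapping free-text mentions to controlled-vocabulary identifiers:

```python
# Placeholder vocabulary: mention -> controlled-vocabulary ID. The IDs
# are made up for illustration and are not real ontology accessions.
DISEASE_VOCAB = {
    "non small cell lung cancer": "DISEASE:0001",
    "nsclc": "DISEASE:0001",
    "breast carcinoma": "DISEASE:0002",
}

def annotate_disease(mention):
    """Normalize a raw disease mention to a controlled-vocabulary ID."""
    return DISEASE_VOCAB.get(mention.strip().lower())

print(annotate_disease("NSCLC"))  # -> DISEASE:0001
```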
BioELMo is a biomedical version of embeddings from a language model (ELMo), pre-trained on PubMed abstracts.
BioBERT (Bidirectional Encoder Representations from Transformers for Biomedical Text Mining) is a domain-specific language representation model pre-trained on large-scale biomedical corpora. With almost the same architecture across tasks, BioBERT largely outperforms BERT and previous state-of-the-art models on a variety of biomedical text mining tasks.
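As a quick sketch, BioBERT can be loaded through `transformers` and used to produce contextual features for downstream biomedical tasks (we assume the `dmis-lab/biobert-v1.1` checkpoint published on the Hugging Face Hub):

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("dmis-lab/biobert-v1.1")
model = AutoModel.from_pretrained("dmis-lab/biobert-v1.1")

sentence = "Mutations in BRCA1 increase the risk of breast cancer."
inputs = tokenizer(sentence, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# One contextual vector per sub-word token; these feed downstream
# NER or relation-extraction heads.
print(outputs.last_hidden_state.shape)  # (1, num_tokens, 768)
```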
BioMedBERT is a neural deep contextual understanding model for question answering (QA) and information retrieval (IR) tasks. It is pre-trained on the BREATHE dataset, which contains abstracts and full-text articles from ten different biomedical literature sources.
ProtTrans provides state-of-the-art pre-trained models for proteins. The ProtTrans models were trained on thousands of GPUs from Summit and hundreds of Google TPUs using various Transformer architectures.
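A minimal sketch of embedding a protein sequence with one of the released ProtTrans models (`Rostlab/prot_bert` on the Hugging Face Hub); ProtBert expects amino acids to be space-separated, with rare residues mapped to “X”:

```python
import re
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("Rostlab/prot_bert", do_lower_case=False)
model = BertModel.from_pretrained("Rostlab/prot_bert")

sequence = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"  # illustrative sequence
spaced = " ".join(re.sub(r"[UZOB]", "X", sequence))  # rare residues -> X

inputs = tokenizer(spaced, return_tensors="pt")
with torch.no_grad():
    embeddings = model(**inputs).last_hidden_state  # one vector per residue

print(embeddings.shape)
```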
DNABERT enables a global and transferable understanding of genomic DNA sequences based on upstream and downstream nucleotide contexts. It allows direct visualization of nucleotide-level importance and semantic relationships within input sequences for better interpretability, and it accurately identifies conserved sequence motifs and functional genetic variants.
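The key preprocessing idea in DNABERT is representing a DNA sequence as overlapping k-mers. A minimal sketch of that tokenization step (k = 6, one of the values used in the paper; the sequence is illustrative):

```python
def seq_to_kmers(sequence, k=6):
    """Slide a window of size k over the sequence, one base at a time."""
    return [sequence[i : i + k] for i in range(len(sequence) - k + 1)]

print(seq_to_kmers("ATGGCGTACA"))
# ['ATGGCG', 'TGGCGT', 'GGCGTA', 'GCGTAC', 'CGTACA']
```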
In spite of all the advantages these LMs offer, they are resource-heavy and expensive to train. Most biomedical NLP applications are woven around costly models such as Google’s BERT or OpenAI’s GPT, which were not originally built for biomedical applications. Since training an LM from scratch is an uphill task for smaller research groups, researchers believe attention should be directed towards reducing model complexity, and eventually cost. Recent innovations in techniques such as quantization, pruning, and knowledge distillation try to address these challenges, and this is one area biomedical NLP researchers should watch if they are to drive down training costs.
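Of the techniques mentioned above, quantization is perhaps the easiest to try. A minimal sketch using PyTorch’s built-in dynamic quantization (the checkpoint is illustrative):

```python
import torch
from transformers import AutoModel

model = AutoModel.from_pretrained("bert-base-uncased")

# Convert the weights of all Linear layers to int8; activations are
# quantized on the fly at inference time. This typically shrinks the
# model substantially and speeds up CPU inference, at a small cost
# in accuracy.
quantized = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)
```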
Probing the embedding spaces of pre-trained LMs is another area that has interested researchers of late. Petroni et al. have already shown that a wide range of state-of-the-art pre-trained language models contain relational knowledge without any fine-tuning. Language models that can recall factual biomedical knowledge in this way could serve as highly performant unsupervised open-domain QA systems. Researchers also believe that the field of biomedicine deserves language models with new architecture designs tailored to protein sequences and other such sequence motifs.
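The probing setup of Petroni et al. amounts to cloze-style queries against a frozen masked LM, with no fine-tuning. A sketch with an illustrative checkpoint and prompt:

```python
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")

# Ask the frozen LM to complete a factual statement.
for prediction in fill_mask("Aspirin is used to treat [MASK]."):
    print(f"{prediction['token_str']:>12}  score={prediction['score']:.3f}")
```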
The recent pandemic has underlined the need for accelerated drug discovery, and biomedical NLP is a crucial element in addressing the bottlenecks. The ecosystem is rich in data resources but has yet to mature in terms of quality. Tools such as Elucidata’s Polly combine the enormous potential of NLP with data velocity, making it easy for the scientific community to address the delays of drug discovery without the additional overhead of software engineering.
References
- Language Models as Knowledge Bases?
- Med-BERT: pre-trained contextualized embeddings on large-scale…
- On the effectiveness of small, discriminatively pre-trained language representation models for biomedical text mining
- Pretrained Language Models for Biomedical and Clinical Tasks: Understanding and Extending the State-of-the-Art