Metabolites are building blocks of cellular function and play a crucial role in sustaining life. They take part in enzyme-catalyzed chemical reactions, are essential for cellular functions like metabolism and energy storage, and have secondary functions in cell-to-cell signaling and virulence. As the final downstream products of cellular processes, metabolites are the best representation of the molecular phenotype of health and disease. The metabolome holds a wealth of information and is thought to be the most predictive of phenotype, yet there is still much to learn about it. Uncovering this knowledge is a work in progress.
With advancements in omics technologies, researchers are focusing on the fundamentals of biological systems and building a holistic picture of biology. Metabolomics is a critical comparative tool for studying global metabolite levels in samples under different conditions. The field is still evolving. The community has developed a range of instrument methodologies and has leveraged many data processing and analysis approaches that accelerate the measurement of metabolite levels directly from cells and tissues. This has helped researchers expand their knowledge of the distinct roles of metabolites and could lead to the discovery of novel metabolites and metabolic pathways.
Metabolomics experiments are performed by either targeted or untargeted methods. Targeted metabolomics focuses on measuring a defined set of known metabolites of interest, with opportunities for absolute quantitation. Once the specific compounds are quantified, their levels are compared to established reference ranges. This approach is used to test a hypothesis. In contrast, untargeted metabolomics focuses on collecting data on, and relatively quantifying, the small molecules in a sample without pre-existing knowledge of what is there. This approach is most useful for generating hypotheses.
While untargeted metabolomics, or hypothesis-generating workflows, have many valuable attributes and allow a deeper understanding of disease states, multiple roadblocks limit their wider adoption.
Why is untargeted metabolomics analysis not routinely performed?
Profiling thousands of metabolites with different chemical properties in a biological sample brings unique challenges and bottlenecks. In this blog, we address the obstacles in five significant aspects of an untargeted metabolomics experiment.
Problem 1: Noise and redundancy in datasets
In the past two decades, advances in chromatography and mass-spectrometry methods have significantly improved our capacity to obtain data from each biological sample. However, this comes with the critical challenge of weeding out the inherent noise in these datasets.
In one recent study, more than 25,000 features were detected. However, a large proportion of these were non-reproducible (not present in all relevant samples), adducts, contaminants, or artifacts, and the dataset ultimately yielded fewer than 1,000 metabolites.
Figure 1: Inherent noise in metabolomics dataset
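The reproducibility criterion mentioned above (a feature must be present in all relevant samples) can be sketched in a few lines of Python. This is a minimal illustration, not the method used in the study: the function names, the feature and sample IDs, and the intensity values are all made up for the example.

```python
from typing import Dict, List

# Hypothetical layout: feature ID -> {sample ID -> measured intensity}.
Features = Dict[str, Dict[str, float]]


def is_reproducible(intensities: Dict[str, float],
                    samples: List[str],
                    min_intensity: float = 0.0) -> bool:
    """Return True if the feature is detected in every relevant sample.

    A sample that is missing from the mapping counts as not detected.
    """
    return all(intensities.get(s, 0.0) > min_intensity for s in samples)


def keep_reproducible(features: Features,
                      samples: List[str],
                      min_intensity: float = 0.0) -> Features:
    """Drop features that are absent from at least one relevant sample."""
    return {fid: vals for fid, vals in features.items()
            if is_reproducible(vals, samples, min_intensity)}


# Toy data, for illustration only.
features = {
    "f1": {"s1": 1200.0, "s2": 980.0, "s3": 1100.0},  # present everywhere
    "f2": {"s1": 450.0, "s3": 30.0},                   # missing in s2
    "f3": {"s1": 800.0, "s2": 0.0, "s3": 760.0},       # zero intensity in s2
}
samples = ["s1", "s2", "s3"]

kept = keep_reproducible(features, samples)
print(sorted(kept))  # ['f1']
```

A filter like this only addresses non-reproducible features. Adducts, contaminants, and artifacts need separate handling.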
Problem 2: Subjective expert curation
Datasets often contain peak-groups, or features, for which it is difficult to reach a consensus on whether they should be curated as real signal or noise. These are the ambiguous peak-groups. Some experts rely on cohort information to make such decisions, while others judge by the Gaussian fit of the detected peaks. This leads to variability in peak-group curation, which is undesirable for an organization and makes the results hard to reproduce (Figure 2).
Figure 2: Consensus amongst 6 Experts on 6 different datasets
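One simple way to put a number on this kind of disagreement is to count how many peak-groups receive the same label from every expert. The sketch below assumes each peak-group has one "signal" or "noise" label per expert; the function name, the peak-group IDs, and the labels are hypothetical and are not taken from the datasets behind Figure 2.

```python
from collections import Counter
from typing import Dict, List, Tuple


def consensus_summary(labels_by_group: Dict[str, List[str]]) -> Tuple[float, List[str]]:
    """Return the fraction of peak-groups with unanimous labels,
    plus the IDs of the peak-groups on which the experts disagree."""
    unanimous = 0
    ambiguous = []
    for group_id, labels in labels_by_group.items():
        # Count of the most frequent label for this peak-group.
        _, top_count = Counter(labels).most_common(1)[0]
        if top_count == len(labels):
            unanimous += 1
        else:
            ambiguous.append(group_id)
    return unanimous / len(labels_by_group), ambiguous


# Toy curation by six experts, for illustration only.
curation = {
    "pg1": ["signal"] * 6,
    "pg2": ["signal", "signal", "noise", "signal", "noise", "noise"],
    "pg3": ["noise"] * 6,
    "pg4": ["noise"] * 5 + ["signal"],
}

fraction, ambiguous = consensus_summary(curation)
print(fraction)   # 0.5
print(ambiguous)  # ['pg2', 'pg4']
```

Unanimity is a deliberately strict measure. A chance-corrected statistic such as Fleiss' kappa would be a more rigorous choice for comparing agreement across datasets.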
Problem 3: Long and tedious curation process
To make untargeted metabolomics analysis tractable, experts often use filtering strategies such as minimum peak intensity, signal-to-baseline ratio, signal-to-blank ratio, and peak quality. Strict thresholds result in a handful of high-confidence peak-groups. The clear disadvantage is that low-abundance signals are missed. Fewer metabolites lead to weaker hypotheses and fewer insights, which keeps us from exploring metabolomics datasets to their full potential. At more relaxed thresholds, the user faces the challenge of manually weeding out the noise, which is time-consuming and requires expertise.
Figure 3: From peak-detection to metabolite identification
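The trade-off between strict and relaxed thresholds can be illustrated with a small Python sketch. It uses only two of the filters named above, minimum peak intensity and signal-to-blank ratio. The `PeakGroup` class, the peak-group names, the intensities, and the threshold values are all assumptions chosen for the example, not recommended settings.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PeakGroup:
    name: str
    max_intensity: float    # highest intensity across the samples
    blank_intensity: float  # intensity measured in the blank


def passes(pg: PeakGroup, min_intensity: float, min_signal_to_blank: float) -> bool:
    """Apply a minimum-intensity and a signal-to-blank filter."""
    # A zero blank means there is no background to compare against.
    if pg.blank_intensity > 0:
        ratio = pg.max_intensity / pg.blank_intensity
    else:
        ratio = float("inf")
    return pg.max_intensity >= min_intensity and ratio >= min_signal_to_blank


def filter_names(groups: List[PeakGroup],
                 min_intensity: float,
                 min_signal_to_blank: float) -> List[str]:
    return [pg.name for pg in groups
            if passes(pg, min_intensity, min_signal_to_blank)]


# Toy peak-groups, for illustration only.
peak_groups = [
    PeakGroup("pg_high", 5e6, 1e4),     # abundant, far above the blank
    PeakGroup("pg_low", 2e4, 2e3),      # low abundance, still 10x the blank
    PeakGroup("pg_noise", 1.5e4, 1e4),  # barely above the blank
]

strict = filter_names(peak_groups, min_intensity=1e5, min_signal_to_blank=10.0)
relaxed = filter_names(peak_groups, min_intensity=1e4, min_signal_to_blank=1.0)

print(strict)   # ['pg_high']
print(relaxed)  # ['pg_high', 'pg_low', 'pg_noise']
```

The strict setting drops the low-abundance peak-group along with the noise, while the relaxed setting keeps both and leaves the noise to be removed by hand.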
Problem 4: Human error and inconsistencies
People make mistakes. Carelessness, time pressure, and long manual curation sessions are common causes of inconsistencies. For each peak-group, it is often difficult to zoom in and out or to hide samples with extreme intensities before coming to a decision. Because the process is often spread over several days, even a single curator can find it difficult to stay consistent.
Problem 5: Computationally intensive and expensive
Acquiring metabolomics data on ultra-high-throughput, high-resolution instruments, in different modes, with full scans, and across multiple samples, often produces gigabytes of data and increases our dependency on high-end computational resources for processing and analysis. This poses a new set of challenges: managing cloud resources and setting up the environment for data ingestion, processing, and analysis.
Elucidata provides a machine learning solution, Polly-PeakML, for classifying detected features as real signal or noise. The key to any successful machine learning model is a comprehensive training dataset and well-defined feature descriptors. To build one, five metabolomics experts curated more than 6000 peak-groups derived from multiple vendor-agnostic LC-MS datasets. The unique characteristics of these curated peak-groups were encapsulated as mathematical descriptors and used to train an XGBoost classifier, one of the most advanced boosted decision tree methods.
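To give a sense of what a mathematical descriptor of a peak might look like, here is a minimal Python sketch of one candidate: how closely a peak's shape follows a Gaussian, the criterion mentioned under Problem 2. This is an assumption made for illustration. The actual descriptors used by Polly-PeakML are not listed in this post, and the function below is hypothetical. It estimates the Gaussian from intensity-weighted moments rather than a least-squares fit.

```python
import math
from typing import List


def gaussian_fit_r2(times: List[float], intensities: List[float]) -> float:
    """Score how Gaussian a peak looks (close to 1.0 = very Gaussian).

    The Gaussian's center and width come from intensity-weighted moments
    and its height from the maximum intensity; the score is the R^2
    between the observed and the modeled intensities.
    """
    total = sum(intensities)
    if total <= 0:
        return 0.0
    mean = sum(t * i for t, i in zip(times, intensities)) / total
    var = sum(i * (t - mean) ** 2 for t, i in zip(times, intensities)) / total
    if var <= 0:
        return 0.0
    height = max(intensities)
    fitted = [height * math.exp(-((t - mean) ** 2) / (2 * var)) for t in times]

    mean_obs = total / len(intensities)
    ss_res = sum((o - f) ** 2 for o, f in zip(intensities, fitted))
    ss_tot = sum((o - mean_obs) ** 2 for o in intensities)
    if ss_tot == 0:
        return 0.0
    return 1.0 - ss_res / ss_tot


# Toy traces, for illustration only.
times = [float(t) for t in range(41)]
clean_peak = [1000.0 * math.exp(-((t - 20.0) ** 2) / (2 * 3.0 ** 2)) for t in times]
jagged_trace = [100.0 if int(t) % 2 == 0 else 900.0 for t in times]

clean_score = gaussian_fit_r2(times, clean_peak)
jagged_score = gaussian_fit_r2(times, jagged_trace)
print(clean_score > 0.99)   # True
print(jagged_score < 0.5)   # True
```

In a setup like the one described above, several such descriptors per peak-group, together with the experts' signal or noise labels, would form the training table for the classifier.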
Polly-PeakML is a one-stop solution for automatically classifying peak-groups into real signal and noise, with an accuracy of 94% on validation datasets. It behaves consistently and gives reproducible results. The ML model circumvents the need to optimize peak-detection parameters, so users can include low-abundance features without worrying about removing noise. Polly-PeakML is integrated into El-MAVEN and offers a user-friendly interface for easy adoption. Together with the Polly platform, it enables researchers to classify thousands of detected features in a matter of minutes (120 times faster!).
With the help of machine learning, Polly-PeakML can help you solve the challenges in untargeted metabolomics. Polly manages the technology so that you can focus on high-level research. Book a session today to make the most of your work.