As we continue to equip our academic and pharmaceutical partners with machine-actionable biomolecular data, we’re focused on meeting the needs of multidisciplinary R&D teams across the drug discovery process. Polly’s recent updates include a range of features and optimizations that simplify data discoverability, annotation and harmonization, integrative data analysis and collaborative research.
1. Evolving to enterprise grade permissions: View Only access on Polly Workspaces
User authorization is crucial to labs in academia and the pharmaceutical industry, where organizations often worry about IP protection and data security. This is especially true for labs that need to collaborate on data, code and reports with stakeholders outside their organizations. To address these complex collaboration requirements, user permissions for Workspaces on Polly have evolved to include View-Only access.
How this works
Admins from customer organizations now have the ability to create new Polly accounts. Earlier, these accounts could be assigned either an “Admin” or a “User” role. In this update, we’ve introduced a new role called “Viewer”, under which users can only download files, notebooks and reports from Workspaces. Viewers have limited functionality elsewhere on Polly, i.e. they cannot access other sections like Data Lakes, run Polly Notebooks or view analyses present on the Workspace.
View-only access also extends to sharing workspaces with collaborators. Along with the existing ‘Admin’ and ‘Writer’ permissions, collaborator accounts can now be assigned as viewers on the workspaces shared with them.
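The account roles described above can be pictured as a simple permission lookup. The sketch below is illustrative only: the role names come from this post, but the action names and the `can` helper are hypothetical and are not part of any Polly API. It also assumes that Admin and User accounts keep every action listed here.

```python
from enum import Enum


class Role(Enum):
    ADMIN = "Admin"
    USER = "User"
    VIEWER = "Viewer"


# Hypothetical action names, chosen for illustration only.
VIEWER_ACTIONS = frozenset({"download_file", "download_notebook", "download_report"})
RESTRICTED_ACTIONS = frozenset({"access_data_lakes", "run_notebook", "view_analyses"})


def can(role: Role, action: str) -> bool:
    """Return True if the role may perform the action in this toy model."""
    if role is Role.VIEWER:
        # Viewers can only download files, notebooks and reports.
        return action in VIEWER_ACTIONS
    # Assumption: Admin and User roles keep every action in this toy model.
    return action in VIEWER_ACTIONS | RESTRICTED_ACTIONS


print(can(Role.VIEWER, "download_report"))  # allowed for a Viewer
print(can(Role.VIEWER, "run_notebook"))     # blocked for a Viewer
```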
2. Leverage Custom Pipelines best suited to your Data
Various labs and organizations routinely generate vast amounts of multi-omics data. However, this deluge of data isn’t useful without the capability to build suitable analysis pipelines. Our in-house bioinformatics experts help bridge this gap by offering production-ready pipelines and applications for unique data curation, analysis and visualization needs.
Automating compound screening experiments for a Cancer Research Organization
Compound screening is a high-throughput experiment that involves analysis of >800 compound modifications on target proteins. The process, in which raw LC/MS data is mapped to a compound ID to determine the modification, is largely manual and error-prone.
To simplify this process, a Proteomics Compound Screening pipeline and application were developed for one of our partner organizations. The pipeline, developed on Polly CLI, uses GNU Parallel to automate the mapping of compounds to LC/MS data. An interactive Shiny application that visualizes the analysis results was also developed for the partner.
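To give a feel for the parallel mapping step, here is a minimal sketch. It is not the partner pipeline: it swaps GNU Parallel for Python's `concurrent.futures`, and the compound library, the mass values, the tolerance and the matching rule (closest expected mass shift within a tolerance) are all assumptions made for illustration.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, Optional

# Hypothetical compound library: compound ID -> expected mass shift (Da).
COMPOUND_MASSES: Dict[str, float] = {
    "CMPD-001": 250.10,
    "CMPD-002": 312.25,
    "CMPD-003": 401.05,
}
TOLERANCE_DA = 0.05  # assumed matching tolerance


def match_compound(observed_shift: float) -> Optional[str]:
    """Return the ID of the closest compound, or None if none is within tolerance."""
    best_id, best_mass = min(
        COMPOUND_MASSES.items(), key=lambda kv: abs(kv[1] - observed_shift)
    )
    return best_id if abs(best_mass - observed_shift) <= TOLERANCE_DA else None


def map_samples(observed_shifts: Dict[str, float], workers: int = 4) -> Dict[str, Optional[str]]:
    """Map every sample to a compound ID, processing samples concurrently."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map preserves input order, so results line up with the keys.
        results = pool.map(match_compound, observed_shifts.values())
        return dict(zip(observed_shifts.keys(), results))


# Made-up observed mass shifts for three samples.
samples = {"sample_A": 250.12, "sample_B": 312.90, "sample_C": 401.03}
print(map_samples(samples))
```

Samples whose observed shift does not fall within the tolerance of any library entry are mapped to `None`, so they can be flagged for manual review instead of being silently mislabeled.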
3. Access curated LINCS data on Polly
The volume of ML-ready data on Polly is increasing continually. Currently, the platform hosts more than 580,000 datasets, spanning 20+ data types from 50 public sources. Over the last month, 155k curated datasets (drug_pert + shRNA) were added from LINCS, a consortium that catalogs gene expression and phenotypic responses of diseased cell lines to various perturbing agents. The LINCS Data Lake on Polly enables:
- Access to preprocessed data that can be fed into expert-vetted pipelines on Polly (or proprietary ones), for downstream analysis through techniques such as differential gene expression analysis, pathway enrichment, volcano plots etc.
- Application of this perturbation data in various biomedical studies – including drug repurposing, elucidation of drug MOA and identification of cell-type specific drug sensitivities.
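As a small worked example of the downstream analysis mentioned above, the sketch below computes the two coordinates of a volcano plot for a single gene: the log2 fold change between perturbed and control samples, and the negative log10 of a p-value. The function names, the pseudocount, the cutoffs and the expression values are assumptions for illustration, and the p-value is taken as already computed by a statistical test of your choice.

```python
import math
from statistics import mean
from typing import Sequence, Tuple


def volcano_point(
    treated: Sequence[float],
    control: Sequence[float],
    p_value: float,
    pseudocount: float = 1.0,
) -> Tuple[float, float]:
    """Return (log2 fold change, -log10 p-value) for one gene."""
    # The pseudocount avoids division by zero for unexpressed genes.
    log2_fc = math.log2((mean(treated) + pseudocount) / (mean(control) + pseudocount))
    return log2_fc, -math.log10(p_value)


def is_significant(
    log2_fc: float,
    neg_log10_p: float,
    fc_cutoff: float = 1.0,
    p_cutoff: float = 0.05,
) -> bool:
    """Flag genes that pass both the fold-change and the p-value cutoff."""
    return abs(log2_fc) >= fc_cutoff and neg_log10_p >= -math.log10(p_cutoff)


# Made-up expression values for one gene under perturbation vs. control.
x, y = volcano_point(treated=[7.0, 7.0, 7.0], control=[1.0, 1.0, 1.0], p_value=0.01)
print(x, y, is_significant(x, y))
```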
4. Share Interactive Dashboards generated using Voila
Bioinformatics experts often use Jupyter Notebooks to execute multi-omics workflows for data analysis and reporting. However, notebooks are not the best communication tool for all audiences. For instance, non-technical scientists may not favor the presence of code cells or the need to run a notebook to see the results of an analysis. In order to help bioinformaticians share an interactive dashboard of visualizations generated through a notebook environment, with their R&D team at large, we’ve integrated Voila Dashboards with Polly Notebooks.
How this works:
Voila works on a simple premise: create standalone web applications from Jupyter notebooks. It is a language-agnostic dashboarding system that supports interactive widgets and visualizations created on Polly Notebooks. Additionally, Voila enables users to create customizable layouts for charts and tables that can be organized in ways that help viewers understand complex relationships in data.
5. Discover Relevant Data through Polly Python
Polly Python facilitates powerful querying capabilities across dataset-, sample- and feature-level metadata through code. To understand how the Polly Library enables in-depth data exploration on your preferred computational environment, read this document.
Currently, if a user runs a query on Polly Python using keywords like ‘NASH’, the library searches for exact matches (to NASH) and returns relevant datasets. However, related disease terms such as ‘non-alcoholic fatty liver disease’, ‘nonalcoholic steatohepatitis’, ‘nash-non-alcoholic steatohepatitis’ and ‘non-alcoholic steatohepatitis’, which would also return valid results, aren’t accounted for.
In order to build an intuitive search experience for users that simplifies access to a higher volume and variety of scientifically relevant datasets, we’ve introduced a range of optimizations to Polly Python.
How this works
Queries written on Polly Python are expanded through ontology tree mapping, which uses controlled vocabularies from sources like MeSH, Mondo Disease Ontology, NCI Thesaurus, Human Phenotype Ontology etc. Additionally, we’ve introduced two new arguments, “Expand” and “Related terms”, to increase the volume of scientifically relevant datasets returned through a query.
When a user searches for ‘Hepatocellular Carcinoma’ using the ‘Expand’ function, the query uses ontology tree mapping to include results from related keywords such as “adult hepatocellular carcinoma”, “pediatric hepatocellular carcinoma” etc. The ‘Related terms’ function lets users explore an even greater volume of scientifically relevant datasets. It reduces false negatives, searches for connections to keywords used in the query and presents more options for the user to select from.
To illustrate the increase in the volume of datasets associated with hepatocellular carcinoma through enhanced search, an SQL query using the “Expand” command was run on Polly Libraries. The command includes an additional 181 related disease terms, dramatically increasing the number of valid datasets that can be queried from the Liver OmixAtlas.
Output through a general query: 0 Datasets
Output through Enhanced Search (queries using the ‘Expand’ function): 1393 Datasets
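The idea behind ontology-based expansion can be sketched in a few lines of Python. This is a toy model and not the Polly Python API: the ontology fragment, the dataset labels and the `expand_term` and `search` helpers are hypothetical, and real vocabularies such as MeSH contain far more terms and relationships.

```python
from typing import Dict, List, Set

# Toy ontology fragment (parent term -> child terms), for illustration only.
ONTOLOGY: Dict[str, List[str]] = {
    "hepatocellular carcinoma": [
        "adult hepatocellular carcinoma",
        "pediatric hepatocellular carcinoma",
    ],
}


def expand_term(term: str, ontology: Dict[str, List[str]]) -> Set[str]:
    """Return the term plus all of its descendants in the ontology tree."""
    expanded: Set[str] = set()
    stack = [term.lower()]
    while stack:
        current = stack.pop()
        if current in expanded:
            continue  # guard against revisiting a term
        expanded.add(current)
        stack.extend(ontology.get(current, []))
    return expanded


def search(datasets: Dict[str, str], term: str, expand: bool = False) -> List[str]:
    """Return IDs of datasets whose disease label matches the (optionally expanded) term."""
    terms = expand_term(term, ONTOLOGY) if expand else {term.lower()}
    return sorted(ds_id for ds_id, disease in datasets.items() if disease.lower() in terms)


# Hypothetical dataset ID -> disease label pairs.
datasets = {
    "DS1": "Adult hepatocellular carcinoma",
    "DS2": "Pediatric hepatocellular carcinoma",
    "DS3": "Breast cancer",
}
print(search(datasets, "Hepatocellular Carcinoma"))               # exact match only
print(search(datasets, "Hepatocellular Carcinoma", expand=True))  # with expansion
```

In this toy data no dataset is labeled with the exact parent term, so the exact-match query returns nothing while the expanded query picks up both child terms, mirroring the jump in results reported above.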
6. Monitor compute resources on El-Maven
El-Maven enables fast, powerful and interactive analysis of large metabolomics datasets. These experiments are usually computationally intensive and require variable resources and machine configurations. To help users optimize resource utilization, improve the performance of their analysis workflows and control the costs of running multiple El-Maven instances, we’ve introduced a Resource Monitor on the application.
How this works
When resource consumption reaches a pre-defined threshold of 80%, a notification is triggered. This prompts users to monitor the progress of their ongoing jobs and pick a bigger instance if required. The resource monitor also gives visibility into overall machine health, provides real-time data on CPU and RAM usage, and triggers an alert when CPU limits have been overshot.
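The threshold logic can be sketched as follows. This is not El-Maven's implementation: the `check_usage` helper and the message wording are hypothetical, and only the 80% threshold comes from the description above. Usage values are passed in directly, so the sketch makes no assumption about how they are measured.

```python
from typing import List

THRESHOLD_PERCENT = 80.0  # threshold stated in this post


def check_usage(
    cpu_percent: float,
    ram_percent: float,
    threshold: float = THRESHOLD_PERCENT,
) -> List[str]:
    """Return one notification per resource at or above the threshold."""
    alerts: List[str] = []
    for name, value in (("CPU", cpu_percent), ("RAM", ram_percent)):
        if value >= threshold:
            alerts.append(f"{name} usage at {value:.0f}%: consider a bigger instance")
    return alerts


print(check_usage(cpu_percent=85.0, ram_percent=40.0))
print(check_usage(cpu_percent=10.0, ram_percent=20.0))
```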
Coming up next month
Biomedical data continues to grow at an unprecedented rate. This ever-increasing volume widens the gap between data that is available and data that is usable for ML-based workflows. To bridge this gap, we’re introducing Data Pipelines / Connectors that continually curate and pre-process data from public or proprietary sources at high throughput.
Stay tuned to understand how this helps Computational/Data Scientists manage biomedical data within their organizations.