As pharmaceutical organizations move towards a data-centric drug discovery paradigm, it is apparent that machine learning will take center stage. However, ML-driven drug design carries several challenges: the need for appropriate datasets, ML-ready data, and the ability to generate and test evolving hypotheses, to name a few. With these needs in mind, we’ve introduced a range of updates, features, and optimizations on Polly that ensure analysis-ready data remains central to drug discovery initiatives.
1. Explore a diverse collection of biomedical molecular data on a single platform
The volume of datasets on Polly is increasing continually. Currently, the platform hosts more than 700,000 ML-ready datasets spanning 24 data types and 22 public sources. Over the last month, 43.8k datasets were added from notable sources like HPA, CPTAC, and GTEx.
Additionally, 8 IL-2 studies were ingested from ImmPort, an immunological database and analysis portal that aims to warehouse FAIRified data collected from clinical and mechanistic studies on human subjects.
Why access ImmPort on Polly?
- Users can access 64 studies comprising curated Proteomics, Titer, Cytometry, and PCR data. These datasets are enriched with harmonized metadata annotations, processed through a standard pipeline, and made available in consistent tabular formats.
- Users can perform downstream analysis of this pre-processed data using expert-vetted or proprietary pipelines on Polly, with techniques such as differential gene expression analysis, pathway enrichment, volcano plots, and more.
What benefits does ImmPort data on Polly bring to users?
Evaluate variations in immune response between individuals of different age groups by comparing differences in gene expression, cytokine stimulation, and serum cytokines following the administration of vaccines or drugs. Use patient datasets available on ImmPort to detect new markers and mechanisms behind the regulation of immune responses to various disease conditions in humans.
2. Access pre-processed genomic data on Polly
Users can now access ML-ready genomic datasets on Polly from gnomAD, a database that aggregates and harmonizes exome and genome sequencing data from a variety of large-scale sequencing experiments. The platform currently hosts 180,000 datasets, including whole-exome sequencing data (70k) and whole-genome sequencing datasets (26k).
Why access gnomAD on Polly?
Datasets on gnomAD comprise an aggregation of variants from genome samples and are publicly available as VCF files for individual chromosomes. These files run into several hundred gigabytes, making downstream processing and storage of this data computationally intensive. Polly’s analytical pipeline splits these large VCF files into smaller parts based on the genomic coordinates of each gene. Variants that fall outside these genomic locations are collated into a single VCF file and also brought into Polly, ensuring no data is lost while processing these datasets.
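The splitting step described above can be pictured with a small sketch. This is not Polly’s actual pipeline, only an illustration of the idea under simple assumptions: the gene names and coordinates in `GENES` are hypothetical placeholders, the function name `split_vcf_by_gene` is made up for this example, and the VCF is small enough to hold in memory.

```python
from collections import defaultdict

# Hypothetical gene intervals: name -> (chromosome, start, end), 1-based inclusive.
# Placeholder values for illustration only, not real annotations.
GENES = {
    "GENE_A": ("1", 100, 200),
    "GENE_B": ("1", 500, 800),
}

def split_vcf_by_gene(vcf_lines, genes):
    """Group VCF records by the gene interval that contains their POS.

    Returns (header, per_gene, leftover): the header lines, a dict mapping
    each gene name to its records, and the records outside every interval.
    """
    header, per_gene, leftover = [], defaultdict(list), []
    for line in vcf_lines:
        line = line.rstrip("\n")
        if not line:
            continue
        if line.startswith("#"):
            header.append(line)  # meta-information and column header lines
            continue
        chrom, pos = line.split("\t", 2)[:2]
        pos = int(pos)
        hits = [name for name, (c, start, end) in genes.items()
                if c == chrom and start <= pos <= end]
        if hits:
            for name in hits:  # a variant inside overlapping genes is kept in each
                per_gene[name].append(line)
        else:
            leftover.append(line)  # collated separately so nothing is dropped
    return header, dict(per_gene), leftover

EXAMPLE = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "1\t150\t.\tA\tG\t.\tPASS\t.",
    "1\t300\t.\tC\tT\t.\tPASS\t.",
    "1\t800\t.\tG\tA\t.\tPASS\t.",
]
header, per_gene, leftover = split_vcf_by_gene(EXAMPLE, GENES)
```

Every input record ends up either in a per-gene group or in the leftover group, which mirrors the "no data is lost" property. Files of several hundred gigabytes would need streaming and an indexed interval lookup rather than this linear scan.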
Additionally, Polly offers point-and-click filtering, allowing users to filter or query datasets by gene of interest, pLoF variation, LOEUF score, or pathway. For instance, if a user wants to find genes associated with bile secretion, the pathway-level filter streamlines the search.
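For readers who think in code, the filters above amount to combining a few predicates over per-gene records. The sketch below is only a mental model, not Polly’s API: the `GeneRecord` type, the `filter_genes` function, the gene names, and every numeric value are hypothetical placeholders rather than real gnomAD values.

```python
from dataclasses import dataclass, field

@dataclass
class GeneRecord:
    symbol: str
    loeuf: float       # loss-of-function observed/expected upper bound fraction
    plof_count: int    # number of predicted loss-of-function variants
    pathways: set = field(default_factory=set)

# Placeholder records for illustration only; not real gnomAD values.
RECORDS = [
    GeneRecord("GENE_A", 0.20, 3, {"bile secretion"}),
    GeneRecord("GENE_B", 0.90, 12, {"glycolysis"}),
    GeneRecord("GENE_C", 0.60, 7, {"bile secretion", "glycolysis"}),
]

def filter_genes(records, pathway=None, max_loeuf=None, min_plof=None):
    """Return the records that pass every filter that was supplied."""
    selected = []
    for rec in records:
        if pathway is not None and pathway not in rec.pathways:
            continue
        if max_loeuf is not None and rec.loeuf > max_loeuf:
            continue
        if min_plof is not None and rec.plof_count < min_plof:
            continue
        selected.append(rec)
    return selected

# Genes annotated to the bile secretion pathway.
bile_genes = [r.symbol for r in filter_genes(RECORDS, pathway="bile secretion")]
# The same pathway, restricted to genes with a low LOEUF score.
constrained = [r.symbol for r in filter_genes(RECORDS, pathway="bile secretion",
                                              max_loeuf=0.35)]
```

Filters left as `None` are skipped, so criteria can be applied one at a time or combined, much like toggling filters in a GUI.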
What benefits does gnomAD data on Polly bring to users?
gnomAD exome and whole-genome datasets are prime data sources for disease-specific and population genetic studies. Data from gnomAD offers a layer of functional information on top of what researchers know from the genetic analysis of patients. These molecular insights help them analyze structural variations, multi-nucleotide variants, and 5’ upstream open reading frames in human samples, and identify loss-of-function variants in drug discovery.
3. Customize and scale compute resources on El Maven
El Maven is an open-source LC-MS data processing engine that is well suited to isotopomer labeling and untargeted metabolomic profiling experiments. The size of data generated through these experiments varies, requiring computation power ranging from a small machine to a large workstation for analysis and processing. Polly supports a wide selection of instances, comprising varying combinations of CPU, memory, and storage capacity, to support high-throughput metabolomics experiments.
Over the last month, several compute-optimized machine types were added to El Maven, allowing users to better scale their resources to fit the requirements of their target workload.
| Instance Size | vCPU | Memory (RAM) |
We have also introduced new GPU instances that support computationally intensive workflows and ML applications.
| Instance Size | GPU | vCPU | Memory (RAM) |
| gpusmall | 1 GPU | 8 | 60 GB |
4. Transfer large files securely through Polly File Transfer
Moving large files and massive sets of data quickly and securely across servers is a challenge many scientists in R&D teams face. Enterprise-level tools often require users to manually write commands to upload each file or folder, making the process needlessly cumbersome and time-consuming. For instance, large files/objects have to be uploaded to AWS S3 through a multi-part upload process using the AWS CLI or an AWS SDK.
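To give a sense of the bookkeeping a multi-part upload involves, here is a minimal sketch that only plans the parts for a file of a given size. It makes no AWS calls. The function name `plan_multipart_upload` and the 64 MiB preferred part size are assumptions for illustration, and the limits used (parts between 5 MiB and 5 GiB, at most 10,000 parts per upload) should be checked against the current Amazon S3 documentation.

```python
MIB = 1024 ** 2
GIB = 1024 ** 3

# Multipart upload limits assumed from the Amazon S3 documentation.
MIN_PART_SIZE = 5 * MIB   # applies to every part except the last
MAX_PART_SIZE = 5 * GIB
MAX_PARTS = 10_000

def plan_multipart_upload(file_size, preferred_part_size=64 * MIB):
    """Return (part_size, part_count) in bytes/parts for a file of file_size bytes."""
    if file_size <= 0:
        raise ValueError("file_size must be positive")
    # Smallest part size that keeps the upload within MAX_PARTS (integer ceiling).
    needed = -(-file_size // MAX_PARTS)
    part_size = max(preferred_part_size, needed, MIN_PART_SIZE)
    if part_size > MAX_PART_SIZE:
        raise ValueError("file is too large for a single multipart upload")
    part_count = -(-file_size // part_size)
    return part_size, part_count

# A 1 TB (10**12 bytes) file needs parts of at least 100,000,000 bytes
# to stay within the 10,000-part limit.
one_tb_plan = plan_multipart_upload(10 ** 12)
small_plan = plan_multipart_upload(100 * MIB)
```

In a scripted upload, each planned part would still have to be sent, retried on failure, and finalized, which is the kind of manual work that becomes cumbersome across many files and folders.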
Polly File Transfer eliminates the need for programmatic file uploads and facilitates a fast, secure, and reliable transfer process that is accessible to all stakeholders in an organization.
How this works:
Transfer multiple files and folders (up to 1 TB) simultaneously between Polly Workspaces and your local directories through a point-and-click GUI, without using a command-line interface. The tool also connects Workspaces with desktop applications like El Maven and is compatible with multiple environments: Windows, Mac, and Ubuntu.
This desktop application can be installed on a local system and launched using Polly credentials, ensuring that only authorized Polly users have access. Further, the application blocks file transfers to and from workspaces with “View Only Access”, thereby preventing the unauthorized transfer of files and folders.