Data-driven Target Identification Using Public Datasets

Elucidata’s mission is to accelerate drug discovery by using a data-driven approach. In line with this, over the last 4 years, we have worked with several partners to improve the understanding of biological systems and design novel therapies. These organizations include big and small pharmaceutical companies, early-stage biotech startups and academic labs with diverse research interests.

A consistent trend we have noticed while working with our partners is the increasing use of public datasets. More than 70% of our projects involve significant analysis of one or more public datasets. This is even more true for biotech startups, which have fewer resources to conduct their own experiments.

Using Relevant Public Datasets Helps Shorten Timelines

Much has been said about the expanse of publicly available biological data. Data accumulation by EMBL-EBI has increased by more than 7 orders of magnitude in less than 10 years. TCGA, GTEx, GEO, Metabolomics Workbench, and PRIDE are some of the most popular public resources that generate and/or aggregate biological data.

In supporting drug programs over the past couple of years, we have seen both the benefits of using public data to supplement a research effort and the challenges that come with it. Public data allows scientists to ask questions and generate hypotheses without investing time, money and other resources in generating their own data. Public datasets are also often used to support the hypotheses and findings of independent experiments.

For discovery programs working on a tight budget, using published datasets effectively and efficiently can be the difference between moving to the next step and dying a natural death.

Finding and Using Relevant and High-quality Datasets Is a Challenge

Despite the availability of numerous molecular data resources, they aren’t used to their full potential. The biggest roadblock in getting started is identifying the publications and datasets most relevant to your context in the massive sea of data out there. The search capabilities provided by data repositories return thousands of related hits, and sifting through them requires significant manual effort.

It could take weeks for a scientist to analyze all the datasets and find the most relevant one. Machine learning techniques coupled with a scalable technology platform can reduce this to minutes.

Even once a scientist has read tens of papers and identified the most relevant datasets, challenges around data handling, storage, analysis, and integration follow. The size and complexity of biological data call for extensive in-house data storage infrastructure, computing resources and analytical expertise to mine it for meaningful insights. This is often a big ask of academic labs and small biotech companies.

Public portals such as cBioPortal, GEPIA, ARCHS4, TCGA Firehose, and the UCSC Xena browser are recent efforts to overcome these challenges. They provide easy access to datasets as well as the results of pre-defined analysis pipelines run on them, and eliminate the need for in-house resources to perform a preliminary analysis. Each of these tools and portals is, however, rigid in its own way and lacks the ability to customize an analysis to the specific needs of a project, which is often a requirement.

Unified Recommendation Engines for Public Datasets Might Be the Answer

So how do we solve these challenges? Efforts such as CREEDS and GEM-TREND can already recommend signatures (sets of genes with a characteristic expression pattern) and datasets, provided a scientist has data of their own. However, these algorithms do not take metadata into account, which hampers the specificity of the dataset search. Efforts such as CREEDS have shown that manual curation combined with algorithms works much better than algorithms alone. A machine learning solution that takes into account both metadata and high-quality manual curation would then provide far better results than recommendations based on data alone. We need a platform that can integrate different repositories, enable manual curation, create personalized models and enable powerful analysis in one click.
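To make the idea of combining data with metadata concrete, here is a minimal sketch in Python. It is not how CREEDS, GEM-TREND or Polly work. It assumes, purely for illustration, that each dataset carries a gene signature and a few curated metadata fields, and it blends a simple gene-set overlap with a metadata match. The accessions, field names and the 50/50 weighting are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Dataset:
    accession: str   # hypothetical identifier, not a real repository accession
    signature: set   # genes with a characteristic expression pattern
    metadata: dict = field(default_factory=dict)  # manually curated annotations


def jaccard(a, b):
    """Overlap between two gene sets, from 0.0 (disjoint) to 1.0 (identical)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def metadata_match(wanted, actual):
    """Fraction of the requested metadata fields that the dataset matches."""
    if not wanted:
        return 0.0
    hits = sum(
        1 for key, value in wanted.items()
        if actual.get(key, "").lower() == value.lower()
    )
    return hits / len(wanted)


def score(query_signature, wanted_metadata, dataset, data_weight=0.5):
    """Weighted blend of data similarity and metadata match (assumed weighting)."""
    data_part = jaccard(query_signature, dataset.signature)
    meta_part = metadata_match(wanted_metadata, dataset.metadata)
    return data_weight * data_part + (1.0 - data_weight) * meta_part


def recommend(query_signature, wanted_metadata, datasets, data_weight=0.5, top_n=3):
    """Return the top_n datasets, best score first."""
    ranked = sorted(
        datasets,
        key=lambda d: score(query_signature, wanted_metadata, d, data_weight),
        reverse=True,
    )
    return ranked[:top_n]


# Toy inputs: the scientist's own signature and the context they care about.
query = {"TP53", "MYC", "EGFR", "KRAS"}
wanted = {"tissue": "lung", "disease": "adenocarcinoma"}
datasets = [
    Dataset("DS_A", {"TP53", "MYC", "EGFR", "KRAS"},
            {"tissue": "liver", "disease": "hepatitis"}),
    Dataset("DS_B", {"TP53", "EGFR", "KRAS", "ALK"},
            {"tissue": "Lung", "disease": "Adenocarcinoma"}),
    Dataset("DS_C", {"BRCA1"},
            {"tissue": "lung", "disease": "fibrosis"}),
]

for d in recommend(query, wanted, datasets):
    print(d.accession, round(score(query, wanted, d), 2))
```

In this toy example, ranking on data alone (data_weight=1.0) puts the liver dataset DS_A first because its signature is identical to the query. Once metadata counts for half the score, the lung adenocarcinoma dataset DS_B moves to the top, which is the kind of specificity gain the paragraph above argues for.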

A platform like Polly can return a short list of publications based on the user’s personalized history and rank them in order of relevance. Instead of looking at thousands of search results, the scientist would look at results curated for them by the platform. The platform could also take the local ecosystem into account, for example what peers in the same lab are reading. Once the user selects the relevant datasets, the platform would suggest analysis tools and pipelines that can be run on the data. The user would be able to run them in one click without spending hours wrangling data, while retaining complete control over the parameters the pipelines use.
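As a rough illustration of ranking by personal and lab history, the sketch below builds a word-count profile from titles a user has read, adds a down-weighted contribution from titles their peers have read, and ranks candidate publications by cosine similarity to that profile. This is an assumption-laden toy, not a description of Polly: the titles, the peer_weight value and the bag-of-words representation are all hypothetical, and a real system would likely use richer features and remove stop words.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase a title and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def profile(titles):
    """Bag-of-words counts over a list of titles."""
    counts = Counter()
    for title in titles:
        counts.update(tokenize(title))
    return counts


def cosine(a, b):
    """Cosine similarity between two sparse term-weight mappings."""
    dot = sum(weight * b[term] for term, weight in a.items() if term in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank(candidates, own_history, peer_history, peer_weight=0.5, top_n=2):
    """Rank candidate titles against the user's history plus weighted peer history."""
    interest = profile(own_history)
    for term, count in profile(peer_history).items():
        interest[term] += peer_weight * count  # peers count less than the user
    scored = [(cosine(interest, profile([title])), title) for title in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [title for _, title in scored[:top_n]]


# Hypothetical reading histories and search results.
own_history = [
    "KRAS mutations in lung adenocarcinoma",
    "EGFR inhibitor resistance in lung cancer",
]
peer_history = ["Single-cell RNA-seq of lung tumors"]
candidates = [
    "Soil bacteria metagenomics",
    "Gut microbiome diversity in mice",
    "Transcriptomic profiling of lung adenocarcinoma with KRAS mutations",
]

for title in rank(candidates, own_history, peer_history):
    print(title)
```

Setting peer_weight to 0 ignores the lab entirely, while a larger value lets what colleagues are reading pull the ranking toward shared interests.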

A platform which solves these challenges will be able to inform the research efforts of a drug program most effectively and help identify potential targets with faster iterations. In our view, such a platform would help meet our mission of accelerating drug discovery by using data.
