About Elucidata: The CEO Speaks

Earlier this year, our co-founder and CEO, Abhishek Jha, was the guest of honor during an Ask Me Anything-style Q&A session on Slack hosted by Bits in Bio. Bits in Bio is a community devoted to people who build tools that help scientists unlock new insights. During these conversations, participants interview fellow community members about their companies, work, and plans for the future.

In part one of this two-part series, Abhishek gave an overview of our mission and shared his inspiration for founding Elucidata. In part two, the group dove into the role of machine learning in drug discovery and the challenges associated with applying these technologies.

Please note that some of the questions have been paraphrased for clarity and reordered to improve the flow of this article.

Part one: What is Elucidata?

What does Elucidata do?

We clean and link biomedical data at scale. We are advocating for a data-centric approach to AI. In very simple terms, this approach holds that clean data is more valuable than more data. Of course, in an ideal world, you would have both. But we live in the real world, and often the data we use is not structured or harmonized. Our technology solves that problem for biopharma companies.

How did it get started? How did you meet your co-founders?

We founded Elucidata in 2015, right after my 5.5 years at Agios Pharmaceuticals. I met my co-founders Swetabh Pathak and Dick Kibbey during the course of my work at Agios. Swetabh has prior experience in building scalable tech and organizations. Dick is an MD/PhD professor at Yale.

Can you tell us more about your computational background? How did you get into the software side of things?

I did my PhD at UChicago and a postdoc at MIT. All of it was focused on building computational models to study proteins and, later, systems-level models of the immune system. At Agios, I continued computational work. There, it was more focused on integrating different types of data to understand the clinical phenotype.

I was always involved with the software side of things in some shape or form: writing it, improving it, or using it. More fundamentally, I see myself as a consumer of the technology and services that Elucidata provides. I would have benefited a lot from our offerings when I was at Agios.

How did your experiences at Agios motivate you to start Elucidata? What were the problems you saw there?

I can talk about it for hours. I would analyze omics data, and most of my time went to cleaning the files… things like putting the data in the right structure. Lots of scrubbing column and row names, R objects, data frames, Excel pivots, you name it. But what I would present to my colleagues was analysis: something simple like PCA, or something more sophisticated like a classification model (which cell lines respond vs. which do not). That is what was valuable for my team, and that is what brought "glory" to me. What was underrecognized was that most of my time was going to cleaning data, and I was doing it in a highly non-scalable fashion. This was a big problem that I experienced firsthand, and we are taking a shot at solving it at Elucidata.

I would be asked, “Oh, just upload and clean some data useful for our experiments,” but it took so much work to wrangle and clean just one dataset. If there were many useful datasets, it would be unimaginable to add platoons of data scientists to grind through cleaning all of them. So I became determined to build a tool to fix this problem, applying the machine learning and natural language processing that emerged in the 2010s to simplify the task of data wrangling. All that ML/NLP work turned into Elucidata.

You’ve mentioned that data isn’t “clean”. What makes data clean and how is Elucidata cleaning it?

We have automated the cleaning process. Human experts are involved to build the training data sets and continuously monitor the performance of our NLP models. Here are some details:

First is the data engineering step. We ingest public-domain data that arrive in a variety of tabular file formats (TSV, matrix files, VCF, RDS), depending on the data type. Those data are then transformed into a consistent, standard tabular schema, which is usually GCT, or H5AD for single-cell data. This conversion of various file formats to a consistent file format is the main data standardization piece.

Next is the metadata enrichment step. Ingested datasets are mapped with relevant metadata about the experiment (the drug used, tissue, cell lines, disease condition, etc.) at three levels: the overall dataset, the samples used in the experiments, and the features. We also ensure that consistent vocabularies/ontologies are used in the metadata annotation process, and each dataset is processed with uniform molecular identifiers.

Finally, the data is ready for consumption on Polly. Polly is our cloud-based platform where datasets, along with their metadata files, are packaged, analysis-ready, and stored in an OmixAtlas. From here, users can either search/filter for relevant datasets on the OmixAtlas UI or query programmatically, using curated metadata fields. They can also start analyzing these datasets through third-party integrations, or use them to train models.
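
To make the first two steps concrete, here is a minimal Python sketch. It is an illustration only and not Elucidata's code: the function names, the in-memory record layout, and the tiny synonym table (standing in for a real ontology) are all assumptions, and the output is a plain dictionary rather than GCT or H5AD. The sketch reads the same expression table from two differently delimited files, puts both into one schema with uniform gene identifiers, and maps free-text tissue labels onto a shared vocabulary.

```python
import csv
import io

# Hypothetical synonym table; a real pipeline would consult an ontology.
TISSUE_SYNONYMS = {
    "hepatic": "liver",
    "liver": "liver",
    "pulmonary": "lung",
    "lung": "lung",
}


def read_matrix(text, delimiter):
    """Parse a delimited table into {feature: {sample: value}}."""
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    header = next(reader)
    samples = header[1:]  # first column holds the feature name
    matrix = {}
    for row in reader:
        feature = row[0].strip().upper()  # uniform identifier casing
        matrix[feature] = {s: float(v) for s, v in zip(samples, row[1:])}
    return matrix


def normalize_tissue(raw):
    """Map a free-text tissue label to the shared vocabulary."""
    return TISSUE_SYNONYMS.get(raw.strip().lower(), "unknown")


def build_record(text, delimiter, sample_tissues):
    """Combine the standardized matrix with harmonized sample metadata."""
    return {
        "matrix": read_matrix(text, delimiter),
        "sample_metadata": {
            sample: {"tissue": normalize_tissue(tissue)}
            for sample, tissue in sample_tissues.items()
        },
    }


# The same data, delivered in two formats with inconsistent labels.
tsv_text = "gene\ts1\ts2\ntp53\t1.5\t2.0\n"
csv_text = "gene,s1,s2\nTP53,1.5,2.0\n"
record_a = build_record(tsv_text, "\t", {"s1": "Hepatic", "s2": "lung "})
record_b = build_record(csv_text, ",", {"s1": "liver", "s2": "Pulmonary"})
```

After standardization and harmonization, the two records are identical, which is what makes datasets from different sources comparable and searchable.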

Do you have to specialize Elucidata’s curation models in a specific field or does it work in different settings?

We build a separate NLP model for each entity. For example, Disease has one model, Cell Line has another, and so on. The process is similar for clinical annotation models: Drug, Dosage, Strength, and Frequency, for example, would be four different models.
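
The one-model-per-entity design can be pictured as a registry that maps each entity type to its own annotator. The Python sketch below is an assumption-laden illustration, not Elucidata's implementation: simple keyword matchers stand in for trained NLP models, and the names `make_keyword_tagger`, `ENTITY_MODELS`, and `annotate`, along with the example vocabularies, are hypothetical.

```python
import re


def make_keyword_tagger(vocabulary):
    """Build a toy 'model' that finds known terms in text.

    This is a stand-in for a trained NLP model for one entity type.
    """
    alternatives = "|".join(re.escape(term) for term in vocabulary)
    pattern = re.compile(r"\b(" + alternatives + r")\b", re.IGNORECASE)

    def tag(text):
        # Return the distinct matched terms, lowercased and sorted.
        return sorted({m.group(1).lower() for m in pattern.finditer(text)})

    return tag


# One independent model per entity type (hypothetical vocabularies).
ENTITY_MODELS = {
    "disease": make_keyword_tagger(["breast cancer", "glioma"]),
    "cell_line": make_keyword_tagger(["MCF-7", "HeLa"]),
    "drug": make_keyword_tagger(["tamoxifen"]),
}


def annotate(text):
    """Run every entity-specific model over the same text."""
    return {entity: model(text) for entity, model in ENTITY_MODELS.items()}


result = annotate("MCF-7 breast cancer cells treated with tamoxifen")
```

Keeping the models separate means each one can be retrained, evaluated, or replaced without touching the others.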

Can you talk more about Polly? What is it and how would a user interact with it?

Polly is our cloud-based platform that provides clean and linked biomedical data for consumption. We have a GUI, but we have been deliberate about being a code-first platform. The code-first approach allows us to integrate with tools like Spotfire, SageMaker, etc., as well as run powerful queries on the data (based on samples, features, and datasets).
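
As a rough illustration of what a query across dataset-level and sample-level metadata looks like in code, consider the Python sketch below. It does not use Polly's actual API: the catalog structure, the field names, and the `find_samples` helper are all hypothetical and chosen only to show the idea.

```python
# Hypothetical in-memory catalog with dataset-level and sample-level metadata.
DATASETS = [
    {
        "dataset_id": "D1",
        "disease": "glioma",
        "samples": [
            {"sample_id": "s1", "tissue": "brain"},
            {"sample_id": "s2", "tissue": "brain"},
        ],
    },
    {
        "dataset_id": "D2",
        "disease": "breast cancer",
        "samples": [{"sample_id": "s3", "tissue": "breast"}],
    },
    {
        "dataset_id": "D3",
        "disease": "glioma",
        "samples": [{"sample_id": "s4", "tissue": "blood"}],
    },
]


def find_samples(datasets, disease, tissue):
    """Filter on a dataset-level field, then on a sample-level field."""
    return [
        (dataset["dataset_id"], sample["sample_id"])
        for dataset in datasets
        if dataset["disease"] == disease
        for sample in dataset["samples"]
        if sample["tissue"] == tissue
    ]


hits = find_samples(DATASETS, "glioma", "brain")
```

A filter like this only works when the metadata fields use consistent values across datasets, which is why the curation steps matter.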

What is the long-term vision for Elucidata, and what are some of the areas you see the company expanding in?

Shameless plug: We just announced our Series A, led by F-Prime and Eight Roads. That puts us in a very good position to double down on our technology and expand. We continuously hear about similar problems in cleaning and linking pharmacology data, clinical trial data, manufacturing data, etc. We have just started to scratch the surface. R&D data (more specifically, tabular and text data) at large has been underserved. We are hoping to serve that community in the months and years to come.

Given that you have been in the industry for 7 years and made a pivot that paid off really well, could you share what the major learnings have been on the market needs? How has the industry changed over time and what are some of the main trends you are equipping yourself for?

The industry has changed and continues to change quite dramatically. You see more companies invest both resources and expertise to take care of their data needs (post-data generation). Some needs and solutions are more established than others, like LIMS for example. But the overwhelming takeaway for us has been that we are still in the very early days of adoption of contemporary data practices and technology. That is where we see a place for Elucidata and a number of other exciting companies!

Special thanks to the Bits in Bio Community for participating in this conversation. We look forward to many more!