Data scientists are aware of the iterative nature of their field. Writing code, experimenting extensively with data, and applying different techniques to the same data form the core of a data science team’s activities. A key part of these activities is data-driven hypothesizing and validation while iterating over multiple data sets and algorithms for months.
Unsurprisingly, ‘reproducibility’ has become an important consideration, one that merits streamlined processes and systematic checks.
Consider the case of a 6-month project where everything goes smoothly for the first 3 months. Three data scientists, analyzing the data independently but building on each other’s work, propose a hypothesis. The team relies on the last analysis and its results, produced by data scientist A, and decides to pursue the hypothesis and validate it. Another month goes by in which effort is put only into validating the hypothesis. Just as we think everything is hunky-dory:
- A team member (data scientist B) points out that the initial hypothesis needs to be validated again: a new data set shows effects contrary to it, effects far beyond comprehension.
- The vigilant team member goes over the git repository, finds the code for the initial analysis that led to the hypothesis, and tries to produce the same file, but the file now produced has different results from the one that was used to propose the hypothesis.
- Alarm bells start ringing and, after much investigation, it is concluded (although not with certainty) that the file produced by data scientist A was modified by a piece of updated code that was never committed and resides somewhere on data scientist A’s laptop.
Meanwhile, three months of effort are wasted chasing smoke and mirrors. This unfortunate turn of events, predicated on human error, could have been prevented if data scientist A’s analysis had been reproducible. The need for every stage of a data science process to be reproducible becomes clear.
What does it mean for an analysis to be reproducible?
Much has been said about why reproducibility is important. However, what does it mean to analyze data reproducibly?
In short, the code for a reproducible analysis should be a self-contained entity that needs no outside intervention, such as data residing only on an individual’s local machine. As a starting point, we defined an analysis to be reproducible if:
- It uses data from a remote file storage server (like Drive or S3) and is able to dynamically download that data from the server.
- Results are stored in simple file formats (like CSV or JSON), and we are able to upload these files to the remote file storage.
- The variables/objects of the analysis (used to create plots, for example) are saved in RData or pickle files, and these files, too, are uploaded to the remote file storage.
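The criteria above can be sketched with a few standard-library helpers. This is a minimal illustration, not Elucidata’s actual tooling: the remote-storage function `fetch_remote` is a hypothetical placeholder for a real client such as boto3, while the result- and object-saving steps use only `json` and `pickle`.

```python
import json
import pickle
from pathlib import Path

def fetch_remote(key: str, dest: Path) -> Path:
    # Criterion 1 (placeholder): dynamically download `key` from
    # remote storage (S3/Drive) instead of reading a local copy.
    raise NotImplementedError("wire up your storage client here")

def save_results(results: dict, out_dir: str) -> Path:
    # Criterion 2: persist results in a simple, inspectable format (JSON).
    path = Path(out_dir) / "results.json"
    path.write_text(json.dumps(results, indent=2, sort_keys=True))
    return path

def save_objects(objects: dict, out_dir: str) -> Path:
    # Criterion 3: pickle the analysis objects (for plots, etc.) so a
    # later session can rebuild figures without rerunning the analysis.
    path = Path(out_dir) / "objects.pkl"
    with path.open("wb") as fh:
        pickle.dump(objects, fh)
    return path
```

The saved files would then be uploaded back to the same remote storage, so that nothing the analysis depends on, or produces, lives only on one laptop.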
Perhaps the goal of 100 percent reproducibility will never be reached. There will always be cases where humans (and machines) fail to notice errors.
At Elucidata, we strive to create order in whatever little dimension of space we are afforded. We also aim to push automation to the limit. So who better than a machine to make sure that all of our analyses are reproducible?
With this goal in mind, we set about building a series of automated checks that would make data science processes at Elucidata highly reproducible. Creating this system would also allow our data scientists to hypothesize and mull over significant biological questions.
Knowledge repo for reproducible data science
Data suggests (we kid!) that the most common conversations around the water coolers of a data science company are about where a data set can be found. When collaborations around the office begin at the water cooler, you know things need to change.
With the exponential rate at which data sets are analyzed, a significant “knowledge leak” starts to creep in. These leaks are often a result of poor practices including distributed file storage, buggy code, and infrequent reviews.
With the goal of plugging this leakage, we decided to build a system on top of Airbnb’s Knowledge Repo to solve a looming threat in many research-focused data science teams.
Record and share every analysis with knowledge repo
Knowledge repos (KRs) are a great solution for sharing and collaborating on analyses (specifically IPython notebooks and markdown files). Knowledge Repo makes use of git infrastructure to make this possible. There are three core concepts that make KRs a powerful resource for reproducible analyses at Elucidata.
- Every analysis starts from a source IPython notebook or R Markdown file, which is converted into a markdown post.
- Every analysis is a git branch in the knowledge repo.
- A repository is a collection of posts (notebooks converted to markdown) that can be hosted somewhere and accessed by everyone.
A knowledge repo is just a git repo where every branch represents an analysis, containing an IPython notebook and its markdown form. Polly’s KR platform allows us to host these knowledge repositories so that all posts can be easily viewed by anyone.
What Knowledge Repo gave us was a seamless, hassle-free way of committing code. Although an IPython notebook can be committed to a normal git repository, Knowledge Repo provides a way of organizing multiple notebooks, with the added functionality of titles, tags, and authors for every post, and sorts these posts chronologically.
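For reference, Knowledge Repo carries this metadata in a YAML header at the top of each post’s source notebook, along the lines of the sketch below (the field names follow the Knowledge Repo conventions; the values here are invented for illustration):

```yaml
---
title: Differential expression under drug X
authors:
- data_scientist_a
tags:
- hypothesis
- rnaseq
created_at: 2019-06-01
updated_at: 2019-06-15
tldr: One-line summary shown in the post feed.
---
```

The `tags` and `created_at` fields are what make the filtering and chronological sorting of posts possible.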
Now, with the end goal of reproducibility on the horizon, our ideal solution looked like a tool which could assess if an analysis notebook fulfilled the criteria defined earlier. The solution that came to mind was continuous integration.
Continuous integration for knowledge repo
Although continuous integration is used extensively in software engineering, it is seldom employed by data science teams doing day-to-day analysis. When data science teams borrow processes from engineering, they are almost always left better off. We therefore wanted to create a continuous integration pipeline for the knowledge repo that would indicate whether a committed notebook passes certain tests before it can be called reproducible.
Knowledge Repo’s elegant ‘one branch, one notebook’ design meant tests needed to run only when a new branch was created, in other words, whenever a new analysis was added by a team member. Since this tool’s goal is to watch out for potential breaches of reproducibility, we named it the ‘Data Science Watchdog’.
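The post does not show the Watchdog’s internals, but one check implied by the criteria earlier can be sketched in a few lines: scan a committed notebook’s code cells for absolute local paths, which would tie the analysis to one person’s machine. The function name and the path patterns below are illustrative assumptions, not the actual Watchdog code.

```python
import json
import re

# Flags quoted absolute paths (macOS, Linux, Windows) in notebook code
# cells -- a sign the analysis reads data from a local machine rather
# than remote storage. Patterns here are an illustrative subset.
LOCAL_PATH = re.compile(r"""['"](?:/Users/|/home/|[A-Z]:\\)""")

def local_path_violations(notebook_json: str) -> list:
    """Return (cell_index, offending_line) pairs for a .ipynb file's JSON."""
    nb = json.loads(notebook_json)
    violations = []
    for i, cell in enumerate(nb.get("cells", [])):
        if cell.get("cell_type") != "code":
            continue  # markdown cells cannot break reproducibility
        for line in cell.get("source", []):
            if LOCAL_PATH.search(line):
                violations.append((i, line.strip()))
    return violations
```

A CI job can fail the build whenever this list is non-empty, turning the reproducibility criteria into an automatic gate instead of a code-review convention.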
The results of the tests are indicated on Bitbucket as soon as a commit is made to the repository, and the reasons why a certain test failed can be checked on Circle CI’s interface.
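The post does not include the pipeline configuration, but a CircleCI setup of this kind could look roughly like the sketch below (CircleCI 2.x syntax; the `check_posts.py` entry point and `requirements.txt` are hypothetical stand-ins for the actual Watchdog scripts):

```yaml
version: 2
jobs:
  watchdog:
    docker:
      - image: circleci/python:3.6
    steps:
      - checkout
      - run: pip install --user -r requirements.txt
      # Hypothetical entry point running the reproducibility checks
      # on the notebook committed in this branch.
      - run: python watchdog/check_posts.py
workflows:
  version: 2
  on_commit:
    jobs:
      - watchdog
```

Because each analysis lives on its own branch, every push triggers the checks for exactly one notebook, and the pass/fail status surfaces next to the commit on Bitbucket.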
With this process in place in our projects, we hope not just to check for reproducible code but to encourage our data scientists to follow best practices in their analyses. Working as a team is not just about doing one’s own job but also about making others’ jobs easier. Although the Data Science Watchdog is currently at a nascent stage, we are developing it further by expanding our definition of reproducible science through continued automation. Seemingly small problems in data science processes can have huge impacts in the long term, and the solution may lie in using engineering concepts tailored for data science. By sharing this solution, we hope to motivate data science teams in other organizations to come up with ingenious solutions to their own problems.