Jupyter Notebooks and reproducible data science

Introduction

One of the ideas pitched by Daniel Mietchen at the London Open Research Data do-a-thon for Open Data Day 2017 was to analyse Jupyter Notebooks mentioned in PubMed Central. This is a potentially valuable exercise because these notebooks are an increasingly popular tool for documenting the data science workflows used in research, and therefore play an important role in making the relevant analyses replicable. Daniel Sanz (DS) and I spent some time exploring this with help from Daniel Mietchen (DM). We’re reasonably experienced Python developers, but definitely not Jupyter experts.

Methodology

Ross Mounce had kindly populated a spreadsheet with metadata for the 107 papers returned by a query of Europe PMC for “jupyter OR ipynb”. My initial thought was that analysing the validity of the notebooks would simply involve searching the text of each article for a notebook reference, then downloading the notebook and executing it using nbconvert in a headless sandbox (roughly the check sketched below). This could be turned into a service providing a badge, for inclusion in the relevant repository’s README, indicating whether the notebook had been verified. It turned out that this was hopelessly naive, but we learnt a lot in discovering why. Here’s what we found about how notebooks are being published:
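The naive check itself is straightforward to express in code. The sketch below is illustrative only: the notebook URL is a placeholder, and the timeout and kernel name are assumptions rather than settings we actually used.

    import requests
    import nbformat
    from nbconvert.preprocessors import ExecutePreprocessor

    # Placeholder URL; in practice this would be scraped from the article text.
    NOTEBOOK_URL = "https://example.org/path/to/analysis.ipynb"

    def verify_notebook(url, timeout=600):
        """Download a notebook and run every cell, top to bottom, with nbconvert."""
        response = requests.get(url)
        response.raise_for_status()
        notebook = nbformat.reads(response.text, as_version=4)
        executor = ExecutePreprocessor(timeout=timeout, kernel_name="python3")
        executor.preprocess(notebook, {"metadata": {"path": "."}})  # raises on failure

    if __name__ == "__main__":
        try:
            verify_notebook(NOTEBOOK_URL)
            print("verified")
        except Exception as exc:
            print("failed:", exc)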

We both tried downloading and executing a small sample of notebooks from our list, with very limited success. DS therefore continued systematically trying the other notebooks on the list and documenting the range of errors encountered (see the “Problem” column of the spreadsheet), whilst I took a particularly complex notebook (from this article) and attempted to troubleshoot it to completion. By doing this, we learnt the following about Jupyter:
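For the systematic pass, the check above just needs to be repeated across the list while recording what went wrong each time. The sketch below shows that kind of batch check; the URL list and the way problems are recorded are illustrative assumptions, not the actual spreadsheet workflow.

    import requests
    import nbformat
    from nbconvert.preprocessors import ExecutePreprocessor, CellExecutionError

    # Hypothetical notebook URLs; in practice these came from the spreadsheet.
    notebook_urls = [
        "https://example.org/paper-1/analysis.ipynb",
        "https://example.org/paper-2/figures.ipynb",
    ]

    problems = {}  # URL -> short description of what went wrong

    for url in notebook_urls:
        try:
            response = requests.get(url, timeout=30)
            response.raise_for_status()
            notebook = nbformat.reads(response.text, as_version=4)
            executor = ExecutePreprocessor(timeout=600)
            executor.preprocess(notebook, {"metadata": {"path": "."}})
            problems[url] = "OK"
        except requests.RequestException as exc:
            problems[url] = "download failed: %s" % exc
        except CellExecutionError:
            problems[url] = "a cell raised an error during execution"
        except Exception as exc:  # missing kernel, invalid notebook format, etc.
            problems[url] = "other error: %s" % exc

    for url, problem in problems.items():
        print(url, "->", problem)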

Some authors have taken the laudable approach (and made the extra effort) of bundling their notebook(s) inside containers, typically using Docker. This is very effective in implicitly documenting, via a Dockerfile, the required runtime platform, system packages, Python environment and even third-party software. It is also the approach taken by the authors of the complex notebook we analysed. We tried to build and run the relevant container from scratch, which demonstrated the following:
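For reference, the mechanics of that build-and-run step boil down to a couple of Docker commands, shelled out to in the sketch below. The build context, image tag, notebook path inside the image and the jupyter command are assumptions based on a fairly typical Jupyter Dockerfile, not details of the specific paper.

    import subprocess

    # Hypothetical names: the directory holding the authors' Dockerfile, the tag
    # we give the image, and the notebook path baked into the image.
    BUILD_CONTEXT = "paper-repo/"
    IMAGE_TAG = "paper-notebook:latest"
    NOTEBOOK_IN_IMAGE = "/home/jovyan/analysis.ipynb"

    # Build the image; the Dockerfile pins the runtime platform, system packages,
    # Python environment and any third-party tools.
    subprocess.run(["docker", "build", "-t", IMAGE_TAG, BUILD_CONTEXT], check=True)

    # Execute the notebook headlessly inside the container; --rm removes the
    # container afterwards, and check=True turns a failed cell into an exception.
    subprocess.run(
        ["docker", "run", "--rm", IMAGE_TAG,
         "jupyter", "nbconvert", "--to", "notebook", "--execute",
         "--output", "executed.ipynb", NOTEBOOK_IN_IMAGE],
        check=True,
    )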

This whole exercise was hugely valuable in improving our understanding of Jupyter and of the challenges facing notebook authors. Based on our experiences, and acknowledging that non-trivial analyses may depend on large numbers of packages and large amounts of source data, we recommend the following approach to publishing notebooks:
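Whatever the exact recommendations, recording a little environment metadata at execution time makes a notebook far easier for someone else to rerun. The snippet below is a generic illustration of that idea rather than one of our specific recommendations; the output file name is arbitrary.

    import json
    import platform
    import subprocess
    import sys

    def record_environment(path="environment.json"):
        """Save the interpreter version and installed packages alongside the notebook."""
        frozen = subprocess.check_output(
            [sys.executable, "-m", "pip", "freeze"]).decode().splitlines()
        snapshot = {
            "python": platform.python_version(),
            "platform": platform.platform(),
            "packages": frozen,
        }
        with open(path, "w") as handle:
            json.dump(snapshot, handle, indent=2)

    record_environment()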

Next steps

This was a one-day project, so there is plenty of potential follow-up work. Here are some ideas:

Conclusions

Acknowledgements

Many thanks to SPARC and the NIH for funding the event at which we performed this investigation, and to Daniel Mietchen and Ross Mounce for their advice and guidance.

Notes

This investigation was very much a collaborative effort. Any errors or omissions in this write-up are my own.

Further reading

Updates

07 March 2017: A repository demonstrating automatic verification via CI is now available.