Notebook Reproducibility Project

Hey Pangeo folk,

Been lurking here a little while so thought it’s time to introduce myself:

I’m Alex and have been a backend engineer for the last 7 years in consumer tech.

Over the last couple of years I’ve become interested in helping [data] scientists with engineering issues, to the point that I have built a notebook continuous integration framework on GitHub Actions.

Treebeard containerises and runs notebooks in GitHub – and should help you achieve reproducibility.

Feel free to message me with questions on this topic!

Thanks for sharing Alex. Treebeard looks like an interesting project.

What control do users get over how / where the notebooks are executed? Pangeo’s perhaps unusual (but probably not unique) requirements are around:

  1. Data access (e.g. requester pays buckets on S3 / GCP)

  2. Scalable computation (with a Dask cluster).

https://github.com/pangeo-gallery/ is what we’ve put together to power https://gallery.pangeo.io/. Specifically, https://github.com/pangeo-gallery/binderbot executes repositories of notebooks on Pangeo’s Binder, which is able to spawn Dask clusters.

Oh wow, that is interesting. Didn’t realise that Pangeo’s Binder allowed Dask usage like this.

GitHub Actions is the execution platform we support. They provide a secret store and support for self-hosted runners, which let you authenticate with external services or run on custom hardware.

Treebeard has a mechanism for passing secrets from GitHub Actions into the repo2docker container if necessary (e.g. for pulling from S3).
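As a rough sketch of what the notebook side of that could look like (this is not Treebeard’s actual API): assuming the secrets end up as environment variables inside the container, a first cell can check they are present before any data access is attempted. The helper name `load_secrets` is hypothetical, and the standard AWS credential variable names are used purely as an example.

```python
import os

# Example names only: the standard AWS credential variables.
# Assumption: the CI workflow exports these into the notebook container.
REQUIRED_SECRETS = ("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY")


def load_secrets(names=REQUIRED_SECRETS, environ=None):
    """Return the named secrets from the environment, failing early if any are unset.

    `environ` defaults to os.environ; passing a plain dict makes this easy to test.
    """
    env = os.environ if environ is None else environ
    # Treat empty strings as missing, since an unset CI secret often expands to "".
    missing = [name for name in names if not env.get(name)]
    if missing:
        raise RuntimeError("missing secrets: " + ", ".join(missing))
    return {name: env[name] for name in names}
```

Failing in the first cell gives a clear error in the CI log instead of an opaque permissions failure halfway through the notebook.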

It sounds like your binderbot is getting the job done, but arguably it’s cleaner to just use CI for automating Dask cluster management and running those notebooks.

I too am a pragmatist though :slight_smile: