Notebook Reproducibility Project

alex-treebeard · August 6, 2020, 2:51pm

Hey Pangeo folk,

Been lurking here a little while so thought it’s time to introduce myself:

I’m Alex and have been a backend engineer for the last 7 years in consumer tech.

In the last couple of years I’ve become interested in helping [data] scientists with engineering issues, to the point that I have built a notebook continuous integration framework on GitHub Actions.

Treebeard containerises and runs notebooks in GitHub – and should help you achieve reproducibility.

Feel free to message me with questions on this topic!

TomAugspurger · August 6, 2020, 3:42pm

Thanks for sharing Alex. Treebeard looks like an interesting project.

What control do users get over how and where the notebooks are executed? Pangeo’s perhaps unusual (but probably not unique) requirements are around:

Data access (e.g. requester-pays buckets on S3 / GCP).
Scalable computation (with a Dask cluster).

https://github.com/pangeo-gallery/ is what we’ve put together to power https://gallery.pangeo.io/. Specifically, https://github.com/pangeo-gallery/binderbot executes repositories of notebooks on Pangeo’s binder, which can spawn Dask clusters.

alex-treebeard · August 6, 2020, 3:57pm

Oh wow, that is interesting. I didn’t realise that Pangeo’s binder allowed Dask usage like this.

GitHub Actions is the execution platform we support. It provides a secret store and support for self-hosted runners, which let you authenticate with external services or run on custom hardware.

Treebeard has a mechanism for passing secrets from GitHub Actions into the repo2docker container if necessary (e.g. pulling from S3).
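As a rough illustration of the general idea (not Treebeard's actual implementation, which isn't described here): a GitHub Actions step can expose secrets as environment variables, and a wrapper can then forward them by name into a container. The function name `docker_run_command`, the image name, and the choice of AWS variable names below are all assumptions made for this sketch.

```python
import os
import shlex


def docker_run_command(image, secret_names, environ=None):
    """Build a `docker run` argv that forwards selected secrets by name.

    Hypothetical helper for illustration only; not part of Treebeard.
    """
    environ = os.environ if environ is None else environ

    # Fail early if the CI job did not actually expose a requested secret.
    missing = [name for name in secret_names if name not in environ]
    if missing:
        raise KeyError(f"secrets not set in the CI environment: {missing}")

    cmd = ["docker", "run", "--rm"]
    for name in secret_names:
        # `-e NAME` with no `=value` tells docker to copy the value from the
        # calling process's environment, so the secret never appears in argv.
        cmd += ["-e", name]
    cmd.append(image)
    return cmd


if __name__ == "__main__":
    # Placeholder values standing in for what the CI secret store would inject.
    fake_env = {
        "AWS_ACCESS_KEY_ID": "example-id",
        "AWS_SECRET_ACCESS_KEY": "example-key",
    }
    cmd = docker_run_command("my-notebook-image", list(fake_env), fake_env)
    print(shlex.join(cmd))
```

Forwarding by name rather than by value keeps the secret out of the command line, and so out of anything that logs the command. Inside the container, a notebook would then read the credentials from its own environment as usual.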

It sounds like your binderbot is getting the job done, but arguably it’s cleaner to just use CI for automating Dask cluster management and running those notebooks.

I too am a pragmatist, though.
