Context
This post is an action item that emerged from a recent meeting organized by @maxrjones and attended by @jbednar, @clyne, @ktyle, and myself. The topic was how to integrate and harmonize the different “notebook examples” websites that exist in the Pangeo ecosystem. Specifically:
- https://earthml.holoviz.org/tutorial/index.html
- http://gallery.pangeo.io/
- https://cookbooks.projectpythia.org/
These projects all share similar goals:
- Provide a pretty public website with examples of real-world geoscience using Python tools
- Support an open contribution process, whereby the content can be expanded and updated via GitHub PRs
- Use CI to automatically execute the notebooks to ensure correctness
- Support different environments for different groups of notebooks
We all basically agreed that Pythia Cookbooks is the best foundation for the future, and those of us who maintain the other two sites agreed to migrate our content to Pythia Cookbooks. (This is already underway for some of Pangeo Gallery.) Most importantly, Pythia seems to have the most sustainable organization for long-term maintenance of the gallery infrastructure, given its affiliation with NCAR.
How does Pythia Cookbooks work?
(Someone from Pythia please feel free to offer corrections; this is just my rough summary.)
Each cookbook is a repo within the ProjectPythiaCookbooks organization
Each repo follows the same template
The template is based on Jupyter Book. Each repo contains notebooks, environment configuration, and CI scripts to automate building and deployment of the book. Each book is published via GitHub Pages to its own path within the cookbooks.projectpythia.org domain, for example: https://cookbooks.projectpythia.org/cesm-lens-aws-cookbook/
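As a small illustration of that publishing layout, here is a sketch that maps a cookbook repo name to its published URL. The function name `cookbook_url` is hypothetical, and the one-path-per-repo pattern is inferred only from the single example above, so treat it as an assumption rather than a documented rule.

```python
def cookbook_url(repo_name: str, domain: str = "cookbooks.projectpythia.org") -> str:
    """Return the assumed published URL for a cookbook repo.

    Assumption: each repo is served at https://<domain>/<repo-name>/,
    as in the cesm-lens-aws-cookbook example.
    """
    return f"https://{domain}/{repo_name.strip('/')}/"


if __name__ == "__main__":
    print(cookbook_url("cesm-lens-aws-cookbook"))
```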
It’s a very neat system for managing and automating publication of a large, crowdsourced collection of many Jupyter Books!
What’s Missing: A Build System for Compute- and Data-Intensive Books
We all agree that it’s important for such a system to execute the notebooks from scratch automatically (rather than allow users to check in executed notebooks) for the following reasons:
- It ensures the code actually works
- It ensures the outputs are consistent with the inputs
- It guarantees reproducibility of the results and facilitates a good experience on Binder
Pythia Cookbooks uses GitHub workflows to build the notebooks. (They even created a custom GitHub Action: ProjectPythiaCookbooks/cookbook-actions, a set of reusable workflows used by Project Pythia Cookbooks.) The problem is that the GitHub workflow runner has limited compute resources and can’t support “expensive notebooks.”
To work around this limitation in Pangeo Gallery, we developed Binderbot, which allowed us to leverage Binder within a CI job to outsource the job of running the notebook to a BinderHub. In the case of Pangeo Binder, this allowed us to have more resources, access data on the cloud directly (without paying egress fees), and even start up Dask Gateway clusters, all within our notebooks. This was a cool feature, but the way we wired it all together was a bit hacky. Also, Pangeo Gallery did not take advantage of Jupyter Book, which barely existed when we started.
Proposal: Develop the capability to build Pythia Cookbook books in any JupyterHub or BinderHub
This is a relatively general problem for the Jupyter / Jupyter Book community, and so an ideal solution would also be quite general. I will refrain from offering specific implementation ideas at this point. Suffice it to say, we could leverage the JupyterHub API to connect the Cookbook build system to any running Hub and execute the book build in that environment, rather than in the CI environment. That would allow the notebook to use any resources that are available in a particular Hub, including special data, GPUs, etc.
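To make the JupyterHub API idea slightly more concrete, here is a minimal sketch of the first step a build system might take: asking a Hub to start a server for a CI service account. This is not a proposed design. The helper names, the `ci-bot` user, and the example Hub URL are all hypothetical, and the sketch assumes the Hub serves its REST API at the default `/hub/api` prefix and accepts an API token. The requests are only constructed, not sent.

```python
import json
import urllib.request


def hub_request(hub_url, token, method, path, payload=None):
    """Build (but do not send) an authenticated JupyterHub REST API request.

    Assumption: the Hub exposes its REST API under the default /hub/api prefix.
    """
    data = json.dumps(payload).encode() if payload is not None else None
    req = urllib.request.Request(
        url=f"{hub_url.rstrip('/')}/hub/api{path}",
        data=data,
        method=method,
        headers={"Authorization": f"token {token}"},
    )
    if data is not None:
        req.add_header("Content-Type", "application/json")
    return req


def start_server_request(hub_url, token, username):
    # POST /users/{name}/server asks the Hub to spawn that user's server.
    return hub_request(hub_url, token, "POST", f"/users/{username}/server")


def user_status_request(hub_url, token, username):
    # GET /users/{name} returns the user model, which a client could poll
    # to find out whether the server is ready.
    return hub_request(hub_url, token, "GET", f"/users/{username}")


if __name__ == "__main__":
    # Hypothetical Hub URL, token, and service-account name.
    req = start_server_request("https://hub.example.org", "SECRET", "ci-bot")
    print(req.get_method(), req.full_url)
```

Sending a request would be a matter of passing it to `urllib.request.urlopen`. How the book build itself would then run inside the spawned server (for example, invoking `jupyter-book build`) and how the built HTML would get back to the CI job are exactly the open questions for discussion.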
I propose we engage the team at 2i2c to discuss how we can leverage existing grants to help pursue this capability in a way that is informed by the latest best practices in Jupyter Book and JupyterHub, with the goal of delivering a solution that is useful not only to Pythia / Pangeo but to the broad Jupyter community.
I welcome feedback on the feasibility of this idea, as well as technical discussion of how it might be implemented.
Shamelessly tagging some folks who work on this stuff: @brian-rose, @sgibson91, @choldgraf, @yuvipanda.