Statement of Need: Integrating JupyterBook and JupyterHubs via CI

Context

This post is an action item that emerged from a recent meeting organized by @maxrjones and attended by @jbednar, @clyne, @ktyle, and myself. The topic was how to integrate and harmonize the different “notebook examples” websites that exist in the Pangeo ecosystem. Specifically:

These projects all share similar goals:

  • Provide a pretty public website with examples of real-world geoscience using Python tools
  • Support an open contribution process, whereby the content can be expanded and updated via GitHub PRs
  • Use CI to automatically execute the notebooks to ensure correctness
  • Support different environments for different groups of notebooks

We all basically agreed that Pythia Cookbooks is the best foundation for the future, and we, the maintainers of the other two, agreed to migrate our content to Pythia Cookbooks. (This is already underway for some of Pangeo Gallery.) Most importantly, Pythia seems to have the most sustainable organization for long-term maintenance of the gallery infrastructure, given its affiliation with NCAR.

How does Pythia Cookbooks work?

(Someone from Pythia please feel free to offer corrections; this is just my rough summary.)

Each cookbook is a repo within the ProjectPythiaCookbooks organization

Each repo follows the same template

The template is based on JupyterBook. Each repo contains notebooks, environment configuration, and CI scripts to automate building and deployment of the book. Each book is published via GitHub pages to its own path within the cookbooks.projectpythia.org domain, for example: https://cookbooks.projectpythia.org/cesm-lens-aws-cookbook/

It’s a very neat system for managing and automating publication of a large, crowdsourced collection of many Jupyter Books! :trophy:

What’s Missing: A Build System for Compute- and Data-Intensive Books

We all agree that it’s important for such a system to execute the notebooks from scratch automatically (rather than allow users to check in executed notebooks) for the following reasons:

  • It ensures the code actually works
  • It ensures the outputs are consistent with the inputs
  • It guarantees reproducibility of the results and facilitates a good experience on Binder

Pythia Cookbooks uses GitHub workflows to build the notebooks. (They even created a custom GitHub Action: ProjectPythiaCookbooks/cookbook-actions, a set of reusable workflows used by Project Pythia Cookbooks.) The problem is that the GitHub workflow runner has limited compute resources and can’t support “expensive notebooks.”

To work around this limitation in Pangeo Gallery, we developed Binderbot, which allowed us to leverage Binder within a CI job to outsource the job of running the notebook to a Binder Hub. In the case of Pangeo Binder, this allowed us to have more resources, access data on the cloud directly (without paying egress fees), and even start up Dask Gateway clusters, all within our notebooks. This was a cool feature, but the way we wired it all together was a bit hacky. Also, Pangeo Gallery did not take advantage of Jupyter Book, which barely existed when we started.

Proposal: Develop the capability to build Pythia Cookbook books in any JupyterHub or BinderHub

This is a relatively general problem for the Jupyter / Jupyter Book community, and so an ideal solution would also be quite general. I will refrain from offering specific implementation ideas at this point. Suffice it to say, we could leverage the JupyterHub API to connect the Cookbook build system to any running Hub and execute the book build in that environment, rather than in the CI environment. That would allow the notebook to use any resources that are available in a particular Hub, including special data, GPUs, etc.
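
To make the JupyterHub API idea a bit more concrete, here is a minimal sketch (not a proposed implementation) of how a build system could ask a Hub to start a server through the Hub's REST API. The Hub URL, the `cookbook-bot` service user, and the token are all hypothetical placeholders; the sketch only builds the request and does not send it, and it assumes the Hub is served at the root of its domain.

```python
import json
import urllib.parse
import urllib.request


def hub_request(hub_url, path, token, method="GET", payload=None):
    """Build (but do not send) an authenticated JupyterHub REST API request."""
    url = hub_url.rstrip("/") + "/hub/api" + path
    data = json.dumps(payload).encode("utf-8") if payload is not None else None
    req = urllib.request.Request(url, data=data, method=method)
    # JupyterHub accepts API tokens in the Authorization header.
    req.add_header("Authorization", f"token {token}")
    if data is not None:
        req.add_header("Content-Type", "application/json")
    return req


def start_server_request(hub_url, username, token):
    """Request asking the Hub to start the given user's single-user server."""
    user = urllib.parse.quote(username, safe="")
    return hub_request(hub_url, f"/users/{user}/server", token, method="POST")


if __name__ == "__main__":
    # Hypothetical Hub, user, and token, for illustration only.
    req = start_server_request("https://hub.example.org", "cookbook-bot", "SECRET")
    print(req.get_method(), req.full_url)
```

Once a server is running, the book build would still need a way to push notebooks in and pull executed notebooks out, which is the part that needs real design work.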

I propose we engage the team at 2i2c to discuss how we can leverage existing grants to help pursue this capability in a way that is informed by the latest best practices in Jupyter Book and Jupyter Hub, with the goal of delivering a solution that is useful not only to Pythia / Pangeo but to the broad Jupyter community.

I welcome feedback on the feasibility of this idea, as well as technical discussion of how it might be implemented.


Shamelessly tagging some folks who work on this stuff: @brian-rose, @sgibson91, @choldgraf, @yuvipanda.

Thanks for that summary, Ryan! My group will be very happy to adapt our work to the Project Pythia infrastructure, and we look forward to a solution to the “expensive notebook” issue so that we can contribute more ambitious work.

I think this is a great idea. In my opinion, the “notebook → execution via JupyterHub/Binder → jupyter book” loop is an important part of:

To flesh it out, I think that there are a few potentially separable steps here:

  • Given a list of notebook locations
  • For each item in the list:
    • Fetch the notebook
    • Determine the environment needed to execute it
    • Send the notebook to a service that knows how to execute notebooks with a flexible environment
    • Do the execution and send the result back
    • Store the new notebook someplace
  • Generate a nice HTML output that displays all of the notebooks in a gallery.
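
The separable steps above can be sketched as a small loop in which every step is a pluggable function. This is only an illustration of the structure: `run_gallery` and the four callables (`fetch`, `resolve_env`, `execute`, `store`) are hypothetical names, not part of binderbot, Jupyter Book, or any existing tool, and the final HTML step is left to something like Jupyter Book.

```python
from typing import Callable, Iterable, List


def run_gallery(
    locations: Iterable[str],
    fetch: Callable[[str], str],
    resolve_env: Callable[[str], str],
    execute: Callable[[str, str], str],
    store: Callable[[str, str], str],
) -> List[str]:
    """Run each notebook through fetch -> resolve_env -> execute -> store.

    Returns whatever `store` returns for each notebook (for example, the
    path or URL where the executed notebook was saved).
    """
    stored = []
    for location in locations:
        source = fetch(location)           # fetch the notebook
        env = resolve_env(location)        # determine the environment it needs
        executed = execute(source, env)    # hand off to a remote execution service
        stored.append(store(location, executed))  # keep the executed notebook
    return stored
```

Keeping `execute` as an injected callable is the point of the sketch: the same loop could target GitHub Actions, a BinderHub, or a JupyterHub without changing the rest of the pipeline.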

Two other places to think of for inspiration:

I’d also like to tag-in @jmunroe who has also been working with Pythia around training material and infrastructure.

FWIW, binderbot already does all of the above. So perhaps this is mostly just doing some work to make binderbot more robust and integrating it somehow with Jupyter Book?

@rabernat thanks for articulating this so clearly!

I think your description of the current state of Pythia Cookbooks is accurate. We chose to build the Cookbooks on GitHub Actions as a “for now” solution (and based on our experiences with building Pythia Foundations, which leverages the same stack). But the “expensive notebook” issue is something that we definitely need to solve if this platform is going to help our field move forward.

I’m ignorant about many of the implementation challenges, but I do think that a “BinderBot + JupyterBook” solution makes a lot of sense. We want a portable execution + publishing platform that isn’t tied to a specific execution environment (such as GitHub Actions). That has always been the vision for Pythia Cookbooks but we haven’t got there yet.

So that’s my long-winded way of saying “yes let’s do this, I’m not sure how but please keep me in the loop”

Thanks for summarizing this discussion @rabernat and everyone else here! I like the BinderBot + JupyterBook approach mentioned here, and would be willing to help with these efforts.

I apologize for missing yesterday’s meeting about this - I am at a conference this week (on European time), and will be out next week for vacation, but am willing to contribute towards this/help with next steps the week of September 12th.

I think starting with some of the existing cookbooks as a use-case would be great, and figuring out how we can enable this directly within the JupyterBook project would be fantastic!

Very interested to participate in this as well! For hackweeks at UW we create crowd-sourced jupyterbooks, but executing the notebooks via GitHub Actions has forced us into a few simplifications: 1. Tutorial notebooks are all in the same repo, and 2. We ask everyone to keep examples within the computational limits of Actions runners (which is actually OK, since it also enables mybinder.org, which has about the same resource limits). We ended up using a somewhat complex environment setup with conda-lock to ensure the same environment on both the hub and CI runners.

how it might be implemented.

I’m sure there are lots of options, but one that’s been discussed previously:

  1. Build a Docker image and push it someplace public (e.g. quay.io). This can easily be done with GitHub Actions.
  2. Allow JupyterHub/Binder to spawn from an existing image directly (see jupyterhub/binderhub issue #1298, “Option to launch a binder directly from a dockerhub image (bypass repo2docker completely)”).
  3. Allow parameters in the API execution URL to specify computational requirements (cloud region, CPU, RAM, GPU, etc.) (see jupyterhub/binderhub issue #731, “Select pod resources from binder UI”).

Hi there, I am also interested in helping here, either on the Pangeo Gallery notebook migration or on developing some part of the new build system once a technical solution is agreed upon!

@rabernat, this looks great. How do we move forward? Thanks!

We have identified a clear community need. In my opinion, two things are needed to move forward.

  1. A technical lead: someone who will take ownership of the overall architecture design and see that the various open-source pieces plug together in a way that solves the original use cases. The lead can identify specific development that needs to occur and open issues in the relevant projects. This could be someone from Pythia, or it could potentially be someone from 2i2c such as @choldgraf. This is a question of who has the bandwidth. But without a single lead pushing this forward, I don’t think it will succeed.
  2. Developer time. We have already identified several areas where new features and functionality are needed in existing open source packages. (The technical lead will surely identify more in the course of thinking this through.) Implementing these features will take time from skilled developers. How much time, and what skills are needed, is not yet fully clear.

Both of these things ultimately cost money in the form of people’s time. That time can come from either folks who are already allocated to work on Pythia, or, in the case of 2i2c, it can come via subcontracts. (To be clear, I don’t even know if 2i2c has the capacity to take on more work even if money is available.) For my part, I am happy to see the current subcontract between Columbia and 2i2c support this effort, as it aligns directly with the existing scope and aims of that contract. However, Chris would need to tell me whether that is feasible, or whether 2i2c developer time is already saturated on existing projects.

Pythia folks should also perhaps be asking whether a subcontract with 2i2c could complement their efforts around Jupyter / Jupyter Book and enable development outcomes to be achieved more quickly.

Quick update: The Pythia Infrastructure Working Group met today and was joined by @jmunroe. The team wants to do a deeper dive into the implications for the Pythia team. We’ll follow up in a couple of weeks.

Cc: @brian-rose, @ktyle

I’ve been tinkering with a potential “BinderBot + JupyterBook” solution for farming out the execution of notebooks to an external Binder service before running jupyter-book build.

I have a draft PR open on the Pythia Foundations repo and would love some eyes on it!

It’s pretty straightforward and I’m surprised that this actually works out of the box, but BinderBot is a pretty neat tool! (thanks @rabernat)

Basically what happens here is:

  • An environment is created on GitHub Actions that includes jupyter-book and binderbot
  • We use binderbot to
    • execute all the notebooks on an existing BinderHub instance
    • download the executed notebooks back to GitHub Actions
  • Call jupyter-book build to render the book using the executed notebooks
  • Display the preview of the rendered book
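
For readers who want to picture the steps above outside of a workflow file, here is a rough sketch that assembles the two commands and, by default, only prints them. It is not the content of the draft PR. The `jupyter-book build` call is the standard Jupyter Book CLI; the binderbot flags (`--binder-url`, `--repo`, `--ref`, `--output-dir`) are written from memory of its command-line help and should be treated as an assumption to verify with `python -m binderbot.cli --help`. The Binder URL, repository, and directory names are hypothetical placeholders.

```python
import shlex
import subprocess
from typing import List, Sequence


def binderbot_command(binder_url: str, repo: str, ref: str,
                      notebooks: Sequence[str], output_dir: str) -> List[str]:
    """Command that executes notebooks on a remote BinderHub.

    Flag names are an assumption; check them against binderbot's --help.
    """
    return [
        "python", "-m", "binderbot.cli",
        "--binder-url", binder_url,
        "--repo", repo,
        "--ref", ref,
        "--output-dir", output_dir,
        *notebooks,
    ]


def jupyter_book_command(book_dir: str) -> List[str]:
    """Command that renders the book from the already executed notebooks."""
    return ["jupyter-book", "build", book_dir]


def run_pipeline(commands: Sequence[Sequence[str]], dry_run: bool = True) -> None:
    """Print each command and, unless dry_run is set, run it and stop on failure."""
    for cmd in commands:
        print(shlex.join(cmd))
        if not dry_run:
            subprocess.run(list(cmd), check=True)


if __name__ == "__main__":
    # Hypothetical BinderHub, repository, and paths, for illustration only.
    run_pipeline([
        binderbot_command("https://binder.example.org", "example-org/example-book",
                          "main", ["intro.ipynb"], "executed"),
        jupyter_book_command("executed"),
    ])
```

For the pre-executed outputs to be used as-is, the book's own notebook execution would presumably need to be turned off in its configuration, so that `jupyter-book build` renders the downloaded notebooks rather than re-running them on the CI runner.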

In the open PR, I’m taking advantage of an experimental BinderHub instance that @ktyle is running on Jetstream2. The Pythia team is most likely going to scope out a request for a more permanent allocation on that service. But the great thing about this approach is that you could point binderbot toward any available BinderHub.

Then, as long as the “Binder launch” buttons on the JupyterBook pages are pointing to the same service, the user can launch directly into the same environment that actually rendered the book pages. Pretty cool!
