This is the first of a series of posts where I brain-dump some of my ideas for the future of Pangeo’s cloud infrastructure.
We are now operating two JupyterHubs (GCP and AWS) and two BinderHubs (GCP and AWS). All four hubs are configured to let users deploy Dask Gateway clusters. The JupyterHubs have been very useful for research. (The GCP hub alone has hundreds of users.) The BinderHubs have been great for sharing reproducible workflows, demos, training workshops, etc. Pangeo Binder underlies much of Pangeo Gallery.
I think it’s safe to say that the biggest recurring pain points for the JupyterHubs are:
- The use of a single global environment for all users
- The difficulty of updating the environment (a PR to https://github.com/pangeo-data/pangeo-docker-images, followed by a PR to https://github.com/pangeo-data/pangeo-cloud-federation/ to point to the new Docker image, merged first to staging and then to prod)
- Desire for customized compute resources for privileged users (e.g. GPUs, more RAM, etc.)
- The difficulty of sharing content from the hubs with other hub users and with the broader public
There are at least two main solution paths for these issues, described below.
The default choice is to work to make the hubs more flexible. We could customize the spawner to allow a range of different Docker images, and use KubeSpawner's `profile_list` option to offer different machine types. On the sharing front, we could integrate more tightly with GitHub to make it easier to move code into and out of the hub home directory.
This solution will still ultimately require the hub admins to select which docker images can be used. It also involves considerable dev work to customize the spawner.
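For concreteness, here is a rough sketch of what the profile-list approach could look like in a Zero to JupyterHub `values.yaml` (the image names, display names, and GPU limit are illustrative assumptions, not our actual configuration):

```yaml
# Sketch of a Zero to JupyterHub config exposing multiple environments.
# Users would pick a profile from a menu at spawn time.
singleuser:
  profileList:
    - display_name: "Standard Pangeo environment"
      default: true
      kubespawner_override:
        image: pangeo/pangeo-notebook:latest
    - display_name: "Machine learning (GPU)"
      kubespawner_override:
        image: pangeo/ml-notebook:latest
        extra_resource_limits:
          nvidia.com/gpu: "1"
```

Even with a menu like this, note that someone still has to curate the menu, which is exactly the administrative bottleneck described above.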
It’s also worth noting that, as Pangeo and 2i2c expand, many similar JupyterHubs are likely to proliferate. This may lead to balkanization of the current community, as minor differences between these hubs could hinder the interoperability and portability of projects.
Binder already offers a solution to many of these pain points. Specifically, Binder places the responsibility for configuring the environment in user land; the service just provides a base layer on which to run Binder-compatible Docker images. Users could bring their own images or choose from an existing menu (see jupyterhub/binderhub#1298, “Option to launch a binder directly from a dockerhub image (bypass repo2docker completely)”). This would remove a major piece of administrative toil while empowering users to customize their own environments. In fact, I find the Binder experience so liberating that I often work directly from a running binder when developing a new project with an experimental, custom environment, despite the obvious downsides.
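For most users, “configuring the environment in user land” just means committing a dependency file to the root of their repo, following the repo2docker convention. A minimal sketch (the name and package pins are illustrative, not an official Pangeo spec):

```yaml
# environment.yml at the root of a user's repo; repo2docker builds
# the image from this file, so no hub admin involvement is needed.
name: my-analysis
channels:
  - conda-forge
dependencies:
  - python=3.9
  - xarray
  - dask
  - zarr
```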
The downsides of Binder in its current form are:
- Home directory data are not persistent
- Compute resources are limited
However, both of these can be mitigated with an authenticated Binder. This thread goes into that in detail.
Going a step further, I am not even convinced we need persistent home directories. The reliance on a traditional UNIX-style home directory is perhaps one of the biggest barriers to developing cloud-native workflows. Cloud-native workflows should never touch a local disk, instead pushing / pulling all their data over HTTP via APIs. By eliminating the reliance on a home directory, we could push our users toward better, more reproducible practices. The key element for this to work would be much tighter integration with GitHub. If user code could seamlessly sync to / from Git repos, there would be no risk of losing work, and everything would be automatically version controlled. This is how Overleaf works. VS Code also recently gained support for this type of workflow.
To summarize this vision, I made this little cartoon to illustrate how it would all fit together. BinderHub in the center, with different cloud services playing different roles in a flexible, modular system.