Future of Pangeo Cloud I: Binder for Everything?

This is the first of a series of posts where I brain-dump some of my ideas for the future of Pangeo’s cloud infrastructure.

The Current Situation

We are now operating two JupyterHubs (GCP and AWS) and two BinderHubs (GCP and AWS). All four hubs are configured to let users deploy Dask Gateway clusters. The hubs have been great and very useful for research. (The GCP hub has hundreds of users.) The BinderHubs have been great for sharing reproducible workflows, demos, training workshops, etc. Pangeo Binder underlies much of Pangeo Gallery.

I think it’s safe to say that the biggest recurrent pain points for the JupyterHubs are

  • software environments: users are limited to the admin-maintained docker images (and machine types), so customizing an environment requires admin intervention, and
  • sharing: it’s awkward to move code and reproducible workflows in and out of the hub home directories.

There are at least two main solution paths for these issues, enumerated below.

Solution A: Make the JupyterHubs more Flexible / Customized

The default choice is to work to make the hubs more flexible. We could customize the spawner to allow a range of different docker images, plus use KubeSpawner’s profile_list to enable different machine types. On the sharing front, we could work to integrate better with GitHub to make it easier to move code in / out of the hub home directory.
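As a rough sketch (not a tested Pangeo configuration; image names and resource sizes are placeholders), the existing KubeSpawner profile_list mechanism already gets us part of the way via something like this in jupyterhub_config.py:

```python
# jupyterhub_config.py -- sketch only, assuming a Zero-to-JupyterHub / KubeSpawner setup.
c.KubeSpawner.profile_list = [
    {
        "display_name": "Pangeo notebook (small)",
        "description": "Standard Pangeo image, 2 CPU / 8 GB RAM",
        "default": True,
        "kubespawner_override": {
            "image": "pangeo/pangeo-notebook:latest",
            "cpu_limit": 2,
            "mem_limit": "8G",
        },
    },
    {
        "display_name": "ML notebook (GPU)",
        "description": "ML image with one GPU attached",
        "kubespawner_override": {
            "image": "pangeo/ml-notebook:latest",
            "extra_resource_limits": {"nvidia.com/gpu": "1"},
        },
    },
]
```

This gives users a menu, but every entry on that menu still has to be curated by the hub admins.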

This solution will still ultimately require the hub admins to select which docker images can be used. It also involves considerable dev work to customize the spawner.

It’s also worth noting that, as Pangeo and 2i2c expand, many similar JupyterHubs are likely to proliferate. This may lead to balkanization of the current community, as minor differences in these hubs may hinder interoperability / portability of projects.

Solution B: Binder for Everything

Binder already offers the solution to many of the pain points. Specifically, binder places the responsibility for configuring the environment in user land. The service just provides a base layer on which to run binder-compatible docker images. Users could bring their own images or choose from an existing menu (see jupyterhub/binderhub#1298, “Option to launch a binder directly from a dockerhub image (bypass repo2docker completely)”). This would remove a major administrative toil-task while empowering users to customize their own environments. In fact, I find the binder experience to be so liberating that I often work directly from a running binder when developing a new project with an experimental, custom environment, despite the obvious downsides.
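To give a sense of how thin that base layer is, here is a rough sketch (not an official client; the hub URL and repo spec are just examples) of driving a BinderHub programmatically through its public build API, which streams build/launch events and, once ready, hands back the URL and token of the launched server:

```python
import json
import requests

# Example values only -- any BinderHub deployment and any binder-ready repo would do.
BINDER_URL = "https://mybinder.org"
SPEC = "gh/binder-examples/requirements/HEAD"

# The /build endpoint streams server-sent events describing the build and launch.
with requests.get(f"{BINDER_URL}/build/{SPEC}", stream=True) as resp:
    for line in resp.iter_lines():
        if not line.startswith(b"data:"):
            continue
        event = json.loads(line[len(b"data:"):])
        print(event.get("phase"), event.get("message", "").strip())
        if event.get("phase") == "ready":
            # The "ready" event carries the URL and token of the running server.
            print("Server ready at", event["url"])
            break
```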

The downsides of binder in its current form are

  • Home directory data are not persistent
  • Compute resources are limited

However, both of these can easily be mitigated with authenticated Binder. This thread goes into that in detail.

Going a step further, I am not even convinced we need persistent home directories. The reliance on a traditional UNIX-style home directory is perhaps one of the biggest barriers to developing cloud-native workflows. Cloud-native workflows should never touch a local disk, instead pushing / pulling all their data over HTTP via APIs. By eliminating the reliance on a home directory, we could push our users toward better, more reproducible practices. The key element for this to work would be much tighter integration with GitHub. If user code could seamlessly sync to / from Git repos, there would be no risk of losing work, and everything would be automatically version controlled. This is how Overleaf works, and VS Code also recently gained support for this type of workflow.
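Purely as an illustration of the idea (a toy sketch, not a concrete proposal; a real integration would build on tools like jupyterlab-git or nbgitpuller plus proper credentials), the sync itself could be as simple as a loop that commits and pushes whatever is in the working tree:

```python
import subprocess
import time

def autosync(repo_dir=".", interval=300):
    """Periodically commit and push the working tree of an existing git clone."""
    while True:
        subprocess.run(["git", "-C", repo_dir, "add", "-A"], check=True)
        # git commit returns non-zero when there is nothing to commit; that's fine here.
        subprocess.run(["git", "-C", repo_dir, "commit", "-m", "autosave"], check=False)
        subprocess.run(["git", "-C", repo_dir, "push"], check=False)
        time.sleep(interval)
```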

To summarize this vision, I made a little cartoon illustrating how it would all fit together: BinderHub in the center, with different cloud services playing different roles in a flexible, modular system.

cc @choldgraf, @consideRatio, Sarah Gibson, Damian Ávila.


Adding a point I forgot to make in my original post:

One great advantage of this approach is that it makes it much easier to scale / franchise Pangeo cloud infrastructure. Any organization that wants to support Pangeo infrastructure can just run a generic Binder service, which would be identical in all clouds and not specifically customized to any one group. Users could move seamlessly between different clouds with zero change in user experience, since the environment is totally decoupled from the hub.

I feel this is way more scalable / sustainable than running 100 slightly different Pangeo JupyterHubs.

Thanks for this great article @rabernat

What is perhaps worth noting is that such an infrastructure would benefit not only geoscience. The persistent BinderHub (aka notebooks.gesis.org) mentioned above was developed for the social sciences.

Yet the requirements are (un)surprisingly similar.


@arnim.bleier indeed! This could actually be a generic backplane for all kinds of scientific infrastructure. Let’s make it happen! 😄

Your work on the Gesis binder really opened my eyes to these possibilities. Thank you so much for your contributions!

Great post, @rabernat! The idea of doing away completely with persistent home directories is interesting; I’m guessing this is also how GitHub Codespaces will work (Codespaces · GitHub).

I also think it would be amazing to have a federated binderhub service where it’s easy to select:

  1. cloud provider
  2. data center / region
  3. single-machine resources (GPU, RAM, CPU)

That alone would be highly valuable. Another key innovation of the current Pangeo BinderHubs is dask-gateway integration to scale beyond a single machine. So a natural second step would be a federation of dask gateways, or alternatively, docs on how to connect to existing managed Dask cluster services such as Coiled or Microsoft Planetary Computer.
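For reference, that single-machine-to-cluster step is already pleasantly uniform on the current Pangeo hubs: from inside a binder session, connecting to the hub’s Dask Gateway looks roughly like this (assuming the deployment pre-configures the gateway address and credentials in the environment, as the Pangeo deployments do):

```python
from dask_gateway import Gateway

# No arguments needed when the gateway address/auth are injected by the hub.
gateway = Gateway()
cluster = gateway.new_cluster()   # optionally pass gateway.cluster_options()
cluster.scale(4)                  # ask for 4 workers
client = cluster.get_client()     # Dask client bound to this cluster
print(client.dashboard_link)
```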


Thanks for summarizing these ideas @rabernat! I very much agree with this vision, and I think the basic pieces to make this a reality are all in place, though a lot of end-point polish and user experience improvements are still needed.

I think we should work on formalizing a bit more explicitly the layers of conventions/standards needed to make this possible, so that federated use and interoperability are smoother. Today with Binder we have a lot of that in place, but it’s a bit ad hoc and some of it quite opaque (e.g., I can’t tell off the top of my head how to make a highly customized Docker image that would still be binder-safe, and the docs do mention that that’s a “here be dragons” area).

But we have the right foundation, and with the vision of open interoperability that is embraced by all our inter-connected projects (jupyter, binder, pangeo, …), it would be great to establish more of this in practice (in the spirit of the 2i2c “right to replicate” post) before the air gets taken out of the room by vertically integrated, highly proprietary solutions.


I’m glad to hear this idea resonates with folks!

Perhaps we could carve out some time for a sprint on the topic of “layers of conventions / standards” at the November Jupyter / Pangeo workshop?


Thanks for your thoughts! I wanted to offer some perspective from the bioimaging community. We maintain a JupyterHub deployment for our resident data scientists and clients at NIH. The need for customized images resonates with what we would like to do as well. To that end, we created a way to describe environments using a combination of Jinja2-templated Dockerfiles and human-readable YAML files. That allows us not only to quickly update dependencies (e.g., change a pip package version in the YAML through a pull request, wait for CI to rebuild, and use it), but also to enable customizable environments where users choose which (predefined) packages they need and our custom JupyterHub spawner builds an image with only the needed features for them.
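For illustration, a toy version of that pattern (not our actual tooling, just a sketch) looks something like this: a human-readable YAML spec rendered into a Dockerfile through a Jinja2 template.

```python
import yaml
from jinja2 import Template

# Hypothetical spec and template, inlined here so the sketch is self-contained.
SPEC_YAML = """
base_image: condaforge/mambaforge:latest
pip:
  - xarray
  - dask[complete]
"""

DOCKERFILE_TEMPLATE = """\
FROM {{ base_image }}
{% if pip %}RUN pip install --no-cache-dir {{ pip | join(' ') }}{% endif %}
"""

spec = yaml.safe_load(SPEC_YAML)
print(Template(DOCKERFILE_TEMPLATE).render(**spec))
```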
The dream, however, is to be able to derive the environment separately for each notebook based on metadata, either automatically from the imports used in the notebook cells or from user input. That would also simplify converting notebooks into reusable, containerized pipeline steps in the future.
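Just to sketch the automatic route (a toy example; mapping import names to pinned packages would still need extra metadata), scanning a notebook’s code cells for imports already gets part of the way:

```python
import ast
import json

def imported_modules(notebook_path):
    """Collect top-level module names imported in a notebook's code cells."""
    with open(notebook_path) as f:
        nb = json.load(f)
    modules = set()
    for cell in nb["cells"]:
        if cell["cell_type"] != "code":
            continue
        try:
            tree = ast.parse("".join(cell["source"]))
        except SyntaxError:  # e.g. cells containing IPython magics
            continue
        for node in ast.walk(tree):
            if isinstance(node, ast.Import):
                modules.update(alias.name.split(".")[0] for alias in node.names)
            elif isinstance(node, ast.ImportFrom) and node.module:
                modules.add(node.module.split(".")[0])
    return sorted(modules)

# e.g. imported_modules("analysis.ipynb") -> ["dask", "numpy", "xarray"]
```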


Thanks @rabernat for this vision! I agree with you and the folks here on the outlined goals and solution; it is clearly the best path for cloud environments and for establishing standards for Pangeo cloud infrastructure.

However, I’d be careful about letting users “bring their own images or choose from an existing menu”; this can be a step away from reproducibility. One of the cool things about Binder is that it forces a clear description of your environment and forces you to version control your code too. Making this change, along with authentication, would somewhat reduce the gap between JupyterHub and BinderHub…

In the end, if you give Binder the ability to select your own image or one from a list, and also to configure the compute resources needed, isn’t the dev time roughly the same as customizing the JupyterHub spawner form (noticeable, but maybe not considerable)? Couldn’t we also let users bring their own images via the JupyterHub spawner, and what would be the difference in the end (I may be missing something here)?

So I think we need to establish standards / rules / goals here, and then see what the solutions are; you already did that:

  • Place the responsibility for configuring the environment in user land, by using repo2docker or the user’s own image, and also letting users choose machine resources.
  • Make it much easier to scale / franchise Pangeo cloud infrastructure, by providing a generic Binder (or JupyterHub?) service, identical in all clouds and not specifically customized to any one group.
  • Optionally: cloud-native workflows should never touch a (persistent) local disk; eliminate the persistent home directory by syncing to Git repos (or another storage solution, like Google Drive and so on?).

It might be hard to achieve all three goals here.

I think one of the key questions is really whether we want to allow using Docker Hub (or equivalent) images directly. I’m under the impression that the work at notebooks.gesis.org wanted to keep the “git repo only” feature of Binder, which forces reproducibility.

We might also take a look at Coiled’s solution, which gives another way of addressing all these goals.
