Notes from the Pangeo // 2i2c kick-off meeting

Background

The International Interactive Computing Collaboration (2i2c) is a non-profit organization that manages, develops, and supports open-source workflows for interactive computing. 2i2c and Pangeo are kicking off a collaboration whereby 2i2c will operate and develop Pangeo’s cloud infrastructure and begin pushing forward new development in this area, allowing the Pangeo community to focus their efforts on research and other development work.

Last Friday, we had a meeting between @rabernat and several members of the 2i2c team (@sgibson91, @yuvipanda, Damian, @consideRatio, and myself) to discuss this collaboration! Here are a few highlights and ideas we discussed at the meeting.

For others that attended the meeting, feel free to make edits or add any other thoughts that you’d like to emphasize!

Operation: Migrate Pangeo hub infrastructure to 2i2c’s deployment repositories

2i2c will operate Pangeo’s cloud infrastructure via the 2i2c deployment repositories (currently in this repository). We will oversee this infrastructure and continue to improve it over time, allowing the Pangeo community to focus more of their effort on research and on other parts of the scientific Python stack. These hubs will continue to use the Pangeo community’s cloud credits; they will simply be managed by 2i2c’s team.

Next steps

  • To migrate a hub
    • Create 2i2c hub infrastructure that mimics Pangeo deployments (including environments, auth, etc)
    • Connect these hubs to projects controlled by @rabernat, funded by credits in his account (or, in the case of AWS hubs, by the UW account, depending on their plans)
    • Move over home directories for Pangeo users to the new project / deployment
    • Confirm that the new deployments work as expected, spin down the old deployment
    • After this, we hit steady state and move into general operation over time.
  • Hubs to prioritize for this process
    • JupyterHub on GCP
    • BinderHub on GCP
    • JupyterHub on AWS (depending on what @scottyhq and the UW team wants to do)
    • BinderHub on AWS (depending on what @scottyhq and the UW team wants to do)

To do

  • @sgibson91 familiarizes herself with the 2i2c deployment infrastructure, and tries out a few deploys to get ready for the Pangeo migration
  • Get @yuvipanda and @sgibson91 the proper credentials to be able to deploy on @rabernat’s projects
  • Then start following the process above. The goal is to finish this within a month
  • Ask @scottyhq whether he had plans / preferences for the AWS hubs (plans to continue running, availability of credits, etc) and plan on migrating AWS hubs accordingly.

Development: Potential projects to focus on

@sgibson91 and Damian will participate in Pangeo spaces and conversations, and use these interactions to drive new cycles of development that serve the Pangeo community. @rabernat shared this Pangeo community post as a good example of what this could look like in practice.

Visions to work towards

Modular, stateless workflows in the cloud. Consider use-cases where individuals can easily migrate their work between Pangeo (or 2i2c) hubs, taking their files, credentials, environments, etc. wherever they like. Users should be able to do their work anywhere and easily export or share their work and the environment needed to reproduce it. In Pangeo’s case, there should be as little hub-specific content and configuration as possible, and it should be relatively easy for Pangeo users to move between hubs in the federation.

Publishing and sharing pipelines with cloud infrastructure. Build a cloud-based workflow that allows users to host notebooks in GitHub repositories, and share/publish those notebooks automatically in a reproducible and discoverable manner. Between pangeo gallery, pangeo forge, JupyterHub/BinderHub infrastructure, and GitHub Actions, we have all the building blocks necessary for full publishing pipelines; we just need to stitch them together and figure out the right UX.
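
As a rough sketch of just the “execute” step of such a pipeline, here is how a CI job (e.g. a GitHub Action) might run a notebook with nbclient before publishing it. The paths and kernel name are placeholders, not part of any existing Pangeo workflow:

```python
import nbformat
from nbclient import NotebookClient

# Read the notebook from the repository checkout.
nb = nbformat.read("notebooks/example.ipynb", as_version=4)

# Execute it top-to-bottom with a fresh kernel, as a CI job would.
client = NotebookClient(nb, timeout=600, kernel_name="python3")
client.execute()

# Write the executed notebook back out for the publishing step (e.g. pangeo gallery).
nbformat.write(nb, "executed/example.ipynb")
```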

A few specific ideas

Connect hubs with external file systems. Hubs rely heavily on a local filesystem structure, which makes notebooks hub-dependent. We should look into ways to use other services (e.g. Dropbox, GitHub, Google Drive) to store the files that you use in a hub; the fsspec project is one possibility. This would let you take your work between hubs much more easily, as well as collaborate with others with a single “source of truth” for your content.
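
As a small illustration of the fsspec idea, here is how a notebook could read content directly from a GitHub repository rather than from the hub’s home directory. The org/repo here are just an example:

```python
import fsspec

# Browse a GitHub repository as if it were a local filesystem.
fs = fsspec.filesystem("github", org="pangeo-data", repo="pangeo-docker-images")
print(fs.ls(""))  # list files at the repository root

# Open a file straight from the repo, no clone or hub home directory required.
with fs.open("README.md") as f:
    print(f.read().decode()[:200])
```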

Make it easier to share notebooks and environments in a hub. Currently it is difficult to share notebooks, and the environment needed to run them, within a hub. Projects like nbgallery, nbgitpuller, etc. have taken steps in this direction, but there is still not a “seamless” experience for sharing notebooks between team members. We’d like this to be as easy as possible on a Pangeo hub, with minimal steps needed.
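
For reference, the closest thing today is an nbgitpuller link. A quick sketch of building one (the hub URL, repository, and notebook path below are placeholders):

```python
from urllib.parse import urlencode

hub_url = "https://example-hub.pangeo.io"
params = {
    "repo": "https://github.com/pangeo-data/pangeo-tutorial-gallery",
    "branch": "master",
    "urlpath": "lab/tree/pangeo-tutorial-gallery/xarray.ipynb",
}

# Anyone clicking this link is redirected to their own server on the hub,
# with the repository pulled/updated and the notebook opened.
share_link = f"{hub_url}/hub/user-redirect/git-pull?{urlencode(params)}"
print(share_link)
```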

Remove artificial distinctions between BinderHub/JupyterHub. Currently BinderHub and JupyterHub have strong distinctions between them, but this is not strictly necessary. Binder-like functionality should be a feature flag on any JupyterHub; BinderHub could then cease to be a standalone tool and instead become a “particular configuration of a JupyterHub”.

Improved experience in defining user permissions. This would make it easier to grant certain abilities to subsets of users, and to expose this configurability to a JupyterHub admin in a way that doesn’t require digging into Kubernetes internals.
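
For context, here is roughly what permission-related configuration looks like today in a jupyterhub_config.py; the usernames and machine profiles are made up, and anything finer-grained than this quickly drops into Kubernetes/Helm territory:

```python
# jupyterhub_config.py -- `c` is provided by JupyterHub's config loader.
# Usernames and profiles below are purely illustrative.

# Today's model is fairly coarse: users are either "admins" or regular users.
c.Authenticator.allowed_users = {"researcher-a", "researcher-b"}
c.Authenticator.admin_users = {"hub-admin"}

# Controlling who may launch which machine profile currently means editing
# spawner config like this and redeploying, rather than using a hub-admin UI.
c.KubeSpawner.profile_list = [
    {
        "display_name": "Small: 2 CPU / 8 GB",
        "kubespawner_override": {"cpu_limit": 2, "mem_limit": "8G"},
    },
    {
        "display_name": "Large: 16 CPU / 64 GB",
        "kubespawner_override": {"cpu_limit": 16, "mem_limit": "64G"},
    },
]
```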

Improved documentation and maintenance around DaskHub. DaskHub could use development and support, both around its internals and around its documentation for how others could deploy their own DaskHubs (related to Coordinate use-cases / infra around JupyterHubs+other applications on Kubernetes · Issue #382 · jupyterhub/team-compass · GitHub).
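
For anyone unfamiliar with DaskHub, the user-facing workflow it enables is roughly the following; a minimal sketch assuming dask-gateway is configured on the hub:

```python
from dask_gateway import Gateway

# On a DaskHub, the gateway address and credentials are injected into the
# user environment, so Gateway() needs no arguments.
gateway = Gateway()

cluster = gateway.new_cluster()   # request a Dask cluster from the gateway
cluster.scale(4)                  # ask for 4 workers
client = cluster.get_client()     # connect a Dask client to it

# ... run distributed computations via `client` ...

cluster.close()
```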

Potential integration with pangeo-forge. Finally, we discussed the pangeo-forge project, which facilitates ETL pipelines for building cloud-native scientific datasets. We could explore how to support pangeo-forge and build ways to integrate its functionality more seamlessly with the Pangeo hubs.
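
To make the connection concrete, this is roughly the kind of step pangeo-forge automates, written out by hand with xarray and gcsfs; the file glob and bucket name are placeholders:

```python
import gcsfs
import xarray as xr

# Combine a collection of source netCDF files into a single dataset.
ds = xr.open_mfdataset("data/sst_*.nc", combine="by_coords")

# Write it out as an analysis-ready, cloud-optimized Zarr store.
fs = gcsfs.GCSFileSystem()
store = fs.get_mapper("gs://example-pangeo-bucket/sst.zarr")
ds.to_zarr(store, mode="w", consolidated=True)
```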


Hi @choldgraf, thanks for the detailed notes, and so thrilled that this collaboration is kicking off!!

The AWS Pangeo infrastructure has always been a prototype to push the cutting edge rather than provide long-term support, so it’s great to see 2i2c’s vision to continue to innovate. The UW-managed account still has credits to operate for 3-6 months at current spending rates, but our grant supporting personnel from NASA ends in September 2021. There is no migration plan; we will simply pull the plug in September or when the credit well runs dry.

I understand GCP is the priority. That said, we’d love 2i2c to operate a hub in AWS us-west-2 going forward. I think operating on at least two major cloud providers over the last year has been key for ensuring Pangeo infrastructure is truly cloud-agnostic. Sadly, I don’t think we can transfer credits or current account ownership, so it would have to be a fresh start, which probably isn’t a bad thing :slight_smile:


Thanks for clarifying, Scott! We are definitely happy to run Pangeo hubs on AWS as well; we just weren’t sure what your plans were around them :slight_smile:

For credits, they don’t need to be transferred, they just need to be connected to a project that 2i2c has the permissions to deploy to. Would that work?


Congratulations on the collaboration! Very excited to see the outcomes of these initiatives.
