What should a Pangeo 2.0 cloud tech stack look like?

Hi Folks,

I and people around me are getting a bit more focused on cloud geospatial user experience, and we're curious about what people find valuable. I'm looking to have some conversations with people who work in organizations that like the idea of the Pangeo cloud stack, even if they don't deploy or use it day-to-day.

To aid this conversation, I wrote up some thoughts yesterday on how the Pangeo cloud architecture stack started (Kubernetes, JupyterHub, Dask-Gateway), what it evolved to today, and what problems I see with it. Those thoughts are here.

If people have time to either read the article or reach out I’d welcome several brief live conversations.

Thanks!
-matt

8 Likes

Thanks for sharing Matt, and for all your contributions to this space. As someone relatively new to this community, it’s really cool to hear in your words how the Pangeo project began, progressed, and where you see current patterns running into issues. I think that kind of reflection is important for any community even though it’s not easy to do.

I think there is some risk, though, that your post will be read as "give up on open source deployments; Coiled is Pangeo 2.0".

I see much to like about Coiled, but I see Coiled deployments as complementary to, rather than an alternative to, interactive deployments through JupyterHub, much as batch computing and interactive computing are both equally valid paradigms, as are scripting vs. compiled code and research coding vs. 'putting code into production'.

I see a rich future for JupyterHubs in providing computing to researchers who don’t or can’t (e.g. due to IT policy) install and maintain a computational environment on their own machines. It is a core part of instruction and on-boarding new users. I totally agree with you that costs are often needlessly high (and often opaque) especially when these are deployed to single large nodes rented from commercial cloud providers. But I see many exciting new things in this space, not just coiled. I believe many recent developments also address each of the issues you point to as problems of pangeo 1.0.

  • The JupyterLite ecosystem for serverless/WASM-based Python is getting increasingly viable. Obviously this could play nicely with Coiled when additional computation is required, but it is not predicated on being a SaaS customer.

  • Cloud-native storage. You highlight Arraylake, which is awesome. I find the work of Source.Coop particularly exciting in this space, not least for its commitment to not rely on a VC model of infrastructure.

  • Pangeo deployment reflects the 2i2c principle of the right to replicate, which is absent in your vision of Pangeo 2.0 / the pivot to SaaS. Beyond idealism, an exciting consequence we see emerging is Pangeo-style deployments on platforms not tied to the private hyperscalers but to publicly supported infrastructure such as the National Research Platform https://nationalresearchplatform.org, which offers yet another way to address many of the systemic issues you hit upon in costs, maintenance, and scale.

I think it's really exciting to see these VC-backed companies rising up and branching out of the Pangeo community. And I'm looking forward to seeing all the ways they give back to that community in their own way. From the outside at least, Pangeo has always seemed very dynamic; it already encompasses a much larger stack (not to mention people, ideas, and architectures) than it did 6 years ago.

Perhaps an aside, but I think Pangeo has multiple facets which are not reflected in your description or in your questions about what "Pangeo 2.0" is. I have never seen Pangeo as purely about infrastructure deployment. I think we all agree Pangeo is about a lot more than just deployment (software like xarray, community like this forum, etc.). Surely you aren't suggesting that xarray has no place in your vision of Pangeo 2.0, yet it no longer gets a mention. It comes across poorly to take the current broad ecosystem of software, community, etc. that has been united by a commitment to openness, swap out one part that is not working for you with a SaaS solution in which you have a financial stake, and call the resulting suite Pangeo 2.0.

7 Likes

Hey @cboettig ! Sounds like I hit a nerve and said some things that were disagreeable to you. Sorry to have caused you some stress! Definitely not my intention.

Certainly there’s lots to Pangeo like people, core OSS technology (like Dask/Xarray/Jupyter/Python/C), and also deployment technologies. You’re right that my post is mostly focused on the following:

  1. The deployment technology part of Pangeo as I see it
  2. A solicitation for people to help better inform me about their current pain and technology aspirations, so that I in my current position can best make things that will be of value to the people here.

Giving context and soliciting pain like this is a critical part of how I make software.

Of course I totally support other people in also building other things! Building things is great! Please go and build lots of things! I encourage people to interpret my post not as “this is the entirety of Pangeo” but rather as a request for people to talk to me about where they currently find pain and what they would value, so that I might be able to use that information to be more productive.

Again, so sorry if my sharing my perspective rubbed you the wrong way!

I hope that you have a pleasant weekend.

4 Likes

Thanks, Matt and Carl, for sharing your thought-provoking ideas! There’s a lot that could be discussed from these posts, but I wanted to focus on sharing two thoughts which have been informed by recent discussions with other people (as most thoughts are):

  • I worry that the statement that open source cannot move fast enough could become a self-fulfilling prophecy. I believe that it would be best for scientific computing if we avoid that. To me, and I think to many others, Pangeo is foremost a community rather than a software stack. This community has driven tremendous progress because we share our pain points in the open and connect the people experiencing those pains with the people who can solve them. If we cut out those open channels, it would be natural for the open source ecosystem to fall behind and no longer provide the solutions that people need. While of course people can engage with this forum and Pangeo more broadly however they want within the confines of the code of conduct, I would encourage people to continue to share their pain points not only in private emails and SaaS logs, but also on the community Discourses and the GitHub issues of open source projects. As Matt said, people should build things! More openness will allow more people to build more and better things!
  • On the infrastructure component in particular, I am hopeful that we’re moving towards an ecosystem where there are options that match individual needs and use-cases, which may involve SaaS, managed open source infrastructure, self-hosted open source infrastructure, or completely DIY solutions built on either commercial or open clouds. IMO Pangeo’s role can be to help people understand and decide based on the factors most important to them, which could be the architecture or it could be the development model. We also are in a really exciting place in which the community includes people building open source tools, VC-backed companies, and presumably much more, which I think makes Pangeo particularly well suited to provide this kind of resource. It’s probably less likely today than it was in 2017 that people will be building the nitty-gritty solutions with their Pangeo hats on, but the role of the community may be even more important now as a meeting ground (said as someone who wasn’t involved in 2017, so grains of salt abound :smile:)

Similarly, hope y’all have a great weekend and thanks for sharing your thoughts!

7 Likes

:100: Open source is clearly very powerful and something that I encourage everyone to engage in, both in writing code, and also in communicating about issues. We do a ton of this and have found really great outcomes as a result.

:+1: also on Pangeo being a place where lots of solutions can thrive and people can go to better understand which solution matches their situation and needs and values.

2 Likes

Thanks Matt and others for laying out your thoughts!

My hot take is that the elephant in the room is the complexity of using a distributed system to divide work across multiple machines. Almost everything we still struggle with is caused by this.

Xarray seems fine. Users mostly seem to find that it works for them. Of course we can improve, especially by adding features (cough flexible indexes cough) and making it easier to use (cough xr.apply_ufunc cough), but mostly it does what it says it will do for you. (@cboettig I think this is why Matt’s post doesn’t really mention Xarray!)

Zarr seems fine too. The main complaints from zarr users seem to be asking for more features and faster development, not that it doesn’t scale in the way it promises to. (Shoutout to Earthmover for seriously improving the rate of Zarr development.)


The middle layer is where all the complexity is. By complexity I mean: conceptual complexity, layers of software architecture, lines of code (dask.distributed and dask each have very roughly the same number of LoC as Xarray), as well as complexity in the "complex systems" sense (i.e. unpredictable non-linear behaviour).

If you only want to run on one machine you have many deployment options (Google Colab, Coiled Notebooks, JupyterHub without Dask). These work totally fine with just Xarray and NumPy lazily accessing a subset of a Zarr store.
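For example, a minimal sketch of that single-machine pattern (the store URL and variable name are hypothetical, and s3:// access assumes s3fs is installed):

```python
import xarray as xr

# Hypothetical public Zarr store; only metadata is read at open time.
# chunks=None keeps plain lazy NumPy-backed access (no Dask involved).
ds = xr.open_zarr(
    "s3://example-bucket/era5-demo.zarr", consolidated=True, chunks=None
)

# Select a small window and reduce it; only the bytes covering this
# subset are fetched from object storage when values are actually needed.
monthly_mean = ds["t2m"].sel(time="2020-01").mean("time")
print(monthly_mean.values)
```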

As soon as you try to split your computation across machines, the stack starts to look like mere demo-ware. Specifically I see 3 main issues:

  1. Abstracting away parallelism (i.e. chunking and algorithm choice) from the user,
  2. Deployment hell (i.e. Kubernetes),
  3. Running an opaque distributed system.

Dask is incredibly complex. It literally aims to run any distributed computation for any data structure on any architecture. "Distributed system" in computer science is already a synonym for "irreducibly really f'ing hard", and dask's design aims even higher by trying to solve multiple distributed workloads in one system (e.g. dataframes, arrays, embarrassingly parallel maps).

“Dask” also means multiple things. Going top-down:

  • dask.array interface that xarray interacts with,
  • all the other interfaces that the Pangeo crowd don’t normally use, e.g. dataframe, bag, delayed,
  • dask-expr query optimization layer / HLGs,
  • The actual low-level DAG representation,
  • Environment management via cloudpickling dependencies,
  • dask.distributed scheduler (remember this repo alone is as big as xarray),
  • OSS deployment tooling, e.g. dask-gateway (RIP?), dask-jobqueue
  • Coiled.

This complexity is also a lot of what's hard about deployments. One of the hard parts of managing JupyterHubs is managing Dask on Kubernetes.


Matt is arguing that selling people managed dask clusters on demand can deal with this complexity for them. I don't disagree! I think that is a good and valid option to have available, and it has precedent: in the tabular world you can pay many companies to help you manage your Spark/Hadoop clusters. That will work for many people.

But if we’re talking about “Pangeo 2.0”, I am unconvinced that there isn’t some other architecture for this layer that has less incidental complexity.

  • Stephan seems to think that the original sin was (1), and the solution is to force the user to be more explicit about how their computation gets parallelized - which his project Xarray-Beam does.

  • Cubed brings a lot of new ideas. It specifically does not ship a distributed system or a deployment - however you run it, it always delegates that part to another system (be it S3/Lambda, Modal, Apache Beam, or even back to dask.delayed / Coiled Functions). It also was designed to address many of the same issues raised in Matt's post (see the sketch after this list):

    • no kubernetes - running Cubed on Lithops on Lambda/S3 is another example of a “Raw-Cloud Architecture”,
    • configuration (i.e. “what is a cloud auto-scaler”) - serverless deployment is truly and transparently elastic,
    • running out of memory - Cubed explicitly tracks per-task memory usage.
  • Dask's design could improve. I've been following all the great work on dask expressions, algorithm improvements, P2P rechunking, etc. This could make (and already has made) a big difference to the user experience. But what I find ironic is that the ideal design for dask.Array (the part I care about) looks a lot like Cubed (Cubed vs dask.array: Convergent evolution? · Issue #570 · cubed-dev/cubed · GitHub).
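To make the Cubed point above concrete, here is a rough sketch of what that middle layer looks like from the user side. The bucket name and sizes are hypothetical, and I'm going from Cubed's documented array-API-style interface, so treat this as illustrative rather than exact:

```python
import cubed
import cubed.array_api as xp

# Hypothetical bucket for intermediate results; allowed_mem is the
# per-task memory budget that Cubed tracks explicitly.
spec = cubed.Spec(work_dir="s3://example-bucket/cubed-tmp", allowed_mem="2GB")

a = xp.ones((20000, 20000), chunks=(2000, 2000), spec=spec)
b = xp.sum(a + 1)

# With no executor configured this runs locally; pointing it at a
# serverless backend (e.g. Lithops on AWS Lambda) is a configuration
# choice rather than a code change.
result = b.compute()
```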

I don’t know what the best solution is (these alternatives have their own issues, and I am far from a distributed systems expert), but I made the Pangeo Distributed Arrays Working Group to find out if there are other possibilities beyond “just pay the dask experts to run dask”.

Generally I agree with @cboettig that serverless is a rich vein, and also with @maxrjones that there are and should be multiple options.

If we cut out those open channels, it would be natural for the open source ecosystem to fall behind and no longer provide the solutions that people need.

This is also an extremely important point.

3 Likes

Some other comments that didn’t fit into my hot take above:

“Zarr seems cool, but we’re still using NetCDF.”

Kerchunk / VirtualiZarr are a big step forward here IMO. The main disadvantages are not being able to change chunks (i.e. (1) above) and being limited by Zarr’s current feature set (e.g. no variable-length chunks yet).
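A rough sketch of that virtual-reference workflow (file names are hypothetical, and exact arguments may differ between VirtualiZarr versions):

```python
import xarray as xr
from virtualizarr import open_virtual_dataset

# Hypothetical NetCDF files on object storage; only metadata and byte
# ranges are read, the array data itself is never copied or rewritten.
vds_2020 = open_virtual_dataset("s3://example-bucket/model_2020.nc")
vds_2021 = open_virtual_dataset("s3://example-bucket/model_2021.nc")

# Combine the virtual datasets with the normal xarray API, then write
# Kerchunk-style references that can be opened as one logical Zarr store.
combined = xr.concat(
    [vds_2020, vds_2021], dim="time", coords="minimal", compat="override"
)
combined.virtualize.to_kerchunk("combined_refs.json", format="json")
```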

“Also, my data keeps updating, can Zarr expand like this?”

Earthmover are on this.

For scientific users: IMO we have not bridged HPC and Cloud yet. It's kind of wild to me that we have such a good idea with cloud-optimised access to large datasets, but the recommended way for scientists to access crucial large datasets sitting on HPC machines is to use Globus / the AWS CLI to schlep it all over to the public cloud?! (Globus wasn't even designed for this, and you have to set up extra paid subscriptions to do it.)

I want someone who understands these things better than me to tell me why we can’t just expose the data stored in HPC centres via a public object-store interface, so users can treat data in HPC centres as if they were just in another cloud region.
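To make that concrete, here's roughly what it could look like from the user side if a centre exposed an S3-compatible gateway (the endpoint and bucket names are entirely hypothetical):

```python
import s3fs
import xarray as xr

# Hypothetical endpoint: an HPC centre exposing part of its parallel
# filesystem through an S3-compatible gateway such as MinIO or Ceph.
fs = s3fs.S3FileSystem(
    anon=True,
    client_kwargs={"endpoint_url": "https://data.hpc-centre.example.edu"},
)

# From the user's point of view this is just another object store / region.
store = fs.get_mapper("public-datasets/cmip6-demo.zarr")
ds = xr.open_zarr(store, consolidated=True)
```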

4 Likes

Yup, distributed computing generally adds complexity in return for performance (or at least potential performance). Distributed computing is definitely a bargain with the devil.

Tom I totally support you pursuing Cubed, as I support everyone in pursuing their own direction. As I mentioned before, I support people building things and I support lots of options.

I also want to make it clear that my post is just me sharing my subjective experience (I think the language there is pretty clearly subjective where appropriate, but please let me know if I missed something). I hope that my sharing my experience and perspective isn’t offensive to anyone. If so, I apologize.

I encourage folks who disagree with my perspective to respond by building better things! This is a space where lots of different thinking and work is likely to result in a robust and high quality set of options for users, which I think is the goal that we all share.

We’ve seen lots of building in the last few years and how users have responded to those built things. Based on my observations I’ve laid out a new path for me and people around me to chart. If people don’t like the path I’ve laid out for myself that’s ok! You don’t have to follow it. Instead, you can forge your own path!

You might want to do a lot of user research, see where people are struggling, etc… I’ve found that an intentional process around listening to lots of people outside my direct circle is always a great place to start whenever I start building something new.

1 Like

Thanks Matt. I found your post very interesting and your perspective useful! I’m just also sharing mine. Let’s continue to build in the open and have these discussions.

YES! Note that HPC centers are already doing precisely this. Look at https://www.openstoragenetwork.org/ . Many HPC centers are exposing S3-compatible object storage using open source systems like Red Hat Ceph or MinIO (https://min.io/ – aside, but we only have to glance at all the companies using MinIO to see that open source can absolutely move fast and scale, but we already knew that).

Same deal with compute too, e.g. I mentioned https://nationalresearchplatform.org which is providing both interactive JupyterHub and containerized/batch (Coiled-style) computing across a network of university-based HPCs. Naturally this uses Kubernetes under the hood. I agree with Matt that k8s is really complex and all those abstractions make it inaccessible for most researchers to deploy on their own, but here there are professional full-time devs being paid to maintain, support, and scale it on HPC, perhaps similarly to the way 2i2c has been able to sustainably provide this on commercial cloud. I'm sure Matt was right that those were historically important challenges, but far from k8s being a dead end, it appears to be a gateway that continues to expand the reach of Pangeo. I really applaud Matt and co for opening the door on k8s so early in the creation of Pangeo, when I can only imagine k8s was less mature and even harder to use than it is now.

I do agree that HPC has been slow to pivot here, but I think these questions about the rate of change in communities are more about the social process than about the relative speed of different software models. However, I'm not sure faster is always better – there is a lot of fantastic back-and-forth between engineers and scientists/researchers in the process. One example: Matt's very accurate complaint about tying a JupyterHub to a single Docker image was definitely an issue, but it appears to be a thing of the past. Most JupyterHubs I see present the user with rich menus of preset containers or allow users to bring their own. (Integrating BinderHub with JupyterHub: Empowering users to manage their own environments | 2i2c)

2 Likes

This is a fantastic conversation!

Great point.

And it’s worth noting that we’re moving into a world where a single machine can have hundreds of CPU cores, and a network interface card capable of sustaining hundreds of gigabits per second to cloud object storage.

I’m personally really excited to see how far we can push xarray on a single machine over the next few years.

At 200 Gbps, it should be possible to read 1 terabyte in 1 minute. Or read 1 petabyte in 14 hours. From a single machine. (Although there’s probably some software engineering to be done to achieve these theoretically-possible speeds. Every memcpy and system call matters if you want to sustain 200 Gbps on a single machine without using all your CPU cycles just to drive the IO).
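A quick back-of-envelope check of those figures (pure line-rate arithmetic; the quoted numbers above leave some headroom for protocol and software overhead):

```python
# 200 Gbps line rate, ignoring protocol overhead.
bytes_per_second = 200e9 / 8                       # 25 GB/s

seconds_per_tb = 1e12 / bytes_per_second           # ~40 s per terabyte
hours_per_pb = 1e15 / bytes_per_second / 3600      # ~11 h per petabyte at full rate

print(f"{seconds_per_tb:.0f} s per TB, {hours_per_pb:.1f} h per PB")
```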

(See the AnyBlob paper for how to max-out a 100 Gbps network link to cloud object storage)

3 Likes

Thanks @jack_kelly ! Any stack which can allow users to get medium-scale analysis done without the complexity of a fully distributed system would be a huge win. You’re right that we could potentially do that with serverless services or single-machine performance. You will be interested in Cubed for larger-than-memory workloads on a single machine · Issue #492 · cubed-dev/cubed · GitHub .

2 Likes

Thanks a lot @mrocklin for this post, and all the others here for this awesome discussion! Just wanted to drop a few thoughts on the post and/or the discussion (maybe from a too French/European point of view):

  • HPC: HPC centres are PaaS, almost SaaS, and they already address most of the points called out as terrible in Matt's article about the Pangeo cloud stack. True, they have their own terrible traits, but if you have access to one of them with the Pangeo stack deployed on it, it can be a great help. And from what I see around me, all major platforms are taking the Pangeo turn, even if it takes some time for those big infrastructures. As @TomNicholas mentions, the next step is to make them expose their data via a public (object-store) interface.
  • Opening every organization's data to the outside world, to the cloud, is happening. But as someone who's been trying to push this for several years in the French space agency: it's slow and difficult… It needs willingness from the upper management layers and navigating security concerns, even though it is technically entirely feasible. There are intermediate solutions to this, like the OpenEO initiative, but I still think that raw data should be accessible to anyone in an ARCO format.
  • In the European world, and probably elsewhere too (but I think not as much), there is a lot of fragmentation: HPC centers, cloud providers, data platforms and providers, data formats, etc… Try to understand what EOSC (European Open Science Cloud) is made of… We won't change this easily, so what I want here is to be able to run my algorithm in any of these environments. Something like Coiled might help, but where would the compute run? Will I be able to get the data I want when running in a public cloud? For now, the only solution is to push a common infrastructure and software stack, as common as possible, and to move Pangeo-enabled containers back and forth. OpenEO, here again, is another approach, for other use cases. But I'd typically be glad if Coiled's environment synchronization feature were a reusable open source component :slight_smile: !
  • The point above also drives things toward curated environments; it takes a real power user to develop things across multiple European platforms!
  • Arraylake is fantastic, but I don't see how it can fit with satellite datasets like Sentinel and the like for now. Anyway, as mentioned elsewhere, Zarr will probably be the next format for Sentinel products, but not as a global array (unfortunately?). And as a per-product format, it has its drawbacks…

So to sum it up, Coiled, 2i2c, Earthmover, and all those companies' products and developments are great; they solve some problems, but I don't see how they can solve them all. And anyway, your feedback is already a great help for the community!

Trying also to answer the final questions in Matt's post:

  • What key activities and pain points are people looking to solve today? What features do you need?
    • Easy interoperability with cloud platforms (in a broad sense). Can we do better than Docker/Repo2Docker?
    • Dask / Xarray features, certainly: better map_overlap, easy Xarray reprojections, etc…
    • Batch jobs / workflow orchestration? Yep, we always end up doing this at some point: Argo, Prefect, and the like. And for batch jobs on the cloud, I'm not sure there is a Slurm equivalent.
  • How do geo-specific organizations buy software? (I'm not in a geo-specific organization, but we do a lot of that)
    • We used to buy licenses, like Matlab or more specialized tools; that's ending. We used to build everything ourselves; I hope that will soon end too. I hope we will finally contribute to open source instead of writing something else because one of the many things we wanted didn't work in an otherwise mature open source library…
    • We have a good deal of things on-prem, but it's not required per se. We've got a specialized team, so it's practical (we have a lot of scientist users, not power users). In more and more cases we are turning to the cloud, and we could perfectly well use SaaS if it proved cost-effective and practical enough. Still need to understand how to pay for it though :grin:
    • Too many people take part in these decisions: high- and mid-level management, security, low-level teams, national government… For small projects it is easy to do as you want; for bigger things, there is even the political question.
    • Science users can have a big impact. There just need to be a lot of them, weighing in the same direction.
2 Likes