Thanks Matt and others for laying out your thoughts!
My hot take is that the elephant in the room is the complexity of using a distributed system to divide work across multiple machines. Almost everything we still struggle with is caused by this.
Xarray seems fine. Users mostly seem to find that it works for them. Of course we can improve, especially by adding features (cough flexible indexes cough) and making it easier to use (cough `xr.apply_ufunc` cough), but mostly it does what it says it will do for you. (@cboettig I think this is why Matt’s post doesn’t really mention Xarray!)
Zarr seems fine too. The main complaints from zarr users seem to be asking for more features and faster development, not that it doesn’t scale in the way it promises to. (Shoutout to Earthmover for seriously improving the rate of Zarr development.)
The middle layer is where all the complexity is. By complexity I mean: conceptual complexity, layers of software architecture, lines of code (`dask.distributed` and `dask` each have very roughly the same number of LoC as Xarray), as well as complexity in the “complex system” sense (i.e. unpredictable non-linear behaviour).
If you only want to run on one machine you have many deployment options (Google Colab, Coiled Notebooks, JupyterHub without dask). These work totally fine with just Xarray and numpy accessing a subset of a Zarr store lazily.
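For example, here is a minimal sketch of that single-machine path (the store path and variable names are hypothetical). With `chunks=None`, xarray uses its own lazy indexing rather than dask, so only the Zarr chunks overlapping the selection ever get read:

```python
import xarray as xr

# Hypothetical Zarr store and variable names, shown only to illustrate the pattern.
# chunks=None means plain lazy (non-dask) arrays; no cluster involved.
ds = xr.open_zarr("s3://example-bucket/dataset.zarr", chunks=None)

# Selection stays lazy; only the Zarr chunks overlapping it are fetched on read.
subset = ds["temperature"].sel(time="2020-01")

# This loads just that subset into a numpy array and reduces it locally.
monthly_mean = subset.mean(dim="time").values
```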
As soon as you try to split your computation across machines, the stack starts to look like mere demo-ware. Specifically I see three main issues:
- Abstracting away parallelism (i.e. chunking and algorithm choice) from the user,
- Deployment hell (i.e. Kubernetes),
- Running an opaque distributed system.
Dask is incredibly complex. It literally aims to run any distributed computation for any data structure on any architecture. “Distributed System” in computer science is already a synonym for “irreducibly really f’ing hard”, and dask’s design aims even higher by trying to solve multiple distributed workloads in one system (e.g. dataframes, arrays, embarrassingly-parallel maps).
“Dask” also means multiple things. Going top-down (a small sketch of how these layers show up in user code follows this list):
- The `dask.array` interface that xarray interacts with,
- All the other interfaces that the Pangeo crowd don’t normally use, e.g. `dataframe`, `bag`, `delayed`,
- The `dask-expr` query optimization layer / HLGs,
- The actual low-level DAG representation,
- Environment management via cloudpickling dependencies,
- The `dask.distributed` scheduler (remember this repo alone is as big as xarray),
- OSS deployment tooling, e.g. `dask-gateway` (RIP?), `dask-jobqueue`,
- Coiled.
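To make the layering concrete, here is a small sketch of how much of that stack a seemingly trivial computation touches: the `dask.array` collection API builds a task graph, the graph and its dependencies get serialized, and the `dask.distributed` scheduler executes it (locally here, but usually on Kubernetes/Coiled/etc. in production).

```python
import dask.array as da
from dask.distributed import Client

# A local cluster for illustration; in production this is typically
# Kubernetes, dask-gateway, dask-jobqueue, or Coiled.
client = Client()

# The dask.array API builds a low-level task graph...
x = da.random.random((20_000, 20_000), chunks=(2_000, 2_000))
centred = x - x.mean(axis=0)

# ...which the dask.distributed scheduler then ships to workers and executes.
result = centred.std().compute()
```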
This complexity is also a large part of what makes deployments hard. One of the hard parts of managing JupyterHubs is managing dask on Kubernetes.
Matt is arguing that selling people managed dask clusters on demand can deal with this complexity for them. I don’t disagree! I think that is a good and valid option to have available, and it has precedent: in the tabular world you can pay many companies to help you manage your Spark/Hadoop clusters. That will work for many people.
But if we’re talking about “Pangeo 2.0”, I am unconvinced that there isn’t some other architecture for this layer that has less incidental complexity.
- Stephan seems to think that the original sin was (1), and the solution is to force the user to be more explicit about how their computation gets parallelized - which his project Xarray-Beam does.
- Cubed brings a lot of new ideas. It specifically does not ship a distributed system or a deployment: however you run it, it always delegates that part to another system (be it S3/Lambda, Modal, Apache Beam, or even back to `dask.delayed` / Coiled Functions). It was also designed to address many of the same issues raised in Matt’s post (a minimal sketch of the model follows this list):
  - no Kubernetes: running Cubed on Lithops on Lambda/S3 is another example of a “Raw-Cloud Architecture”,
  - configuration (i.e. “what is a cloud auto-scaler?”): serverless deployment is truly and transparently elastic,
  - running out of memory: Cubed explicitly tracks per-task memory usage.
- Dask’s design could improve. I’ve been following all the great work on dask expressions, algorithm improvements, P2P rechunking etc. This could make (and already has made) a big difference to user experience. But what I find ironic is that the ideal design for `dask.array` (the part I care about) looks a lot like Cubed (Cubed vs dask.array: Convergent evolution? · Issue #570 · cubed-dev/cubed · GitHub).
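For a flavour of the Cubed model, here is a minimal sketch based on my reading of its docs (the bucket path and memory limit are placeholders): intermediate arrays are persisted to the storage named by `work_dir`, and `allowed_mem` bounds what any single task may use, which is what lets Cubed reason about per-task memory up front.

```python
import cubed
import cubed.array_api as xp

# Placeholder work_dir and memory budget.
spec = cubed.Spec(work_dir="s3://example-bucket/cubed-tmp", allowed_mem="2GB")

# Operations build a plan of chunked tasks; nothing runs yet.
a = xp.ones((20_000, 20_000), chunks=(2_000, 2_000), spec=spec)
b = xp.add(a, 1)

# Execution is delegated to whichever executor is configured
# (Lithops on Lambda/S3, Modal, Beam, dask.delayed, ...).
result = xp.sum(b).compute()
```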
I don’t know what the best solution is (these alternatives have their own issues, and I am far from a distributed systems expert), but I made the Pangeo Distributed Arrays Working Group to find out if there are other possibilities beyond “just pay the dask experts to run dask”.
Generally I agree with @cboettig that serverless is a rich vein, and also with @maxrjones that there are and should be multiple options.
> If we cut out those open channels, it would be natural for the open source ecosystem to fall behind and no longer provide the solutions that people need.

This is also an extremely important point.