What's Next - Software - Massive Scale

Thank you, Matt, for the excellent summary of the current situation.

I want to add a few more perspectives, then make a suggestion.

(@dcherian might have more thoughts though)

Xarray perspective

Ideally xarray would not have to care about any of this: it just opens things using a backend someone else wrote and puts them into (chunked) arrays that it treats as basically equivalent (via the array API standard).
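
To make that concrete, here's a minimal sketch of what that equivalence looks like from xarray's side: the same wrapper code works whether the underlying array is an in-memory NumPy array or a chunked dask array (and, with the cubed-xarray entrypoint installed, a Cubed array). The shapes and dims here are just illustrative.

```python
import numpy as np
import dask.array as da
import xarray as xr

# xarray wraps whatever duck array you hand it; the wrapper-level code
# doesn't care which chunked-array library is underneath.
eager = xr.DataArray(np.ones((4, 6)), dims=["x", "y"])
lazy = xr.DataArray(da.ones((4, 6), chunks=(2, 3)), dims=["x", "y"])

# Operations dispatch to the underlying array type.
print((eager * 2).sum())           # computed eagerly with NumPy
print((lazy * 2).sum().compute())  # built lazily, then evaluated by dask
# With the cubed-xarray package installed, the same pattern applies to Cubed arrays.
```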

open_mfdataset is a notable exception, and really doesn’t scale well :frowning_face: Not building indexes upon opening might help, though. We would love help improving this, but this inefficiency is why we currently recommend opening a single Zarr store if you can. Unfortunately that recommendation is poorly documented, and usually requires using kerchunk, which is powerful but has usability issues.
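
For concreteness, here is a sketch of the two access patterns (the paths are placeholders): open_mfdataset with the kwargs from the xarray docs' multi-file performance tips that skip most of the per-file comparison work, versus opening a single (possibly kerchunk-generated) Zarr store in one call.

```python
import xarray as xr

# Placeholder paths, for illustration only.
nc_files = "data/output_*.nc"    # many netCDF files
zarr_store = "data/output.zarr"  # the same data as one Zarr store

# open_mfdataset: convenient, but slow at scale. These kwargs skip most of
# the per-file alignment and equality checking:
ds_multi = xr.open_mfdataset(
    nc_files,
    parallel=True,
    data_vars="minimal",
    coords="minimal",
    compat="override",
)

# The current recommendation when you can: one Zarr store (native, or a
# kerchunk reference dataset), opened lazily in a single call.
ds_single = xr.open_dataset(zarr_store, engine="zarr", chunks={})
```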

Note there has also recently been a flurry (1) of (2) PRs (3) removing inefficiencies in the way xarray communicates with Zarr (mostly thanks to the EarthMover team).

Cubed perspective

Some things might be of interest to both Cubed and Dask (please add any more you can think of @tomwhite):

  • Query optimization - Array-level query optimization is fascinating. Can it be done in a way that benefits any distributed array backend, not just dask.array? I also really like the idea of optimizations being able to push all the way up to reading Zarr summary statistics.
  • Bounded-memory rechunking - This is both a big feature and a big potential drawback of Cubed, and it would be great to have more cross-pollination of ideas here (see the sketch after this list). I’ve been asking the dask devs to give us a talk on P2P rechunking for a while!
  • Cloud object store IO speed - You said elsewhere that performance mostly ends up being about reading from S3 as fast as possible. There’s some discussion of new approaches to improving that here, but if we’re all interested in it then we should team up!
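
As a point of comparison for the rechunking bullet above, here is a minimal sketch of the two approaches; the array sizes, the config key usage, and the Cubed spec values are illustrative assumptions, not recommendations. Dask's P2P rechunking is opt-in via config and needs a distributed cluster; Cubed instead makes the memory budget explicit up front.

```python
import dask
import dask.array as da
import cubed

# Dask: opt in to P2P (peer-to-peer) rechunking, which streams shards
# between workers on a distributed cluster instead of building a huge
# intermediate task graph. Nothing is computed here; the graph is lazy.
dask.config.set({"array.rechunk.method": "p2p"})
x = da.random.random((50_000, 50_000), chunks=(10_000, 100))
y = x.rechunk((100, 10_000))

# Cubed: every operation, rechunking included, has to fit within the
# declared memory budget, using the intermediate store in work_dir.
# The values here are placeholders.
spec = cubed.Spec(work_dir="tmp", allowed_mem="2GB")
```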

I would say task ordering is not relevant to Cubed.

Distributed Arrays Working Group perspective

The main shared interest in the Pangeo Distributed Arrays Working Group is

  • Benchmarking - We are years overdue to build this benchmarking suite for Pangeo. In the distributed arrays group we’ve also wanted to do this for a while, but only got as far as listing some workloads of interest in pangeo-data/distributed-array-examples on GitHub. I would love to see a shared set of benchmarks, and running some of them regularly on Coiled would be awesome. Ideally, though, we would also run them on more general infrastructure, maybe with some kind of (GH-Actions-based?) setup for running both on Coiled and on other platforms (e.g. Cubed on AWS Lambda). A rough sketch of what a backend-agnostic benchmark could look like follows below.
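
Here is that sketch. The workload, sizes, and chunking are toy placeholders rather than proposed benchmarks, and it assumes Cubed's array-API creation functions accept a chunks= keyword as in its README examples; the point is just that the same function can be timed against dask.array and against Cubed's array-API namespace, with the surrounding harness being the part that would run on Coiled, Lambda, etc.

```python
import time

def anomaly_workload(xp, shape=(4_000, 4_000), chunks=(1_000, 1_000)):
    """Toy stand-in for the workloads in pangeo-data/distributed-array-examples."""
    a = xp.ones(shape, chunks=chunks)
    anom = a - xp.mean(a, axis=0, keepdims=True)
    return xp.sum(anom)

def bench(name, xp):
    start = time.perf_counter()
    result = anomaly_workload(xp).compute()  # both dask and cubed arrays expose .compute()
    print(f"{name}: {time.perf_counter() - start:.2f}s (result={result})")

import dask.array as da
bench("dask", da)

import cubed.array_api as cxp  # assumes cubed is installed; uses its default executor
bench("cubed", cxp)
```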

That group, however, has mostly become just me, Tom, and @keewis chatting, though we have recently had some renewed interest from the Arkouda folks (by email).

Suggestion - Merge Groups

Can we just merge the people interested in “Massive Scale” into the Distributed Arrays Working Group? Your name is way catchier :grin: And maybe rejig the schedule + expand the scope a bit to bring all these concerns under one tent?
