What's Next - Software - Massive Scale

Thank you, Matt, for the excellent summary of the current situation.

I want to add a few more perspectives, then make a suggestion.

(@dcherian might have more thoughts though)

Xarray perspective

Ideally xarray would not have to care about any of this: it just opens things using a backend someone else wrote and puts them into (chunked) arrays that it treats as basically equivalent (via the array API standard).
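
To make that concrete, here's a minimal sketch of what that equivalence looks like from xarray's side: the same wrapper code works whether the underlying array is an in-memory NumPy array or a chunked dask array (and, with the cubed-xarray entrypoint installed, a Cubed array). The shapes and dims here are just illustrative.

```python
import numpy as np
import dask.array as da
import xarray as xr

# xarray wraps whatever duck array you hand it; the wrapper-level code
# doesn't care which chunked-array library is underneath.
eager = xr.DataArray(np.ones((4, 6)), dims=["x", "y"])
lazy = xr.DataArray(da.ones((4, 6), chunks=(2, 3)), dims=["x", "y"])

# Operations dispatch to the underlying array type.
print((eager * 2).sum())           # computed eagerly with NumPy
print((lazy * 2).sum().compute())  # built lazily, then evaluated by dask
# With the cubed-xarray package installed, the same pattern applies to Cubed arrays.
```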

open_mfdataset is a notable exception, and really doesn’t scale well :frowning_face: Not building indexes upon opening might help, though. We would love help improving this, but this inefficiency is why we currently recommend opening a single Zarr store if you can. Unfortunately that recommendation is poorly documented, and usually requires using kerchunk, which is powerful but has usability issues.
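
For concreteness, here is a sketch of the two access patterns (the paths are placeholders): open_mfdataset with the kwargs from the xarray docs' multi-file performance tips that skip most of the per-file comparison work, versus opening a single (possibly kerchunk-generated) Zarr store in one call.

```python
import xarray as xr

# Placeholder paths, for illustration only.
nc_files = "data/output_*.nc"    # many netCDF files
zarr_store = "data/output.zarr"  # the same data as one Zarr store

# open_mfdataset: convenient, but slow at scale. These kwargs skip most of
# the per-file alignment and equality checking:
ds_multi = xr.open_mfdataset(
    nc_files,
    parallel=True,
    data_vars="minimal",
    coords="minimal",
    compat="override",
)

# The current recommendation when you can: one Zarr store (native, or a
# kerchunk reference dataset), opened lazily in a single call.
ds_single = xr.open_dataset(zarr_store, engine="zarr", chunks={})
```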

Note there has also recently been a flurry (1) of (2) PRs (3) removing inefficiencies in the way xarray communicates with Zarr (mostly thanks to the EarthMover team).

Cubed perspective

Some things might be of interest to both Cubed and Dask (please add any more you can think of @tomwhite):

  • Query optimization - Array-level query optimization is fascinating. Can it be done in a way that benefits any distributed array backend, not just dask.array? I also really like the idea of optimizations being able to push all the way up to reading Zarr summary statistics.
  • Bounded-memory rechunking - This is both a big feature and a big potential drawback of Cubed, and it would be great to have more cross-pollination of ideas here (see the sketch after this list). I’ve been asking the dask devs to give us a talk on P2P rechunking for a while!
  • Cloud object store IO speed - You said elsewhere that performance mostly ends up being about reading from S3 as fast as possible. There’s some discussion of new approaches to improving that here, but if we’re all interested in it then we should team up!
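
As a point of comparison for the rechunking bullet above, here is a minimal sketch of the two approaches; the array sizes, the config key usage, and the Cubed spec values are illustrative assumptions, not recommendations. Dask's P2P rechunking is opt-in via config and needs a distributed cluster; Cubed instead makes the memory budget explicit up front.

```python
import dask
import dask.array as da
import cubed

# Dask: opt in to P2P (peer-to-peer) rechunking, which streams shards
# between workers on a distributed cluster instead of building a huge
# intermediate task graph. Nothing is computed here; the graph is lazy.
dask.config.set({"array.rechunk.method": "p2p"})
x = da.random.random((50_000, 50_000), chunks=(10_000, 100))
y = x.rechunk((100, 10_000))

# Cubed: every operation, rechunking included, has to fit within the
# declared memory budget, using the intermediate store in work_dir.
# The values here are placeholders.
spec = cubed.Spec(work_dir="tmp", allowed_mem="2GB")
```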

I would say task ordering is not relevant to Cubed.

Distributed Arrays Working Group perspective

The main shared interest in the Pangeo Distributed Arrays Working Group is

  • Benchmarking - We are years overdue to build this benchmarking suite for Pangeo. In the distributed arrays group we’ve also wanted to do this for a while, but only got as far as listing some workloads of interest in pangeo-data/distributed-array-examples on GitHub. I would love to see a shared set of benchmarks, and running some of them regularly on Coiled would be awesome. Ideally, though, we would also run them on more general infrastructure, maybe with some kind of (GH-Actions-based?) setup for running both on Coiled and on other platforms (e.g. Cubed on AWS Lambda). A rough sketch of what a backend-agnostic benchmark could look like follows below.
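
Here is that sketch. The workload, sizes, and chunking are toy placeholders rather than proposed benchmarks, and it assumes Cubed's array-API creation functions accept a chunks= keyword as in its README examples; the point is just that the same function can be timed against dask.array and against Cubed's array-API namespace, with the surrounding harness being the part that would run on Coiled, Lambda, etc.

```python
import time

def anomaly_workload(xp, shape=(4_000, 4_000), chunks=(1_000, 1_000)):
    """Toy stand-in for the workloads in pangeo-data/distributed-array-examples."""
    a = xp.ones(shape, chunks=chunks)
    anom = a - xp.mean(a, axis=0, keepdims=True)
    return xp.sum(anom)

def bench(name, xp):
    start = time.perf_counter()
    result = anomaly_workload(xp).compute()  # both dask and cubed arrays expose .compute()
    print(f"{name}: {time.perf_counter() - start:.2f}s (result={result})")

import dask.array as da
bench("dask", da)

import cubed.array_api as cxp  # assumes cubed is installed; uses its default executor
bench("cubed", cxp)
```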

That group, however, has mostly become just me, Tom, and @keewis chatting, though we have recently had some renewed interest from the Arkouda folks (by email).

Suggestion - Merge Groups

Can we just merge the people interested in “Massive Scale” into the Distributed Arrays Working Group? Your name is way catchier :grin: And maybe rejig the schedule + expand the scope a bit to bring all these concerns under one tent?
