What's Next - Software - Massive Scale

This is part of a follow-on conversation from the “What’s next for Pangeo” discussion that took place 2023-12-06. It is part of the overarching software topic.

Xarray seems to work well up to a certain scale, and then pain starts. Examples of pain:

  1. Cluster runs out of memory
  2. Way too many tasks and so things are sluggish
  3. Jobs take a long time to start
  4. Things take a long time, even though we think the job should be fast

Doing work above, say, the terabyte scale requires more expertise than is ideal. What are some good paths for us to address this problem? Some things that have been mentioned include:

  1. Improve Xarray algorithms that maybe weren’t designed for this scale
  2. Improve Dask array algorithms that weren’t designed for this scale
  3. Implement new array engines
    • Cubed is an example using serverless approaches

What can this group do?

  • Educate users about current efforts, and show potential developers where they can engage
  • Educate developers about user pain
  • Help better align development efforts with important usage patterns

Dask Array + Coiled Perspective

I’ll include some work happening on the Dask side, along with some attempts at larger workloads and what we’ve learned about Xarray along the way.

Recent Dask Work

There are a few efforts in Dask that I think tackle a lot of this problem:

  1. Task ordering: There have been many incremental improvements here that have, I think, done a lot of good in the last six months. The most recent one is here: [WIP] Dask order rewrite critical path by fjetter · Pull Request #10660 · dask/dask · GitHub

  2. P2P rechunking: This does rechunking operations in fixed space, and is also much faster. https://discourse.pangeo.io/t/rechunking-large-data-at-constant-memory-in-dask-experimental/3266
    This has been around for a while and is nice, but still needs some work to handle the full complexity of cases where it might be used. Fortunately it’s also under very active development. Probably we should ask @hendrikmakait to give an update on recent work. (See the config sketch after this list for how to opt in.)

  3. Query Optimization: This isn’t done for arrays yet, but is done for dataframes. The result is a system that rewrites your code to what you should have written. In dataframes this often provides an order-of-magnitude speedup (Dask is now faster than Spark in many cases). For arrays this would mean that probably we wouldn’t have to think about chunks any longer.

    This isn’t done. Maybe Coiled folks do it in Q2 2024, but honestly we could use some energy here if there are adventurous people who want to work on this (warning, the coding problems here are challenging).
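
To make item 2 concrete, here is a minimal sketch of opting into P2P rechunking. It assumes the `array.rechunk.method` configuration option available in recent dask/distributed releases; the array sizes and chunking are arbitrary, and this is a sketch rather than a recommended recipe.

```python
import dask
import dask.array as da
from distributed import Client

client = Client()  # P2P rechunking runs on the distributed scheduler

# Opt in to P2P rechunking ("tasks" is the classic algorithm). The config key
# below is the one recent dask versions expose; check your version if it errors.
with dask.config.set({"array.rechunk.method": "p2p"}):
    x = da.random.random((50_000, 50_000), chunks=(50_000, 500))
    # Flip the chunking from column-contiguous to row-contiguous -- the kind
    # of all-to-all rechunk that blows up memory with the task-based approach.
    y = x.rechunk((500, 50_000))
    y.sum().compute()
```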

Xarray could use some help too

When benchmarking some very large problems we find that often it’s not dask.array that is broken, but Xarray. For example, if you want to load 100,000 netcdf files with open_mfdataset, the way Xarray loads the data is somewhat broken. Actually even making 100,000 s3fs objects is somewhat broken.
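
As a concrete picture of the pattern that hurts, here is a hedged sketch of opening many remote netCDF files by hand. The bucket and prefix are made up, the slice keeps the demo small, and whether open_mfdataset accepts the file objects directly depends on the engine (h5netcdf handles file-like objects).

```python
import fsspec
import xarray as xr

fs = fsspec.filesystem("s3", anon=True)
paths = fs.glob("my-bucket/simulation/*.nc")  # imagine ~100,000 keys here

# Creating one s3fs file object per key, serially, is already a bottleneck
# at the 100,000-file scale -- before any bytes of data are read.
files = [fs.open(p) for p in paths[:100]]

ds = xr.open_mfdataset(files, engine="h5netcdf", combine="by_coords")
```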

High performance computing isn’t about doing one thing well, it’s about doing nothing poorly. The entire stack will need attention. This likely isn’t something that can be fixed in just Dask alone. Many projects will need to care about it.

Suggestion: benchmarking

In the Dask+Coiled team we’ve found great value from benchmarking. It helps us to make decisions and identify bottlenecks effectively. This is what has led to a 10x performance boost in dask.dataframe in the last year. We should do something similar for Pangeo. We already do a bit of this (see https://github.com/coiled/benchmarks/blob/main/tests/benchmarks/test_zarr.py) but it’s nascent. We could use much larger examples, like Processing a 250 TB dataset with Coiled, Dask, and Xarray — Coiled Blog
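
As a strawman for the kind of test such a suite could grow (loosely in the spirit of the test_zarr.py linked above, but not copied from it), here is a minimal sketch; the shapes, chunking, and reduction are arbitrary stand-ins for real Pangeo workloads.

```python
import numpy as np
import pytest
import xarray as xr


@pytest.mark.parametrize("chunk_size", [100, 500])
def test_zarr_roundtrip(tmp_path, chunk_size):
    """Write a chunked dataset to Zarr and read it back, reducing over time.

    A stand-in for the workloads we'd want to benchmark at much larger scale
    (anomalies, rechunking, climatologies, ...).
    """
    ds = xr.Dataset(
        {"temp": (("time", "x", "y"), np.random.rand(1_000, 100, 100))}
    ).chunk({"time": chunk_size})

    store = tmp_path / "bench.zarr"
    ds.to_zarr(store, mode="w")

    result = xr.open_zarr(store)["temp"].mean("time").compute()
    assert result.shape == (100, 100)
```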

This requires people in this community to step up I think. We (Coiled) are happy to host and run large-scale benchmarks regularly, but we need help in determining what they should be. I’ll propose this as a good activity during AGU.


Thank you Matt for the extremely good summary of the current situation.

I want to add a few more perspectives, then make a suggestion.

(@dcherian might have more thoughts though)

Xarray perspective

Ideally xarray would not have to care about any of this: it just opens things using a backend someone else wrote and puts them in (chunked) arrays that it sees as basically equivalent (via the array API standard).

open_mfdataset is a notable exception, and really doesn’t scale well :frowning_face: Not building indexes upon opening might help though. We would love help in improving this, but this inefficiency is why we currently recommend opening a single zarr store if you can. Unfortunately that recommendation is poorly-documented, and usually requires using kerchunk, which is powerful but has usability issues.
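
To make that recommendation concrete, here is a hedged sketch of the two access patterns. The bucket, store, and reference-file names are made up, and the kerchunk incantation shown is the commonly documented fsspec "reference://" pattern, which is exactly the kind of usability issue mentioned above.

```python
import xarray as xr

# Preferred today: one consolidated Zarr store, opened lazily.
ds = xr.open_zarr("s3://my-bucket/my-dataset.zarr", consolidated=True)

# The kerchunk route: a pre-generated reference JSON pointing at many netCDF
# files, exposed to xarray as if it were a Zarr store. Powerful, but easy to
# get wrong.
ds_ref = xr.open_dataset(
    "reference://",
    engine="zarr",
    backend_kwargs={
        "consolidated": False,
        "storage_options": {
            "fo": "s3://my-bucket/combined-refs.json",  # hypothetical reference file
            "remote_protocol": "s3",
        },
    },
)
```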

Note there has also been a flurry (1) of (2) PRs (3) to remove inefficiencies in the way xarray communicates with Zarr recently (mostly thanks to the EarthMover team).

Cubed perspective

Some things might be of interest to both Cubed and Dask (please add any more you can think of @tomwhite):

  • Query optimization - Array-level query optimization is fascinating. Can it be done in a way that benefits any distributed array backend, not just dask.array? I also really like the idea of optimizations being able to push all the way up to reading Zarr summary statistics.
  • Bounded-memory rechunking - This is both a big feature and a big potential drawback of Cubed, and it would be great to have more cross-pollination of ideas here. I’ve been asking for the dask devs to give us a talk on the P2P rechunking for a while!
  • Cloud object store IO speed - You said elsewhere that the performance mostly ends up being about reading from S3 as fast as possible. There’s some discussion of new approaches to improving that here, but if we’re all interested in it then we should team up! (See the sketch after this list.)
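
On the object-store IO point, one useful baseline is how fast plain fsspec/s3fs can pull many objects concurrently, since that is roughly the ceiling any array engine's reader can hit. A sketch with a made-up bucket and prefix:

```python
import fsspec

fs = fsspec.filesystem("s3", anon=True)

# List many small objects under a (hypothetical) prefix.
keys = fs.ls("my-bucket/era5-chunks/")[:1_000]

# fs.cat with a list of paths fetches them concurrently (async under the hood
# for s3fs), which is the throughput any chunked reader should aim to match.
blobs = fs.cat(keys)  # dict: path -> bytes
total_bytes = sum(len(b) for b in blobs.values())
print(f"fetched {total_bytes / 1e9:.2f} GB across {len(blobs)} objects")
```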

I would say task ordering is not relevant to Cubed.

Distributed Arrays Working Group perspective

The major shared interest in the Pangeo Distributed Arrays Working Group is

  • Benchmarking - We are years overdue to build this benchmarking suite for Pangeo, and in the distributed arrays group we’ve also been wanting to do this for a while, but only got as far as listing some workloads of interest in GitHub - pangeo-data/distributed-array-examples. I would love to see a shared set of benchmarks, and running some regularly on Coiled would be awesome. However, ideally we would also run these on more general infrastructure, maybe with some (GH-Actions-based?) machinery for running both on Coiled and on other platforms (e.g. Cubed on AWS Lambda).

That group, however, has mostly become just me, Tom, and @keewis chatting. We do have some renewed interest from the Arkouda folks recently (by email).

Suggestion - Merge Groups

Can we just merge the people interested in “Massive Scale” into the Distributed Arrays Working Group? Your name is way catchier :grin: And maybe rejig the schedule + expand the scope a bit to bring all these concerns under one tent?


Thanks for the response Tom. Some comments:

open_mfdataset

I don’t think it’s enough to point people to Zarr. I’m pretty committed to also making reading tons of NetCDF files in the cloud fast. My sense is that this is actually pretty doable today if we parallelize the metadata reads with open_mfdataset(..., parallel=True), but merging that metadata currently performs poorly. @jrbourbeau has been working on this a bit recently.
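
For concreteness, the pattern under discussion looks roughly like this. The path is hypothetical; the combine-related keywords are the usual ones for skipping expensive compatibility checks, and the combine step is where the slow metadata merge shows up.

```python
import xarray as xr

ds = xr.open_mfdataset(
    "data/model-output/*.nc",   # hypothetical path; remote files need fsspec
    engine="h5netcdf",
    parallel=True,              # open files / read metadata via dask.delayed
    combine="by_coords",
    data_vars="minimal",        # don't concatenate variables that don't vary
    coords="minimal",           #   along the concat dimension
    compat="override",          # skip pairwise equality checks on shared vars
)
```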

Query Optimization

A lot of dask-expr has nothing to do with Dask. It has to do with writing down a symbolic expression system that matches the semantics of a common library (like pandas or numpy). We then add Dask protocols to these objects so that they can operate well in a Dask context. That work is pretty orthogonal though. I think that you could probably achieve what you wanted in dask-expr if you cared to.

Watch out though! There are hard algorithmic problems here. For example we’ll need to figure out how to pass slicing operations through broadcasting operations. It’s not for the faint of heart.
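
To give a flavor of what that symbolic layer looks like, here is a toy sketch with invented class names (nothing to do with dask-expr's actual internals): a slice is pushed below an elementwise add so only the requested rows are ever computed. The caveat in the docstring is the broadcasting problem referred to above.

```python
import numpy as np

# Toy expression nodes -- invented for this sketch.
class Expr:
    pass

class Array(Expr):
    def __init__(self, data):
        self.data = data

class Add(Expr):  # elementwise op
    def __init__(self, left, right):
        self.left, self.right = left, right

class Slice(Expr):
    def __init__(self, child, index):
        self.child, self.index = child, index

def optimize(expr):
    """Rewrite Slice(Add(a, b)) -> Add(Slice(a), Slice(b)).

    Valid as written only when a and b have the same shape. Remapping the
    index when the operands broadcast (different shapes) is exactly the
    "not for the faint of heart" part mentioned above.
    """
    if isinstance(expr, Slice) and isinstance(expr.child, Add):
        return Add(
            optimize(Slice(expr.child.left, expr.index)),
            optimize(Slice(expr.child.right, expr.index)),
        )
    return expr

def evaluate(expr):
    if isinstance(expr, Array):
        return expr.data
    if isinstance(expr, Add):
        return evaluate(expr.left) + evaluate(expr.right)
    if isinstance(expr, Slice):
        return evaluate(expr.child)[expr.index]
    raise TypeError(type(expr))

a = Array(np.arange(1_000_000.0).reshape(1000, 1000))
b = Array(np.ones((1000, 1000)))
expr = Slice(Add(a, b), (slice(0, 10), slice(None)))

# The optimized plan adds only the 10 requested rows instead of all 1000.
assert np.array_equal(evaluate(expr), evaluate(optimize(expr)))
```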

Bounded Memory Rechunking

As I mentioned above, I think it’s probably time for Dask core devs to talk to Xarray folks about where we are today. I’ve already raised a Coiled-internal issue about a webinar on this topic.

This team is at an offsite this week (integrating dask-expr into dask.dataframe) and I’m guessing that they’ll be focused on dask.dataframe into early Q1, but later in Q1 could be a good time for this I think.

Benchmarking

Openness

Just to be clear, even though coiled/benchmarks has “coiled” in the name, it’s pretty open. Benchmarks are triggered by GitHub Actions, the source is open, people outside of Coiled have write permissions, and it benchmarks projects that aren’t Dask (like DuckDB, Polars, Spark). There’s nothing to stop it from growing.

If you want to benchmark something like Cubed on AWS Lambda someone will have to pay for that though. We’d put those credentials somewhere in a Github Actions Secret.

Determining benchmarks

Cool, everyone is saying “I’d love to see a shared set of benchmarks”. How can we make this happen? Do you personally have time to do some work here? Maybe @dcherian from Earthmover? Maybe someone else?

Coiled folks are happy to do some work here (we already have some benchmarks pulled from various GitHub repos) but we’re not the domain experts. This really has to come from you all. If you have time to devote towards this that would be very welcome (although of course you have no obligation to do so).

Merge Groups

Sure? What is the concrete step that’s being proposed here? Are you asking someone in particular to attend a regular meeting? (if so, my personal answer is “no, but I’m happy to talk ad-hoc any time when things come up”)

I care about distributed arrays, and I’m happy to devote resources of those around me to this effort. However, I also need some things in return (like people to make benchmarks, and maybe someone to help flesh out the symbolic part of dask_expr.array). I’m very open to any mutually beneficial collaboration.

My personal experience is that working groups tend to be more about talking and less about working. With that in mind I’m more motivated to collaborate on specific work products.


open_mfdataset

You’re right that we can and should do better here. Help would be appreciated! I’m interested to know what @jrbourbeau has been doing.

Query optimization

Understood. For anyone who wants to follow this, the relevant GitHub issue is Integrate Dask Arrays properly · Issue #446 · dask-contrib/dask-expr · GitHub.

Bounded-Memory Rechunking

Sounds great! Looking forward to it.

Benchmarking

even though coiled/benchmarks has “coiled” in the name, it’s pretty open. Benchmarks are triggered by GitHub Actions, the source is open, people outside of Coiled have write permissions, and it benchmarks projects that aren’t Dask

Okay that is very useful to know - I think I was operating under an unexamined assumption that this had to be a separate effort. I will talk to @tomwhite about that.

Do you personally have time to do some work here? Maybe @dcherian from Earthmover? Maybe someone else?

I don’t know yet.

Merge groups

Sure? What is the concrete step that’s being proposed here? Are you asking someone in particular to attend a regular meeting?

I guess I’m re-extending the invitation for anyone who wants to come to the distributed arrays working group meetings (one-off or regular), and clarifying there are multiple common performance questions that we could be collaborating on as part of that group. I’m also happy to organise special meetings to discuss the above common topics.


Tree-reducing the combine should help a bit here, but fundamentally a collection of netCDFs is not guaranteed to be a consistent dataset, so our hands are tied a bit :slight_smile:
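
For anyone unfamiliar with the idea, here is a hedged sketch of what "tree-reducing the combine" could look like; the helper name is made up and this is not how open_mfdataset is implemented today.

```python
import xarray as xr

def tree_combine(datasets, batch_size=16):
    """Combine many datasets in a tree of small combines.

    Each level calls xr.combine_by_coords on at most `batch_size` datasets,
    so no single combine has to reconcile the metadata of all 100,000 files
    at once. It doesn't fix inconsistent inputs -- it just bounds the cost.
    """
    while len(datasets) > 1:
        datasets = [
            xr.combine_by_coords(datasets[i : i + batch_size])
            for i in range(0, len(datasets), batch_size)
        ]
    return datasets[0]

# usage sketch (paths hypothetical):
# pieces = [xr.open_dataset(p, chunks={}) for p in paths]
# ds = tree_combine(pieces)
```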

I’ve asked @jrbourbeau to write up his recent experiences and efforts. My hope is that that’ll catalyse some good joint efforts.


Hi all,

This is an “end of year don’t let me forget about this” kind of reply (short, incoherent!)

We’re starting to use Dask in the High Energy Physics (HEP) world via dask-awkward, and the kind of scaling challenges associated with this workload (e.g. graph optimisation) bear resemblance to the issues discussed here. I’m not the expert on the day-to-day observability of this, so I will point some HEP people to this thread.