What's Next - Software - Massive Scale

Dask Array + Coiled Perspective

I’ll cover some work happening on the Dask side, plus some attempts at larger workloads and what we’ve learned about Xarray along the way.

Recent Dask Work

There are a few efforts in Dask that I think tackle a lot of this problem:

  1. Task ordering: There have been many incremental improvements here that have, I think, done a lot of good in the last six months. The most recent one is here: [WIP] Dask order rewrite critical path by fjetter · Pull Request #10660 · dask/dask · GitHub

  2. P2P rechunking: This performs rechunking operations in constant memory, and is also much faster. https://discourse.pangeo.io/t/rechunking-large-data-at-constant-memory-in-dask-experimental/3266
    This has been around for a while and is nice, but it still needs some work to handle the full complexity of the cases where it might be used. Fortunately it’s also under very active development; we should probably ask @hendrikmakait for an update on recent work. (There’s a small configuration sketch after this list showing how to turn it on.)

  3. Query Optimization: This isn’t done for arrays yet, but is done for dataframes. The result is a system that rewrites your code into what you should have written. In dataframes this often provides an order-of-magnitude speedup (Dask is now faster than Spark in many cases); the sketch after this list also shows how it’s enabled there today. For arrays this would likely mean that we no longer have to think about chunks at all.

    This isn’t done yet. Maybe Coiled folks get to it in Q2 2024, but honestly we could use some energy here if there are adventurous people who want to work on it (warning: the coding problems here are challenging).
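
For anyone who wants to try the pieces that already exist, here’s a minimal sketch (not from the discussion above) of the configuration knobs I believe recent Dask releases use for P2P rechunking and dataframe query planning; double-check the keys against the docs for your installed version:

```python
import dask
import dask.array as da
from dask.distributed import Client, LocalCluster

# A minimal sketch, not from the discussion above. The config keys are the
# ones I believe recent Dask releases use -- double-check them against the
# documentation for your installed version.

# Query planning (dask-expr) has to be enabled before dask.dataframe is
# first imported (here that happens lazily inside dask.datasets.timeseries).
dask.config.set({"dataframe.query-planning": True})

if __name__ == "__main__":
    # P2P rechunking runs on the distributed scheduler.
    client = Client(LocalCluster(n_workers=4))

    # Rechunk an array peer-to-peer, in constant memory per worker.
    with dask.config.set({"array.rechunk.method": "p2p"}):
        x = da.random.random((10_000, 10_000), chunks=(10_000, 100))
        x.rechunk((100, 10_000)).sum().compute()

    # Query planning rewrites the dataframe graph (column pruning, predicate
    # pushdown, ...) into roughly what you should have written by hand.
    df = dask.datasets.timeseries()
    df[df.name == "Alice"].x.mean().compute()
```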

Xarray could use some help too

When benchmarking some very large problems we find that often it’s not dask.array that is broken, but Xarray. For example, if you want to load 100,000 netCDF files with open_mfdataset, the way Xarray opens and combines the data is somewhat broken. Actually, even creating 100,000 s3fs file objects is somewhat broken.
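
To make that concrete, here’s a sketch of the shape of that workload (the bucket, prefix, and anonymous access are hypothetical):

```python
import s3fs
import xarray as xr

# The bucket, prefix, and anonymous access here are hypothetical.
fs = s3fs.S3FileSystem(anon=True)
files = [fs.open(path) for path in fs.glob("my-bucket/model-output/*.nc")]

# In the real workload this is ~100,000 files; both the file-object creation
# above and the open/combine step below become bottlenecks at that scale.
ds = xr.open_mfdataset(
    files,
    engine="h5netcdf",   # h5netcdf can read file-like objects
    combine="by_coords",
    parallel=True,       # open files and read metadata with Dask, not serially
)
```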

High performance computing isn’t about doing one thing well; it’s about doing nothing poorly. The entire stack will need attention. This likely isn’t something that can be fixed in Dask alone; many projects will need to care about it.

Suggestion: benchmarking

In the Dask+Coiled team we’ve found great value in benchmarking. It helps us make decisions and identify bottlenecks effectively, and it’s what has led to a 10x performance boost in dask.dataframe over the last year. We should do something similar for Pangeo. We already do a bit of this (see https://github.com/coiled/benchmarks/blob/main/tests/benchmarks/test_zarr.py) but it’s nascent. We could use much larger examples, like Processing a 250 TB dataset with Coiled, Dask, and Xarray — Coiled Blog
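
For concreteness, here’s a hedged sketch of what one Pangeo-flavored benchmark case could look like, in the spirit of the linked test_zarr.py (the workload, sizes, and function name are made up; a real benchmark would run much larger, on a cluster, against cloud storage):

```python
import dask.array as da
import pandas as pd
import xarray as xr


def test_monthly_anomaly():
    # Synthetic stand-in for a large climate cube. The sizes here are tiny;
    # a real benchmark would be orders of magnitude larger.
    time = pd.date_range("2000-01-01", periods=1_000, freq="D")
    temp = da.random.random((1_000, 500, 500), chunks=(100, 500, 500))
    ds = xr.Dataset({"temp": (("time", "y", "x"), temp)}, coords={"time": time})

    # A common Pangeo-style workload: monthly climatology and anomalies.
    climatology = ds.temp.groupby("time.month").mean("time")
    anomaly = ds.temp.groupby("time.month") - climatology
    anomaly.mean().compute()
```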

This requires people in this community to step up, I think. We (Coiled) are happy to host and run large-scale benchmarks regularly, but we need help determining what they should be. I’ll propose this as a good activity during AGU.
