Yeah, the core problem is in flox — a simple ds.resample(time="YS").sum().compute() raises a warning about a 300MB graph. flox is embedding data in the graph, and that data gets duplicated many times because the chunk size is so small.
Since you are calculating annual statistics and your input chunks are small (30MB), let's rechunk (ideally in your open_dataset call). You can use Xarray's new-ish TimeResampler objects to rechunk to a frequency (slick!):
from xarray.groupers import TimeResampler
ds.chunk(time=TimeResampler("10YS")) # aim for 200-300MB
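As a rough sketch of what this looks like end-to-end (the synthetic dataset, the variable name `tas`, and the initial 30-day chunking below are all made up for illustration), rechunking by frequency gives you blocks that each cover a whole number of years:

```python
import numpy as np
import pandas as pd
import xarray as xr
from xarray.groupers import TimeResampler

# Hypothetical daily dataset spanning 30 years, opened with small chunks.
time = pd.date_range("1990-01-01", "2019-12-31", freq="D")
ds = xr.Dataset(
    {"tas": ("time", np.random.rand(time.size))},
    coords={"time": time},
).chunk(time=30)

# Rechunk so each block covers exactly ten calendar years.
rechunked = ds.chunk(time=TimeResampler("10YS"))
print(rechunked.chunksizes["time"])  # three blocks, one per decade
```

You should also be able to pass the same spec when opening the file, something like `chunks={"time": TimeResampler("10YS")}` in open_dataset, so the small chunks never materialize in the first place.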
Once you do this, the problem is embarrassingly parallel. Your previous chunk size (468) did not line up with the "yearly" frequency, so some inter-block communication was required.
For a long time, I’ve wanted to figure out automated heuristics for this but haven’t had time to do so. (hah, I even started prototyping some automated rechunking here but never finished)