> why does the `.to_zarr()` call need everything in memory / is it possible to set it up so that it doesn't?
I think it's because your workflow combines an operation that requires the full spatial extent of the data (regridding with xESMF) and an operation that requires the full temporal extent of the data (rechunking to one chunk along the time dimension). Together, they require the full dataset to be loaded into memory. AFAIK whether it's possible to avoid this depends on the chunking of the input datasets, but your best bet is probably caching the regridded output to disk before rechunking. I recommend checking out @norlandrhagen's blog post on downscaling pipelines, which also includes a regrid and rechunk step.
A few other comments:
- You probably want to check if using the defaults for open_mfdataset is the right choice for your data (see Stricter defaults for concat, combine, open_mfdataset · Issue #8778 · pydata/xarray · GitHub)
- I view using both Dask Array and Delayed objects as a bit of an orange flag. I'm not sure why you'd need Delayed in this case rather than just using Xarray's integrated parallelism with Dask arrays.
- xESMF without pre-generated weights is the slowest and most memory-intensive of the commonly available regridding options (see Summary – Geospatial reprojection in Python (2024)). While conservative regridding is a good reason to put up with it, it's worth considering separating the weight-generation step from the application of the regridder.
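On the `open_mfdataset` and Delayed points above, a small sketch (the file path, chunk sizes, and variable name `tas` are all hypothetical):

```python
import numpy as np
import xarray as xr

# Spelling out the open_mfdataset combine/concat kwargs explicitly, so the
# stricter defaults proposed in xarray issue #8778 won't silently change
# behavior later (commented out since the files don't exist here):
# ds = xr.open_mfdataset(
#     "path/to/files/*.nc",
#     combine="nested", concat_dim="time",
#     data_vars="minimal", coords="minimal", compat="override",
#     parallel=True, chunks={"time": 12},
# )

# With chunked (dask-backed) xarray objects, operations are already lazy,
# so there's usually no need to wrap anything in dask.delayed:
ds = xr.Dataset(
    {"tas": (("time", "lat", "lon"), np.random.rand(24, 4, 4))}
).chunk({"time": 6})

anom = ds["tas"] - ds["tas"].mean("time")  # still lazy: a dask-backed DataArray
result = anom.compute()                    # computation happens only here
```

Everything between opening and `.compute()` (or `.to_zarr()`) just builds the task graph; dask schedules it chunk by chunk without a separate Delayed layer.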