> why does the `.to_zarr()` call need everything in memory / is it possible to set it up so that it doesn't?
I think it's because your workflow combines an operation that requires the full spatial extent of the data (regridding with xESMF) and an operation that requires the full temporal extent of the data (rechunking to one chunk along the time dimension). Together, they require the full dataset to be loaded into memory. AFAIK whether it's possible to avoid this depends on the chunking of the input datasets, but your best bet is probably caching the regridded output to disk before rechunking. I recommend checking out @norlandrhagen's blog post on downscaling pipelines, which also includes a regrid and rechunk step.
A few other comments:
- You probably want to check if using the defaults for open_mfdataset is the right choice for your data (see Stricter defaults for concat, combine, open_mfdataset · Issue #8778 · pydata/xarray · GitHub)
- I view using both Dask Array and Delayed objects as a bit of an orange flag. I'm not sure why you'd need Delayed in this case rather than just using Xarray's integrated parallelism with Dask arrays.
- xESMF without pre-generated weights is the slowest and most memory-intensive of the commonly available regridding options (see Summary – Geospatial reprojection in Python (2024)). While conservative regridding is a good reason to put up with it, it's worth considering separating the weight-generation step from the application of the regridder.
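On the `open_mfdataset` and Delayed points above, a small sketch (the file path, chunk sizes, and variable name `tas` are all hypothetical):

```python
import numpy as np
import xarray as xr

# Spelling out the open_mfdataset combine/concat kwargs explicitly, so the
# stricter defaults proposed in xarray issue #8778 won't silently change
# behavior later (commented out since the files don't exist here):
# ds = xr.open_mfdataset(
#     "path/to/files/*.nc",
#     combine="nested", concat_dim="time",
#     data_vars="minimal", coords="minimal", compat="override",
#     parallel=True, chunks={"time": 12},
# )

# With chunked (dask-backed) xarray objects, operations are already lazy,
# so there's usually no need to wrap anything in dask.delayed:
ds = xr.Dataset(
    {"tas": (("time", "lat", "lon"), np.random.rand(24, 4, 4))}
).chunk({"time": 6})

anom = ds["tas"] - ds["tas"].mean("time")  # still lazy: a dask-backed DataArray
result = anom.compute()                    # computation happens only here
```

Everything between opening and `.compute()` (or `.to_zarr()`) just builds the task graph; dask schedules it chunk by chunk without a separate Delayed layer.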