Xarray to Zarr Parallel Writes with Dask Distributed

Thanks @willirath, I appreciate the information! The reason I am using intake.open_netcdf is that I found it very difficult to get the xr.open_dataset(...) engines to work with Google Cloud object storage; the intake module has proven much easier in this regard.
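For context, this is roughly all it takes on my end (the bucket and file names below are placeholders, and it assumes intake-xarray and gcsfs are installed):

```python
import intake  # uses the intake-xarray plugin; gcsfs handles the gs:// URL

# Placeholder URI -- substitute your own bucket/object path.
source = intake.open_netcdf("gs://my-bucket/my-dataset.nc")
ds = source.to_dask()  # lazy xarray.Dataset backed by Dask arrays
print(ds)
```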

The NetCDF4 file's internal chunks are stored at about 100 MB apiece, and from a previous post I know that the Dask array chunks must exactly match, or be exact multiples of, the NetCDF internal chunks for an efficient parallel read. So the chunk sizes I am specifying for the Dask array come from intake.open_netcdf(<URI>).to_dask().<DATA VARIABLE>.encoding['chunksizes'].
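Concretely, building on the snippet above, deriving and applying those chunks looks something like this (the URI and variable name are placeholders, and encoding['chunksizes'] may be None if the file isn't internally chunked):

```python
import intake

uri = "gs://my-bucket/my-dataset.nc"   # placeholder URI
var = "temperature"                    # placeholder data variable name

# Open lazily once just to inspect the on-disk (NetCDF4/HDF5) chunk layout.
meta = intake.open_netcdf(uri).to_dask()
native = meta[var].encoding["chunksizes"]     # e.g. (1, 720, 1440)
chunks = dict(zip(meta[var].dims, native))    # map each dim to its chunk size

# Re-open with Dask chunks that exactly match (or are exact multiples of)
# the internal NetCDF chunks, so each task reads whole chunks.
ds = intake.open_netcdf(uri, chunks=chunks).to_dask()
```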

You really cleared things up for me! I will try the write using a mixture of threads and processes and implement the heuristic; I'm eager to see what new information I can gather from this.
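Here is a rough sketch of the write I have in mind; the worker/thread counts, memory limit, and output path are just placeholders I'll tune per the heuristic:

```python
from dask.distributed import Client, LocalCluster

# A mix of processes and threads: a few workers (processes), each with a
# couple of threads, rather than all-threads or all-processes.
cluster = LocalCluster(n_workers=4, threads_per_worker=2, memory_limit="8GB")
client = Client(cluster)

# `ds` is the chunk-aligned Dataset opened above; writing straight to a
# (placeholder) GCS bucket requires gcsfs.
ds.to_zarr("gs://my-bucket/my-dataset.zarr", mode="w", consolidated=True)
```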