Xarray to Zarr Parallel Writes with Dask Distributed

Thanks @willirath, I appreciate the information! The reason I am using intake.open_netcdf is that I found it very difficult to get the xr.open_dataset(...) engines to work with Google Cloud object storage; the intake module has proven much easier in this regard.
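For context, this is roughly all it takes on my end (the bucket and file names below are placeholders, and it assumes intake-xarray and gcsfs are installed):

```python
import intake  # uses the intake-xarray plugin; gcsfs handles the gs:// URL

# Placeholder URI -- substitute your own bucket/object path.
source = intake.open_netcdf("gs://my-bucket/my-dataset.nc")
ds = source.to_dask()  # lazy xarray.Dataset backed by Dask arrays
print(ds)
```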

The NetCDF4 file's internal chunks are stored at about 100 MB apiece, and from a previous post I know that the Dask array chunks must exactly match, or be exact multiples of, the NetCDF internal chunks for an efficient parallel read. So the chunk sizes I am specifying for the Dask array come from intake.open_netcdf(<URI>).to_dask().<DATA VARIABLE>.encoding['chunksizes'].
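Concretely, building on the snippet above, deriving and applying those chunks looks something like this (the URI and variable name are placeholders, and encoding['chunksizes'] may be None if the file isn't internally chunked):

```python
import intake

uri = "gs://my-bucket/my-dataset.nc"   # placeholder URI
var = "temperature"                    # placeholder data variable name

# Open lazily once just to inspect the on-disk (NetCDF4/HDF5) chunk layout.
meta = intake.open_netcdf(uri).to_dask()
native = meta[var].encoding["chunksizes"]     # e.g. (1, 720, 1440)
chunks = dict(zip(meta[var].dims, native))    # map each dim to its chunk size

# Re-open with Dask chunks that exactly match (or are exact multiples of)
# the internal NetCDF chunks, so each task reads whole chunks.
ds = intake.open_netcdf(uri, chunks=chunks).to_dask()
```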

You really cleared things up for me! I will try the write using a mixture of threads and processes and implement the heuristic; I'm eager to see what new information I can gather from this.
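Here is a rough sketch of the write I have in mind; the worker/thread counts, memory limit, and output path are just placeholders I'll tune per the heuristic:

```python
from dask.distributed import Client, LocalCluster

# A mix of processes and threads: a few workers (processes), each with a
# couple of threads, rather than all-threads or all-processes.
cluster = LocalCluster(n_workers=4, threads_per_worker=2, memory_limit="8GB")
client = Client(cluster)

# `ds` is the chunk-aligned Dataset opened above; writing straight to a
# (placeholder) GCS bucket requires gcsfs.
ds.to_zarr("gs://my-bucket/my-dataset.zarr", mode="w", consolidated=True)
```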