Is writing to NetCDF from Zarr sometimes slow? I am working with CM2.6 1pct CO2 sea level data on my cluster, which I’ve rechunked into time chunks using the rechunker package. I would like to process it by detrending and removing the seasonal cycle. Since each of these operations multiplies the number of tasks, I’d like to save the data as NetCDF between processing steps (for example, save the detrended time series, and then save the detrended + seasonal-cycle-removed time series).
For some reason this has been quite slow on my HPC cluster… even just resaving the rechunked Zarr dataset as NetCDF won’t complete within two days. This is despite the fact that the analogous script, which processes and saves sea level anomalies from the CM2.6 picontrol dataset as NetCDF, runs well within the 6-hour time limit. So I am rather confused as to what the issue could be.
Writing large netCDF files in parallel is nearly always slow. That’s why Zarr was created! Why not use Zarr as your intermediate storage container, rather than NetCDF? Have you tried that?
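Schematically, the intermediate-store pattern looks like this (the paths are made up, and the detrending step here is just a stand-in for your real processing):

```python
import xarray as xr

ds = xr.open_zarr("rechunked.zarr")

# Placeholder for your actual detrending step.
detrended = ds - ds.mean("time")

# Persist the intermediate result to Zarr; writing executes the graph once...
detrended.to_zarr("detrended.zarr", mode="w")

# ...and re-opening gives the next processing step a fresh, small task graph.
detrended = xr.open_zarr("detrended.zarr")
```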
Hi @rabernat, thanks for the quick reply! I’ve tried using Zarr as my intermediate storage, but I believe my issue is that the number of tasks keeps ballooning. For instance, the rechunked dataset is ~1e4 tasks, the detrended dataset becomes ~3e5 tasks, and removing the seasonal cycle gives me tons of performance warnings: “slicing with an out of order index is generating 20 times more chunks”. Eventually my cluster just kills the process. Am I identifying the issue correctly?
I feel like we may be mixing together a few separate issues in this discussion.
I see your point about the number of tasks. But I don’t understand what it has to do with Zarr vs. NetCDF. Writing either format will require the same number of tasks. Perhaps you mean that you want the Zarr chunks you write to differ from your Dask chunks? That is possible by setting the encoding['chunks'] property. (See the xarray docs.)
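For example (assuming a dataset `ds` with a variable "zos"; the name and chunk sizes are placeholders):

```python
# Keep large Dask chunks in memory, but write smaller chunks to disk by
# overriding the Zarr chunk layout in the encoding.
ds.to_zarr(
    "output.zarr",
    mode="w",
    encoding={"zos": {"chunks": (12, 180, 360)}},  # (time, lat, lon) on disk
)
```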
There are some things you can do to reduce the number of tasks. If you have already rechunked your dataset to be contiguous in time, and all your operations can be parallel in space, you should be able to do everything in a single task per chunk using map_blocks. Here is an example of that.
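A rough sketch of that pattern (the variable name "zos", the store paths, and the chunking are all assumptions):

```python
import xarray as xr

def detrend_deseasonalize(da):
    # Runs eagerly on one block: the full time series for one spatial tile.
    # Remove a linear trend along time.
    fit = da.polyfit(dim="time", deg=1)
    trend = xr.polyval(da["time"], fit.polyfit_coefficients)
    detrended = da - trend
    # Remove the mean seasonal cycle, then drop the "month" coordinate the
    # groupby arithmetic adds so the result matches the template.
    clim = detrended.groupby("time.month").mean("time")
    return (detrended.groupby("time.month") - clim).drop_vars("month")

# Assumes the data is chunked contiguously in time, e.g. {"time": -1}.
# map_blocks applies the function once per chunk, so the task graph stays
# flat instead of multiplying with every operation.
zos = xr.open_zarr("rechunked.zarr")["zos"]
result = xr.map_blocks(detrend_deseasonalize, zos, template=zos)
result.to_dataset(name="zos").to_zarr("processed.zarr", mode="w")
```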
However, I emphasize that this is orthogonal to the question of parallel write performance.
That makes sense; switching to NetCDF won’t solve the problem. map_blocks looks like it might be exactly what I was looking for, though. I’ll check it out. Thanks Ryan!