I have a problem with random blosc decompression errors while reading zarr files, which @rsignell suggested I post on Pangeo Discourse.

I have a script that takes the zarr files produced by rechunker (about 2500 zarr chunks) and creates a combined zarr file on our HPC system (USGS denali), which has a Lustre filesystem. When I last ran this script, around April 2022, everything worked without any problems. Last week I ran through the workflow again and encountered a number of errors and warnings. One error in particular, related to blosc decompression, is preventing me from running to completion.

I submit this workflow as a SLURM job array which runs 30-50 tasks at a time. Each task sequentially reads a given set of zarr chunks and writes them to its region of the combined zarr file. The blosc decompression error happens randomly during the tasks: some tasks run successfully and others crash at random points. When I manually re-run any task that failed, it runs successfully every time. Below is an example of the blosc decompression error message.
```
Traceback (most recent call last):
  File "/lustre/conus404_work/conus404_to_zarr.py", line 169, in <module>
    main()
  File "/lustre/conus404_work/conus404_to_zarr.py", line 153, in main
    dsi.to_zarr(zarr_whole, region={'time': slice(start, stop)})
  File "/<somepath>/anaconda3/envs/pangeo/lib/python3.10/site-packages/xarray/core/dataset.py", line 2060, in to_zarr
    return to_zarr(  # type: ignore
  File "/<somepath>/anaconda3/envs/pangeo/lib/python3.10/site-packages/xarray/backends/api.py", line 1637, in to_zarr
    dump_to_store(dataset, zstore, writer, encoding=encoding)
  File "/<somepath>/anaconda3/envs/pangeo/lib/python3.10/site-packages/xarray/backends/api.py", line 1257, in dump_to_store
    store.store(variables, attrs, check_encoding, writer, unlimited_dims=unlimited_dims)
  File "/<somepath>/anaconda3/envs/pangeo/lib/python3.10/site-packages/xarray/backends/zarr.py", line 545, in store
    existing_vars, _, _ = conventions.decode_cf_variables(
  File "/<somepath>/anaconda3/envs/pangeo/lib/python3.10/site-packages/xarray/conventions.py", line 521, in decode_cf_variables
    new_vars[k] = decode_cf_variable(
  File "/<somepath>/anaconda3/envs/pangeo/lib/python3.10/site-packages/xarray/conventions.py", line 369, in decode_cf_variable
    var = times.CFDatetimeCoder(use_cftime=use_cftime).decode(var, name=name)
  File "/<somepath>/anaconda3/envs/pangeo/lib/python3.10/site-packages/xarray/coding/times.py", line 682, in decode
    dtype = _decode_cf_datetime_dtype(data, units, calendar, self.use_cftime)
  File "/<somepath>/anaconda3/envs/pangeo/lib/python3.10/site-packages/xarray/coding/times.py", line 176, in _decode_cf_datetime_dtype
    [first_n_items(values, 1) or [0], last_item(values) or [0]]
  File "/<somepath>/anaconda3/envs/pangeo/lib/python3.10/site-packages/xarray/core/formatting.py", line 76, in first_n_items
    return np.asarray(array).flat[:n_desired]
  File "/<somepath>/anaconda3/envs/pangeo/lib/python3.10/site-packages/xarray/core/indexing.py", line 459, in __array__
    return np.asarray(self.array, dtype=dtype)
  File "/<somepath>/anaconda3/envs/pangeo/lib/python3.10/site-packages/xarray/core/indexing.py", line 524, in __array__
    return np.asarray(array[self.key], dtype=None)
  File "/<somepath>/anaconda3/envs/pangeo/lib/python3.10/site-packages/xarray/backends/zarr.py", line 76, in __getitem__
    return array[key.tuple]
  File "/<somepath>/anaconda3/envs/pangeo/lib/python3.10/site-packages/zarr/core.py", line 788, in __getitem__
    result = self.get_basic_selection(pure_selection, fields=fields)
  File "/<somepath>/anaconda3/envs/pangeo/lib/python3.10/site-packages/zarr/core.py", line 914, in get_basic_selection
    return self._get_basic_selection_nd(selection=selection, out=out,
  File "/<somepath>/anaconda3/envs/pangeo/lib/python3.10/site-packages/zarr/core.py", line 957, in _get_basic_selection_nd
    return self._get_selection(indexer=indexer, out=out, fields=fields)
  File "/<somepath>/anaconda3/envs/pangeo/lib/python3.10/site-packages/zarr/core.py", line 1247, in _get_selection
    self._chunk_getitem(chunk_coords, chunk_selection, out, out_selection,
  File "/<somepath>/anaconda3/envs/pangeo/lib/python3.10/site-packages/zarr/core.py", line 1951, in _chunk_getitem
    self._process_chunk(out, cdata, chunk_selection, drop_axes,
  File "/<somepath>/anaconda3/envs/pangeo/lib/python3.10/site-packages/zarr/core.py", line 1894, in _process_chunk
    chunk = self._decode_chunk(cdata)
  File "/<somepath>/anaconda3/envs/pangeo/lib/python3.10/site-packages/zarr/core.py", line 2152, in _decode_chunk
    chunk = self._compressor.decode(cdata)
  File "numcodecs/blosc.pyx", line 562, in numcodecs.blosc.Blosc.decode
  File "numcodecs/blosc.pyx", line 392, in numcodecs.blosc.decompress
RuntimeError: error during blosc decompression: -1
```
Here are the versions of the related packages I currently have installed:

```
numcodecs  0.10.2
numpy      1.22.4
python     3.10.5
xarray     2022.6.0
zarr       2.12.0
```
I have searched online for this error message but have not found a solution yet. I have also looked at the system logs (as far as I am able) to see if any system-level problems might be triggering the error, but so far I see nothing useful. My current thought is that this may be caused by network issues (congestion, timeouts, dropped packets, etc.), because the problem seems to occur randomly.
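Since failed tasks always succeed when re-run, one workaround I'm considering is wrapping the read/write step in a simple retry. The helper below is my own sketch (not from any library), and `flaky_read` just simulates the transient failure:

```python
import time

def with_retries(fn, attempts=3, delay=30, match='blosc decompression'):
    """Call fn(); retry up to `attempts` times when a RuntimeError whose
    message contains `match` is raised, sleeping `delay` seconds between
    tries. Any other exception propagates immediately."""
    for attempt in range(1, attempts + 1):
        try:
            return fn()
        except RuntimeError as exc:
            if match not in str(exc) or attempt == attempts:
                raise
            time.sleep(delay)

# Simulate a read that fails transiently on the first two attempts:
calls = {'n': 0}
def flaky_read():
    calls['n'] += 1
    if calls['n'] < 3:
        raise RuntimeError('error during blosc decompression: -1')
    return 'data'

print(with_retries(flaky_read, delay=0))  # prints 'data' on the third attempt
```

In the real script this would wrap the `dsi.to_zarr(...)` call. It wouldn't fix the underlying problem, but logging how many attempts each task needed would let me distinguish "always fails" from "fails once, then succeeds", which seems like useful information to pass to the sysadmins too.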
Has anyone seen this problem before and found an effective solution? If not, is there a set of debug settings I could use to gather more information? If this is in fact a problem with our HPC system, I currently don't have enough information to pass it on to our system admins. Any help or insights would be greatly appreciated.
-parker