Problem with blosc decompression and zarr files

I have a problem with random blosc decompression errors while reading zarr files, which @rsignell suggested I post about here on pangeo-discourse.

I have a script that takes zarr files produced by a rechunking step (about 2500 zarr chunks) and creates a combined zarr file on our HPC system. When I last ran this script, around April 2022, everything worked without any problems. Last week I ran through this workflow again and encountered a number of errors and warnings. One particular error, related to blosc decompression, is preventing me from successfully running to completion.

This workflow runs on our HPC system (USGS Denali), which has a Lustre filesystem. I submit the workflow as a SLURM job array that runs 30-50 tasks at a time. Each task sequentially reads a given set of zarr chunks and writes them to its region of the combined zarr file. The blosc decompression error happens randomly during the tasks: some tasks run successfully, while others crash at random points. When I manually re-run any of the tasks that failed, they run successfully every time. Below is a simplified sketch of what each task does, followed by an example of the blosc decompression error message.
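(The paths, names, and index bookkeeping in this sketch are placeholders, not the actual conus404_to_zarr.py code.)

import xarray as xr

def process_task(chunk_stores, zarr_whole):
    # chunk_stores: list of (source_store, (start, stop)) pairs assigned to this task
    # zarr_whole: path to the pre-initialized combined zarr store
    for src, (start, stop) in chunk_stores:
        dsi = xr.open_zarr(src)
        # Write this piece into its slice of the time dimension
        dsi.to_zarr(zarr_whole, region={'time': slice(start, stop)})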

Traceback (most recent call last):
  File "/lustre/conus404_work/conus404_to_zarr.py", line 169, in <module>
    main()
  File "/lustre/conus404_work/conus404_to_zarr.py", line 153, in main
    dsi.to_zarr(zarr_whole, region={'time': slice(start, stop)})
  File "/<somepath>/anaconda3/envs/pangeo/lib/python3.10/site-packages/xarray/core/dataset.py", line 2060, in to_zarr
    return to_zarr(  # type: ignore
  File "/<somepath>/anaconda3/envs/pangeo/lib/python3.10/site-packages/xarray/backends/api.py", line 1637, in to_zarr
    dump_to_store(dataset, zstore, writer, encoding=encoding)
  File "/<somepath>/anaconda3/envs/pangeo/lib/python3.10/site-packages/xarray/backends/api.py", line 1257, in dump_to_store
    store.store(variables, attrs, check_encoding, writer, unlimited_dims=unlimited_dims)
  File "/<somepath>/anaconda3/envs/pangeo/lib/python3.10/site-packages/xarray/backends/zarr.py", line 545, in store
    existing_vars, _, _ = conventions.decode_cf_variables(
  File "/<somepath>/anaconda3/envs/pangeo/lib/python3.10/site-packages/xarray/conventions.py", line 521, in decode_cf_variables
    new_vars[k] = decode_cf_variable(
  File "/<somepath>/anaconda3/envs/pangeo/lib/python3.10/site-packages/xarray/conventions.py", line 369, in decode_cf_variable
    var = times.CFDatetimeCoder(use_cftime=use_cftime).decode(var, name=name)
  File "/<somepath>/anaconda3/envs/pangeo/lib/python3.10/site-packages/xarray/coding/times.py", line 682, in decode
    dtype = _decode_cf_datetime_dtype(data, units, calendar, self.use_cftime)
  File "/<somepath>/anaconda3/envs/pangeo/lib/python3.10/site-packages/xarray/coding/times.py", line 176, in _decode_cf_datetime_dtype
    [first_n_items(values, 1) or [0], last_item(values) or [0]]
  File "/<somepath>/anaconda3/envs/pangeo/lib/python3.10/site-packages/xarray/core/formatting.py", line 76, in first_n_items
    return np.asarray(array).flat[:n_desired]
  File "/<somepath>/anaconda3/envs/pangeo/lib/python3.10/site-packages/xarray/core/indexing.py", line 459, in __array__
    return np.asarray(self.array, dtype=dtype)
  File "/<somepath>/anaconda3/envs/pangeo/lib/python3.10/site-packages/xarray/core/indexing.py", line 524, in __array__
    return np.asarray(array[self.key], dtype=None)
  File "/<somepath>/anaconda3/envs/pangeo/lib/python3.10/site-packages/xarray/backends/zarr.py", line 76, in __getitem__
    return array[key.tuple]
  File "/<somepath>/anaconda3/envs/pangeo/lib/python3.10/site-packages/zarr/core.py", line 788, in __getitem__
    result = self.get_basic_selection(pure_selection, fields=fields)
  File "/<somepath>/anaconda3/envs/pangeo/lib/python3.10/site-packages/zarr/core.py", line 914, in get_basic_selection
    return self._get_basic_selection_nd(selection=selection, out=out,
  File "/<somepath>/anaconda3/envs/pangeo/lib/python3.10/site-packages/zarr/core.py", line 957, in _get_basic_selection_nd
    return self._get_selection(indexer=indexer, out=out, fields=fields)
  File "/<somepath>/anaconda3/envs/pangeo/lib/python3.10/site-packages/zarr/core.py", line 1247, in _get_selection
    self._chunk_getitem(chunk_coords, chunk_selection, out, out_selection,
  File "/<somepath>/anaconda3/envs/pangeo/lib/python3.10/site-packages/zarr/core.py", line 1951, in _chunk_getitem
    self._process_chunk(out, cdata, chunk_selection, drop_axes,
  File "/<somepath>/anaconda3/envs/pangeo/lib/python3.10/site-packages/zarr/core.py", line 1894, in _process_chunk
    chunk = self._decode_chunk(cdata)
  File "/<somepath>/anaconda3/envs/pangeo/lib/python3.10/site-packages/zarr/core.py", line 2152, in _decode_chunk
    chunk = self._compressor.decode(cdata)
  File "numcodecs/blosc.pyx", line 562, in numcodecs.blosc.Blosc.decode
  File "numcodecs/blosc.pyx", line 392, in numcodecs.blosc.decompress
RuntimeError: error during blosc decompression: -1

Here are the versions of the related packages I currently have installed:

numcodecs                 0.10.2
numpy                     1.22.4
python                    3.10.5
xarray                    2022.6.0
zarr                      2.12.0

I have tried online searches for this error message but have not found a solution yet. I have also looked at system logs (as much as I am able) to see if any system-level problems might be triggering the error, but so far I see nothing useful. My current thought is that this may be caused by network issues (e.g., congestion, timeouts, dropped packets), since the problem seems to occur randomly.

Has anyone seen this problem before and found an effective solution? If not, is there a set of debug settings I could use to gather more information? If this is in fact a problem with our HPC, I currently don't have enough information to be able to pass it on to our system admin. Any help or insights would be greatly appreciated.

-parker

@pnorton-usgs, thanks for sharing this here. Welcome!

So the short story is:

  • You had a workflow a few months ago using .to_zarr() that worked.
  • After some package updates and HPC filesystem changes, it no longer works (dies with RuntimeError: error during blosc decompression: -1).
  • You would like to know how to determine whether it’s:
    • a bug in some updated package,
    • an issue with your updated environment (a package incompatibility or something), or
    • a user-side filesystem issue that to_zarr() now bombs out on.

Is that right?

@rsignell I think you have summarized the problem accurately.

@pnorton-usgs, I asked @jsignell and she said random decoding problems sounded like memory errors and that there is a new tab in the dask dashboard to see the worker logs. Worth a shot?

@rsignell So far I’ve only been able to trigger this error when running this workflow as a job array. Is there a way to output that information to a file or stdout during the runs?

You might be able to get that info in a Dask performance report?
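Something along these lines, assuming the script is running against a dask.distributed cluster (the filename and the Client() setup here are just placeholders):

from dask.distributed import Client, performance_report

client = Client()  # or however the script already connects to its cluster

# Everything executed inside this block is captured in a standalone
# HTML report (task stream, worker profiles, etc.) you can open afterwards.
with performance_report(filename="task_report.html"):
    dsi.to_zarr(zarr_whole, region={'time': slice(start, stop)})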


So @pnorton-usgs told me that, unfortunately, there is no difference in the Dask performance reports between runs where the identical script succeeds and runs where it fails. He’s now wondering if this is some network issue related to NFS-mounted drives on our USGS HPC system.

@guillaumeeb, have you seen random errors like this for Dask workflows on your HPC?

@rsignell - just a clarification: our HPC is using Lustre, not NFS.

Hi @rsignell @pnorton-usgs,

Sorry, I’ve never encountered this issue, but then I’ve rarely done heavy Zarr file transformations.

But I thought I remembered seeing this somewhere, and I found intermittent errors during blosc decompression of zarr chunks on pangeo.pydata.org · Issue #196 · pangeo-data/pangeo · GitHub. I’m not sure whether it gives answers. I also found a more recent occurrence of the problem: Intermittent blosc decompression errors · Issue #58 · pangeo-forge/pangeo-forge-recipes · GitHub.

Maybe @rabernat has thoughts?

(Rich, my user handle here is @geynard; I connected here before GitHub authentication was plugged in…)

@geynard, excellent – I’m a little sad we didn’t find those when we searched, but I’m super glad you did and helped us out! We’ll check those out and report back!

@pnorton-usgs reported to me today that the intermittent (and therefore nonreproducible) problems go away if we replace the numcodecs Blosc compression with numcodecs zlib compression. :upside_down_face:
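For anyone who wants to try the same workaround, the change amounts to specifying a different compressor in the encoding when the combined store is created, roughly like this (an illustrative sketch, not the exact CONUS404 code):

import numcodecs

# ds is the template xarray Dataset used to initialize the combined store;
# the compression level here is just a placeholder.
encoding = {var: {'compressor': numcodecs.Zlib(level=4)}
            for var in ds.data_vars}

ds.to_zarr(zarr_whole, encoding=encoding, compute=False)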

While we are happy to no longer have problems with our workflow, it seems like avoiding Blosc without knowing why it’s failing is a sad result.

I seem to recall something about an environment variable that controls the number of blosc threads, and that this can potentially conflict with Dask’s use of threads. Is Dask involved in this? I recommend completely bypassing Dask and seeing if the problems persist. If they do, open an issue on Issues · zarr-developers/numcodecs · GitHub.
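If you want to test that theory, one thing to try (a sketch; these are module-level settings in numcodecs, so double-check them against your installed version) is forcing Blosc into single-threaded mode at the top of the script, before any reads or writes:

import numcodecs.blosc

# Force Blosc to take its non-threaded code path
numcodecs.blosc.use_threads = False

# Or cap the internal Blosc thread pool at a single thread
numcodecs.blosc.set_nthreads(1)

If the errors disappear with threading disabled, that would point at a threading interaction rather than the filesystem.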