Opening zarr on s3 gives different results

jaapel · November 23, 2023, 4:57pm

Hi all,

I am running into an issue while switching to reading zarr files through s3fs.
There is a working version of the dataset available on a virtual drive, which I can access perfectly fine with both the python zarr library and the xarray.open_zarr accessor.

I uploaded the same dataset to an s3 bucket. Here I am getting some interesting results.
The timeindex (cftime) contains duplicates when I am opening from s3:

zgroup = zarr.open(s3_store)
times = list(zgroup.time)
npt = np.asarray(times)
u, c = np.unique(npt, return_counts=True)
dup = u[c > 1]
dup

This results in a small subset of my time data showing up as duplicated when I read the data from s3, and the subset is different on every read!
This dup array is always empty when I read from the mounted drive. I checked the .zarray and .zmetadata files, which are of course the same between locations:

{'chunks': [365],
 'compressor': {'blocksize': 0,
  'clevel': 5,
  'cname': 'lz4',
  'id': 'blosc',
  'shuffle': 1},
 'dtype': '<i8',
 'fill_value': None,
 'filters': None,
 'order': 'C',
 'shape': [26694],
 'zarr_format': 2}

{'_ARRAY_DIMENSIONS': ['time'],
 'calendar': 'proleptic_gregorian',
 'units': 'days since 1950-01-02'}
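
For debugging, it can help to look at the raw integers before any CF decoding. The sketch below uses only the units and shape from the metadata above. It assumes (this is not stated in the post) that the series is daily and starts at offset 0, so that valid offsets lie in 0..26693. The helper names decode_days and suspicious_offsets are made up for illustration.

```python
from datetime import date, timedelta

# Values copied from the metadata above.
EPOCH = date(1950, 1, 2)   # 'units': 'days since 1950-01-02'
N_STEPS = 26694            # 'shape': [26694]


def decode_days(offset: int) -> date:
    """Decode one raw 'days since EPOCH' offset into a calendar date."""
    return EPOCH + timedelta(days=offset)


def suspicious_offsets(offsets, lo=0, hi=N_STEPS - 1):
    """Return (position, value) pairs whose raw offset lies outside [lo, hi].

    The default bounds encode the assumption of a daily series starting at 0.
    """
    return [(i, v) for i, v in enumerate(offsets) if not lo <= v <= hi]
```

Running suspicious_offsets over the raw values read from each location (with decoding switched off) would show whether the s3 copy returns integers that cannot belong to the series, and at which positions.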

Also, every now and then when loading the dataset via s3, I get an OverflowError:

fs = s3fs.S3FileSystem()
s3_store = zarr.storage.FSStore(url, mode="r", fs=fs, check=False, create=False)
ds = xr.open_zarr(s3_store, consolidated=True, chunks="auto") # , use_cftime=True,decode_cf=False)
ds

I am really quite lost. Any pointers on where I can start debugging this further? I find the dtype of this time coordinate, ‘<i8’, quite strange, but it does seem to work for the zarr on the filesystem.

---------------------------------------------------------------------------
OverflowError                             Traceback (most recent call last)
File conversion.pyx:142, in pandas._libs.tslibs.conversion.cast_from_unit()

OverflowError: Python int too large to convert to C long

The above exception was the direct cause of the following exception:

OutOfBoundsDatetime                       Traceback (most recent call last)
File timedeltas.pyx:383, in pandas._libs.tslibs.timedeltas._maybe_cast_from_unit()

File conversion.pyx:144, in pandas._libs.tslibs.conversion.cast_from_unit()

OutOfBoundsDatetime: cannot convert input -8645917132517528928 with the unit 'D'

The above exception was the direct cause of the following exception:

OutOfBoundsTimedelta                      Traceback (most recent call last)
File ~/miniforge3/envs/hydromtcp/lib/python3.11/site-packages/xarray/coding/times.py:319, in decode_cf_datetime(num_dates, units, calendar, use_cftime)
    318 try:
--> 319     dates = _decode_datetime_with_pandas(flat_num_dates, units, calendar)
    320 except (KeyError, OutOfBoundsDatetime, OutOfBoundsTimedelta, OverflowError):

File ~/miniforge3/envs/hydromtcp/lib/python3.11/site-packages/xarray/coding/times.py:267, in _decode_datetime_with_pandas(flat_num_dates, units, calendar)
    265 if flat_num_dates.size > 0:
    266     # avoid size 0 datetimes GH1329
--> 267     pd.to_timedelta(flat_num_dates.min(), time_units) + ref_date
    268     pd.to_timedelta(flat_num_dates.max(), time_units) + ref_date

File ~/miniforge3/envs/hydromtcp/lib/python3.11/site-packages/pandas/core/tools/timedeltas.py:223, in to_timedelta(arg, unit, errors)
    222 # ...so it must be a scalar value. Return scalar.
--> 223 return _coerce_scalar_to_timedelta_type(arg, unit=unit, errors=errors)

File ~/miniforge3/envs/hydromtcp/lib/python3.11/site-packages/pandas/core/tools/timedeltas.py:233, in _coerce_scalar_to_timedelta_type(r, unit, errors)
    232 try:
--> 233     result = Timedelta(r, unit)
    234 except ValueError:

File timedeltas.pyx:1872, in pandas._libs.tslibs.timedeltas.Timedelta.__new__()

File timedeltas.pyx:360, in pandas._libs.tslibs.timedeltas.convert_to_timedelta64()

File timedeltas.pyx:385, in pandas._libs.tslibs.timedeltas._maybe_cast_from_unit()

OutOfBoundsTimedelta: Cannot cast -8645917132517528928 from D to 'ns' without overflow.

During handling of the above exception, another exception occurred:

OverflowError                             Traceback (most recent call last)
Cell In[3], line 5
      3 fs = s3fs.S3FileSystem()
      4 s3_store = zarr.storage.FSStore(url, mode="r", fs=fs, check=False, create=False)
----> 5 ds = xr.open_zarr(s3_store, consolidated=True, chunks="auto") # , use_cftime=True,decode_cf=False)
      6 ds

File ~/miniforge3/envs/hydromtcp/lib/python3.11/site-packages/xarray/backends/zarr.py:900, in open_zarr(store, group, synchronizer, chunks, decode_cf, mask_and_scale, decode_times, concat_characters, decode_coords, drop_variables, consolidated, overwrite_encoded_chunks, chunk_store, storage_options, decode_timedelta, use_cftime, zarr_version, chunked_array_type, from_array_kwargs, **kwargs)
    886     raise TypeError(
    887         "open_zarr() got unexpected keyword arguments " + ",".join(kwargs.keys())
    888     )
    890 backend_kwargs = {
    891     "synchronizer": synchronizer,
    892     "consolidated": consolidated,
   (...)
    897     "zarr_version": zarr_version,
    898 }
--> 900 ds = open_dataset(
    901     filename_or_obj=store,
    902     group=group,
    903     decode_cf=decode_cf,
    904     mask_and_scale=mask_and_scale,
    905     decode_times=decode_times,
    906     concat_characters=concat_characters,
    907     decode_coords=decode_coords,
    908     engine="zarr",
    909     chunks=chunks,
    910     drop_variables=drop_variables,
    911     chunked_array_type=chunked_array_type,
    912     from_array_kwargs=from_array_kwargs,
    913     backend_kwargs=backend_kwargs,
    914     decode_timedelta=decode_timedelta,
    915     use_cftime=use_cftime,
    916     zarr_version=zarr_version,
    917 )
    918 return ds

File ~/miniforge3/envs/hydromtcp/lib/python3.11/site-packages/xarray/backends/api.py:573, in open_dataset(filename_or_obj, engine, chunks, cache, decode_cf, mask_and_scale, decode_times, decode_timedelta, use_cftime, concat_characters, decode_coords, drop_variables, inline_array, chunked_array_type, from_array_kwargs, backend_kwargs, **kwargs)
    561 decoders = _resolve_decoders_kwargs(
    562     decode_cf,
    563     open_backend_dataset_parameters=backend.open_dataset_parameters,
   (...)
    569     decode_coords=decode_coords,
    570 )
    572 overwrite_encoded_chunks = kwargs.pop("overwrite_encoded_chunks", None)
--> 573 backend_ds = backend.open_dataset(
    574     filename_or_obj,
    575     drop_variables=drop_variables,
    576     **decoders,
    577     **kwargs,
    578 )
    579 ds = _dataset_from_backend_dataset(
    580     backend_ds,
    581     filename_or_obj,
   (...)
    591     **kwargs,
    592 )
    593 return ds

File ~/miniforge3/envs/hydromtcp/lib/python3.11/site-packages/xarray/backends/zarr.py:982, in ZarrBackendEntrypoint.open_dataset(self, filename_or_obj, mask_and_scale, decode_times, concat_characters, decode_coords, drop_variables, use_cftime, decode_timedelta, group, mode, synchronizer, consolidated, chunk_store, storage_options, stacklevel, zarr_version)
    980 store_entrypoint = StoreBackendEntrypoint()
    981 with close_on_error(store):
--> 982     ds = store_entrypoint.open_dataset(
    983         store,
    984         mask_and_scale=mask_and_scale,
    985         decode_times=decode_times,
    986         concat_characters=concat_characters,
    987         decode_coords=decode_coords,
    988         drop_variables=drop_variables,
    989         use_cftime=use_cftime,
    990         decode_timedelta=decode_timedelta,
    991     )
    992 return ds

File ~/miniforge3/envs/hydromtcp/lib/python3.11/site-packages/xarray/backends/store.py:58, in StoreBackendEntrypoint.open_dataset(self, filename_or_obj, mask_and_scale, decode_times, concat_characters, decode_coords, drop_variables, use_cftime, decode_timedelta)
     44 encoding = filename_or_obj.get_encoding()
     46 vars, attrs, coord_names = conventions.decode_cf_variables(
     47     vars,
     48     attrs,
   (...)
     55     decode_timedelta=decode_timedelta,
     56 )
---> 58 ds = Dataset(vars, attrs=attrs)
     59 ds = ds.set_coords(coord_names.intersection(vars))
     60 ds.set_close(filename_or_obj.close)

File ~/miniforge3/envs/hydromtcp/lib/python3.11/site-packages/xarray/core/dataset.py:696, in Dataset.__init__(self, data_vars, coords, attrs)
    693 if isinstance(coords, Dataset):
    694     coords = coords._variables
--> 696 variables, coord_names, dims, indexes, _ = merge_data_and_coords(
    697     data_vars, coords
    698 )
    700 self._attrs = dict(attrs) if attrs is not None else None
    701 self._close = None

File ~/miniforge3/envs/hydromtcp/lib/python3.11/site-packages/xarray/core/dataset.py:425, in merge_data_and_coords(data_vars, coords)
    421     coords = create_coords_with_default_indexes(coords, data_vars)
    423 # exclude coords from alignment (all variables in a Coordinates object should
    424 # already be aligned together) and use coordinates' indexes to align data_vars
--> 425 return merge_core(
    426     [data_vars, coords],
    427     compat="broadcast_equals",
    428     join="outer",
    429     explicit_coords=tuple(coords),
    430     indexes=coords.xindexes,
    431     priority_arg=1,
    432     skip_align_args=[1],
    433 )

File ~/miniforge3/envs/hydromtcp/lib/python3.11/site-packages/xarray/core/merge.py:718, in merge_core(objects, compat, join, combine_attrs, priority_arg, explicit_coords, indexes, fill_value, skip_align_args)
    715 for pos, obj in skip_align_objs:
    716     aligned.insert(pos, obj)
--> 718 collected = collect_variables_and_indexes(aligned, indexes=indexes)
    719 prioritized = _get_priority_vars_and_indexes(aligned, priority_arg, compat=compat)
    720 variables, out_indexes = merge_collected(
    721     collected, prioritized, compat=compat, combine_attrs=combine_attrs
    722 )

File ~/miniforge3/envs/hydromtcp/lib/python3.11/site-packages/xarray/core/merge.py:358, in collect_variables_and_indexes(list_of_mappings, indexes)
    355     indexes_.pop(name, None)
    356     append_all(coords_, indexes_)
--> 358 variable = as_variable(variable, name=name)
    359 if name in indexes:
    360     append(name, variable, indexes[name])

File ~/miniforge3/envs/hydromtcp/lib/python3.11/site-packages/xarray/core/variable.py:158, in as_variable(obj, name)
    151     raise TypeError(
    152         f"Variable {name!r}: unable to convert object into a variable without an "
    153         f"explicit list of dimensions: {obj!r}"
    154     )
    156 if name is not None and name in obj.dims and obj.ndim == 1:
    157     # automatically convert the Variable into an Index
--> 158     obj = obj.to_index_variable()
    160 return obj

File ~/miniforge3/envs/hydromtcp/lib/python3.11/site-packages/xarray/core/variable.py:571, in Variable.to_index_variable(self)
    569 def to_index_variable(self) -> IndexVariable:
    570     """Return this variable as an xarray.IndexVariable"""
--> 571     return IndexVariable(
    572         self._dims, self._data, self._attrs, encoding=self._encoding, fastpath=True
    573     )

File ~/miniforge3/envs/hydromtcp/lib/python3.11/site-packages/xarray/core/variable.py:2627, in IndexVariable.__init__(self, dims, data, attrs, encoding, fastpath)
   2625 # Unlike in Variable, always eagerly load values into memory
   2626 if not isinstance(self._data, PandasIndexingAdapter):
-> 2627     self._data = PandasIndexingAdapter(self._data)

File ~/miniforge3/envs/hydromtcp/lib/python3.11/site-packages/xarray/core/indexing.py:1481, in PandasIndexingAdapter.__init__(self, array, dtype)
   1478 def __init__(self, array: pd.Index, dtype: DTypeLike = None):
   1479     from xarray.core.indexes import safe_cast_to_index
-> 1481     self.array = safe_cast_to_index(array)
   1483     if dtype is None:
   1484         self._dtype = get_valid_numpy_dtype(array)

File ~/miniforge3/envs/hydromtcp/lib/python3.11/site-packages/xarray/core/indexes.py:469, in safe_cast_to_index(array)
    459             emit_user_level_warning(
    460                 (
    461                     "`pandas.Index` does not support the `float16` dtype."
   (...)
    465                 category=DeprecationWarning,
    466             )
    467             kwargs["dtype"] = "float64"
--> 469     index = pd.Index(np.asarray(array), **kwargs)
    471 return _maybe_cast_to_cftimeindex(index)

File ~/miniforge3/envs/hydromtcp/lib/python3.11/site-packages/xarray/core/indexing.py:474, in ExplicitlyIndexedNDArrayMixin.__array__(self, dtype)
    471 def __array__(self, dtype: np.typing.DTypeLike = None) -> np.ndarray:
    472     # This is necessary because we apply the indexing key in self.get_duck_array()
    473     # Note this is the base class for all lazy indexing classes
--> 474     return np.asarray(self.get_duck_array(), dtype=dtype)

File ~/miniforge3/envs/hydromtcp/lib/python3.11/site-packages/xarray/core/indexing.py:560, in LazilyIndexedArray.get_duck_array(self)
    555 # self.array[self.key] is now a numpy array when
    556 # self.array is a BackendArray subclass
    557 # and self.key is BasicIndexer((slice(None, None, None),))
    558 # so we need the explicit check for ExplicitlyIndexed
    559 if isinstance(array, ExplicitlyIndexed):
--> 560     array = array.get_duck_array()
    561 return _wrap_numpy_scalars(array)

File ~/miniforge3/envs/hydromtcp/lib/python3.11/site-packages/xarray/coding/variables.py:74, in _ElementwiseFunctionArray.get_duck_array(self)
     73 def get_duck_array(self):
---> 74     return self.func(self.array.get_duck_array())

File ~/miniforge3/envs/hydromtcp/lib/python3.11/site-packages/xarray/coding/times.py:321, in decode_cf_datetime(num_dates, units, calendar, use_cftime)
    319     dates = _decode_datetime_with_pandas(flat_num_dates, units, calendar)
    320 except (KeyError, OutOfBoundsDatetime, OutOfBoundsTimedelta, OverflowError):
--> 321     dates = _decode_datetime_with_cftime(
    322         flat_num_dates.astype(float), units, calendar
    323     )
    325     if (
    326         dates[np.nanargmin(num_dates)].year < 1678
    327         or dates[np.nanargmax(num_dates)].year >= 2262
    328     ):
    329         if _is_standard_calendar(calendar):

File ~/miniforge3/envs/hydromtcp/lib/python3.11/site-packages/xarray/coding/times.py:237, in _decode_datetime_with_cftime(num_dates, units, calendar)
    234     raise ModuleNotFoundError("No module named 'cftime'")
    235 if num_dates.size > 0:
    236     return np.asarray(
--> 237         cftime.num2date(num_dates, units, calendar, only_use_cftime_datetimes=True)
    238     )
    239 else:
    240     return np.array([], dtype=object)

File src/cftime/_cftime.pyx:617, in cftime._cftime.num2date()

File src/cftime/_cftime.pyx:414, in cftime._cftime.cast_to_int()

OverflowError: time values outside range of 64 bit signed integers
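
The numbers in the traceback can be checked by hand. The value -8645917132517528928 does fit in a signed 64-bit integer, but pandas has to express it in nanoseconds, and that product does not fit. A minimal sketch of the arithmetic, using only the value from the traceback (the function name days_fit_in_int64_ns is made up for illustration):

```python
INT64_MIN, INT64_MAX = -(2**63), 2**63 - 1
NS_PER_DAY = 24 * 60 * 60 * 10**9  # nanoseconds in one day


def days_fit_in_int64_ns(days: int) -> bool:
    """True if `days`, converted to nanoseconds, fits in a signed 64-bit int."""
    return INT64_MIN <= days * NS_PER_DAY <= INT64_MAX


bad_value = -8645917132517528928           # value reported in the traceback
max_days = INT64_MAX // NS_PER_DAY         # largest whole-day count that still fits

print(INT64_MIN <= bad_value <= INT64_MAX)  # the raw integer itself is a valid int64
print(days_fit_in_int64_ns(bad_value))      # but it overflows once scaled to ns
print(days_fit_in_int64_ns(26693))          # a plausible offset for this dataset
```

Any day offset beyond roughly 106751 in magnitude fails the pandas path, so an integer of this size points at a bad read rather than at a calendar or dtype problem.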

josephyang · November 23, 2023, 11:36pm

I actually ran into a similar error quite recently: a very large value for ‘time’, with my script intermittently unable to load the data. One suspicion I had was that fetching the data from the cloud was failing and a null value was being misinterpreted.

Perhaps a bit of a hack, but I tried increasing the ‘retries’ value (default is 5) to 10 and it seems to be helping. I would be curious to see whether this also helps for you. It is set with fs.retries = 10 on the s3fs filesystem object.

jaapel · November 24, 2023, 9:49am

Thanks for the quick reply.
I tried this out; however, I still end up with duplicate time values in the coordinates. I think the problem lies somewhere else.

jaapel · November 24, 2023, 1:52pm

It seems some chunks were missing from the cloud dataset, so that is solved now. I wonder why zarr did not raise an error for this.

josephyang · November 24, 2023, 5:05pm

Good to hear that it’s been solved. I didn’t experience the duplicate time value problem but I did experience issues with missing chunks recently when downloading data from the cloud.

I wonder if there is a way to enforce a check, so that a missing chunk is not incorrectly interpreted as a missing value or null data?

jaapel · November 27, 2023, 8:38am

Yes, some kind of integrity check that lists the chunks would have given me a more logical error message. Checksums may be good for the contents of the chunks.
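
Such a check can be sketched with the standard library alone, since a Zarr v2 store behaves like a mapping from keys to bytes. The code below derives the chunk keys an array should have from the shape and chunks fields of its .zarray, reports the ones absent from a store, and computes SHA-256 digests for comparing two copies. The function names are hypothetical, and the sketch assumes every chunk was actually written; a store that deliberately omits empty chunks would produce false alarms.

```python
import hashlib
import itertools
import math


def expected_chunk_keys(array_path, shape, chunks, separator="."):
    """Yield every chunk key implied by `shape` and `chunks` (Zarr v2 layout).

    `separator` joins the per-dimension indices; it has no effect on 1-D arrays.
    """
    counts = [math.ceil(s / c) for s, c in zip(shape, chunks)]
    prefix = f"{array_path}/" if array_path else ""
    for index in itertools.product(*(range(n) for n in counts)):
        yield prefix + separator.join(str(i) for i in index)


def missing_chunk_keys(store, array_path, shape, chunks, separator="."):
    """Return the expected chunk keys that are not present in `store`."""
    return [
        key
        for key in expected_chunk_keys(array_path, shape, chunks, separator)
        if key not in store
    ]


def chunk_digests(store, keys):
    """Map each key to the SHA-256 hex digest of its stored (compressed) bytes."""
    return {key: hashlib.sha256(store[key]).hexdigest() for key in keys}
```

For the time array in this thread (shape [26694], chunks [365]) the expected keys run from time/0 to time/73. Calling missing_chunk_keys(store, "time", [26694], [365]) on the s3 store and on the mounted-drive store, and comparing chunk_digests between the two, would pinpoint absent or differing objects.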

rabernat · November 27, 2023, 3:36pm

This is tricky, because Zarr currently interprets missing chunks as missing values, in order to save storage costs. I agree that an option to expect all chunks to be present would be useful. Perhaps open a Zarr issue?
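
Until such an option exists, one possible approach is a thin read-only wrapper around the store that turns a missing chunk into a loud error while still answering metadata probes with the usual KeyError. The sketch below is hypothetical: the class names are invented, it has not been run against zarr-python, and a given zarr version may require a MutableMapping or its own store base class instead. A likely explanation for the symptoms in this thread, not confirmed by anyone in it, is that with 'fill_value': None zarr-python 2.x leaves the region of a missing chunk unfilled, which would be consistent with garbage integers that change on every read.

```python
from collections.abc import Mapping


class MissingChunkError(RuntimeError):
    """Raised instead of KeyError when a chunk object is absent from the store."""


class StrictChunkStore(Mapping):
    """Read-only view of a key/bytes store that refuses to hide missing chunks."""

    def __init__(self, inner):
        self._inner = inner

    @staticmethod
    def _is_metadata(key):
        # Zarr v2 metadata objects are named .zarray, .zgroup, .zattrs, .zmetadata.
        return key.rsplit("/", 1)[-1].startswith(".z")

    def __getitem__(self, key):
        try:
            return self._inner[key]
        except KeyError:
            if self._is_metadata(key):
                raise  # absent metadata is normal; keep the KeyError
            raise MissingChunkError(f"chunk {key!r} not found in store") from None

    def __contains__(self, key):
        # Delegate so that membership tests never raise MissingChunkError.
        return key in self._inner

    def __iter__(self):
        return iter(self._inner)

    def __len__(self):
        return len(self._inner)
```

Because MissingChunkError is not a KeyError, code that catches only KeyError to substitute a fill value would let it propagate, so the read fails at the first absent chunk.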

josephyang · December 16, 2023, 4:35pm

It seems this issue has been raised before, and there seem to be some workarounds, although I haven’t had a chance to try them yet:

How to prevent Zarr from returning NaN for missing chunks? · Issue #486 · zarr-developers/zarr-python · GitHub
Potential bad interaction with zarr and missing chunks · Issue #255 · fsspec/filesystem_spec · GitHub
