Hi all,
I am running into an issue when trying to switch to reading zarr files over s3fs.
There is a working version of the dataset available on a mounted virtual drive, which I can access perfectly fine with both the Python zarr library and the xarray.open_zarr accessor.
I uploaded the same dataset to an S3 bucket, and there I am getting some interesting results.
The time index (cftime) contains duplicates when I open from S3:
import numpy as np
import zarr

zgroup = zarr.open(s3_store)   # s3_store set up as in the snippet further down
times = list(zgroup.time)
npt = np.asarray(times)
u, c = np.unique(npt, return_counts=True)
dup = u[c > 1]                 # time values that occur more than once
dup
This shows a small subset of my time data as duplicated when I use the data on S3, and the set of duplicates is different on every read! The dup array is always empty when I read from the mounted drive. I checked the .zarray and .zmetadata files, which are of course the same between the two locations:
{'chunks': [365],
'compressor': {'blocksize': 0,
'clevel': 5,
'cname': 'lz4',
'id': 'blosc',
'shuffle': 1},
'dtype': '<i8',
'fill_value': None,
'filters': None,
'order': 'C',
'shape': [26694],
'zarr_format': 2}
{'_ARRAY_DIMENSIONS': ['time'],
'calendar': 'proleptic_gregorian',
'units': 'days since 1950-01-02'}
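For what it is worth, decoding the raw integers by hand with cftime (a minimal sketch, reusing the s3_store object from the snippets here) should show whether the stored values themselves are already wrong before xarray's decoding runs, e.g. whether something like the huge negative number in the traceback further down is actually present in the array:

import cftime
import numpy as np
import zarr

zgroup = zarr.open(s3_store)              # same store as in the other snippets
raw = np.asarray(zgroup.time)             # raw '<i8' offsets, per the .zarray above
print(raw.min(), raw.max())               # expect small day counts, not huge integers
dates = cftime.num2date(raw, "days since 1950-01-02", calendar="proleptic_gregorian")
print(dates[0], dates[-1])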
Also, every now and then when loading the dataset via S3, I get an OverflowError:
import s3fs
import xarray as xr
import zarr

fs = s3fs.S3FileSystem()
s3_store = zarr.storage.FSStore(url, mode="r", fs=fs, check=False, create=False)  # url points at the zarr store on S3
ds = xr.open_zarr(s3_store, consolidated=True, chunks="auto")  # , use_cftime=True, decode_cf=False
ds
I am really quite lost. Any pointers to where I could start further debugging this would be appreciated. I find the dtype for this time coordinate quite strange ('<i8'), but it does seem to work fine for the zarr on the mounted filesystem.
---------------------------------------------------------------------------
OverflowError Traceback (most recent call last)
File conversion.pyx:142, in pandas._libs.tslibs.conversion.cast_from_unit()
OverflowError: Python int too large to convert to C long
The above exception was the direct cause of the following exception:
OutOfBoundsDatetime Traceback (most recent call last)
File timedeltas.pyx:383, in pandas._libs.tslibs.timedeltas._maybe_cast_from_unit()
File conversion.pyx:144, in pandas._libs.tslibs.conversion.cast_from_unit()
OutOfBoundsDatetime: cannot convert input -8645917132517528928 with the unit 'D'
The above exception was the direct cause of the following exception:
OutOfBoundsTimedelta Traceback (most recent call last)
File ~/miniforge3/envs/hydromtcp/lib/python3.11/site-packages/xarray/coding/times.py:319, in decode_cf_datetime(num_dates, units, calendar, use_cftime)
318 try:
--> 319 dates = _decode_datetime_with_pandas(flat_num_dates, units, calendar)
320 except (KeyError, OutOfBoundsDatetime, OutOfBoundsTimedelta, OverflowError):
File ~/miniforge3/envs/hydromtcp/lib/python3.11/site-packages/xarray/coding/times.py:267, in _decode_datetime_with_pandas(flat_num_dates, units, calendar)
265 if flat_num_dates.size > 0:
266 # avoid size 0 datetimes GH1329
--> 267 pd.to_timedelta(flat_num_dates.min(), time_units) + ref_date
268 pd.to_timedelta(flat_num_dates.max(), time_units) + ref_date
File ~/miniforge3/envs/hydromtcp/lib/python3.11/site-packages/pandas/core/tools/timedeltas.py:223, in to_timedelta(arg, unit, errors)
222 # ...so it must be a scalar value. Return scalar.
--> 223 return _coerce_scalar_to_timedelta_type(arg, unit=unit, errors=errors)
File ~/miniforge3/envs/hydromtcp/lib/python3.11/site-packages/pandas/core/tools/timedeltas.py:233, in _coerce_scalar_to_timedelta_type(r, unit, errors)
232 try:
--> 233 result = Timedelta(r, unit)
234 except ValueError:
File timedeltas.pyx:1872, in pandas._libs.tslibs.timedeltas.Timedelta.__new__()
File timedeltas.pyx:360, in pandas._libs.tslibs.timedeltas.convert_to_timedelta64()
File timedeltas.pyx:385, in pandas._libs.tslibs.timedeltas._maybe_cast_from_unit()
OutOfBoundsTimedelta: Cannot cast -8645917132517528928 from D to 'ns' without overflow.
During handling of the above exception, another exception occurred:
OverflowError Traceback (most recent call last)
Cell In[3], line 5
3 fs = s3fs.S3FileSystem()
4 s3_store = zarr.storage.FSStore(url, mode="r", fs=fs, check=False, create=False)
----> 5 ds = xr.open_zarr(s3_store, consolidated=True, chunks="auto") # , use_cftime=True,decode_cf=False)
6 ds
File ~/miniforge3/envs/hydromtcp/lib/python3.11/site-packages/xarray/backends/zarr.py:900, in open_zarr(store, group, synchronizer, chunks, decode_cf, mask_and_scale, decode_times, concat_characters, decode_coords, drop_variables, consolidated, overwrite_encoded_chunks, chunk_store, storage_options, decode_timedelta, use_cftime, zarr_version, chunked_array_type, from_array_kwargs, **kwargs)
886 raise TypeError(
887 "open_zarr() got unexpected keyword arguments " + ",".join(kwargs.keys())
888 )
890 backend_kwargs = {
891 "synchronizer": synchronizer,
892 "consolidated": consolidated,
(...)
897 "zarr_version": zarr_version,
898 }
--> 900 ds = open_dataset(
901 filename_or_obj=store,
902 group=group,
903 decode_cf=decode_cf,
904 mask_and_scale=mask_and_scale,
905 decode_times=decode_times,
906 concat_characters=concat_characters,
907 decode_coords=decode_coords,
908 engine="zarr",
909 chunks=chunks,
910 drop_variables=drop_variables,
911 chunked_array_type=chunked_array_type,
912 from_array_kwargs=from_array_kwargs,
913 backend_kwargs=backend_kwargs,
914 decode_timedelta=decode_timedelta,
915 use_cftime=use_cftime,
916 zarr_version=zarr_version,
917 )
918 return ds
File ~/miniforge3/envs/hydromtcp/lib/python3.11/site-packages/xarray/backends/api.py:573, in open_dataset(filename_or_obj, engine, chunks, cache, decode_cf, mask_and_scale, decode_times, decode_timedelta, use_cftime, concat_characters, decode_coords, drop_variables, inline_array, chunked_array_type, from_array_kwargs, backend_kwargs, **kwargs)
561 decoders = _resolve_decoders_kwargs(
562 decode_cf,
563 open_backend_dataset_parameters=backend.open_dataset_parameters,
(...)
569 decode_coords=decode_coords,
570 )
572 overwrite_encoded_chunks = kwargs.pop("overwrite_encoded_chunks", None)
--> 573 backend_ds = backend.open_dataset(
574 filename_or_obj,
575 drop_variables=drop_variables,
576 **decoders,
577 **kwargs,
578 )
579 ds = _dataset_from_backend_dataset(
580 backend_ds,
581 filename_or_obj,
(...)
591 **kwargs,
592 )
593 return ds
File ~/miniforge3/envs/hydromtcp/lib/python3.11/site-packages/xarray/backends/zarr.py:982, in ZarrBackendEntrypoint.open_dataset(self, filename_or_obj, mask_and_scale, decode_times, concat_characters, decode_coords, drop_variables, use_cftime, decode_timedelta, group, mode, synchronizer, consolidated, chunk_store, storage_options, stacklevel, zarr_version)
980 store_entrypoint = StoreBackendEntrypoint()
981 with close_on_error(store):
--> 982 ds = store_entrypoint.open_dataset(
983 store,
984 mask_and_scale=mask_and_scale,
985 decode_times=decode_times,
986 concat_characters=concat_characters,
987 decode_coords=decode_coords,
988 drop_variables=drop_variables,
989 use_cftime=use_cftime,
990 decode_timedelta=decode_timedelta,
991 )
992 return ds
File ~/miniforge3/envs/hydromtcp/lib/python3.11/site-packages/xarray/backends/store.py:58, in StoreBackendEntrypoint.open_dataset(self, filename_or_obj, mask_and_scale, decode_times, concat_characters, decode_coords, drop_variables, use_cftime, decode_timedelta)
44 encoding = filename_or_obj.get_encoding()
46 vars, attrs, coord_names = conventions.decode_cf_variables(
47 vars,
48 attrs,
(...)
55 decode_timedelta=decode_timedelta,
56 )
---> 58 ds = Dataset(vars, attrs=attrs)
59 ds = ds.set_coords(coord_names.intersection(vars))
60 ds.set_close(filename_or_obj.close)
File ~/miniforge3/envs/hydromtcp/lib/python3.11/site-packages/xarray/core/dataset.py:696, in Dataset.__init__(self, data_vars, coords, attrs)
693 if isinstance(coords, Dataset):
694 coords = coords._variables
--> 696 variables, coord_names, dims, indexes, _ = merge_data_and_coords(
697 data_vars, coords
698 )
700 self._attrs = dict(attrs) if attrs is not None else None
701 self._close = None
File ~/miniforge3/envs/hydromtcp/lib/python3.11/site-packages/xarray/core/dataset.py:425, in merge_data_and_coords(data_vars, coords)
421 coords = create_coords_with_default_indexes(coords, data_vars)
423 # exclude coords from alignment (all variables in a Coordinates object should
424 # already be aligned together) and use coordinates' indexes to align data_vars
--> 425 return merge_core(
426 [data_vars, coords],
427 compat="broadcast_equals",
428 join="outer",
429 explicit_coords=tuple(coords),
430 indexes=coords.xindexes,
431 priority_arg=1,
432 skip_align_args=[1],
433 )
File ~/miniforge3/envs/hydromtcp/lib/python3.11/site-packages/xarray/core/merge.py:718, in merge_core(objects, compat, join, combine_attrs, priority_arg, explicit_coords, indexes, fill_value, skip_align_args)
715 for pos, obj in skip_align_objs:
716 aligned.insert(pos, obj)
--> 718 collected = collect_variables_and_indexes(aligned, indexes=indexes)
719 prioritized = _get_priority_vars_and_indexes(aligned, priority_arg, compat=compat)
720 variables, out_indexes = merge_collected(
721 collected, prioritized, compat=compat, combine_attrs=combine_attrs
722 )
File ~/miniforge3/envs/hydromtcp/lib/python3.11/site-packages/xarray/core/merge.py:358, in collect_variables_and_indexes(list_of_mappings, indexes)
355 indexes_.pop(name, None)
356 append_all(coords_, indexes_)
--> 358 variable = as_variable(variable, name=name)
359 if name in indexes:
360 append(name, variable, indexes[name])
File ~/miniforge3/envs/hydromtcp/lib/python3.11/site-packages/xarray/core/variable.py:158, in as_variable(obj, name)
151 raise TypeError(
152 f"Variable {name!r}: unable to convert object into a variable without an "
153 f"explicit list of dimensions: {obj!r}"
154 )
156 if name is not None and name in obj.dims and obj.ndim == 1:
157 # automatically convert the Variable into an Index
--> 158 obj = obj.to_index_variable()
160 return obj
File ~/miniforge3/envs/hydromtcp/lib/python3.11/site-packages/xarray/core/variable.py:571, in Variable.to_index_variable(self)
569 def to_index_variable(self) -> IndexVariable:
570 """Return this variable as an xarray.IndexVariable"""
--> 571 return IndexVariable(
572 self._dims, self._data, self._attrs, encoding=self._encoding, fastpath=True
573 )
File ~/miniforge3/envs/hydromtcp/lib/python3.11/site-packages/xarray/core/variable.py:2627, in IndexVariable.__init__(self, dims, data, attrs, encoding, fastpath)
2625 # Unlike in Variable, always eagerly load values into memory
2626 if not isinstance(self._data, PandasIndexingAdapter):
-> 2627 self._data = PandasIndexingAdapter(self._data)
File ~/miniforge3/envs/hydromtcp/lib/python3.11/site-packages/xarray/core/indexing.py:1481, in PandasIndexingAdapter.__init__(self, array, dtype)
1478 def __init__(self, array: pd.Index, dtype: DTypeLike = None):
1479 from xarray.core.indexes import safe_cast_to_index
-> 1481 self.array = safe_cast_to_index(array)
1483 if dtype is None:
1484 self._dtype = get_valid_numpy_dtype(array)
File ~/miniforge3/envs/hydromtcp/lib/python3.11/site-packages/xarray/core/indexes.py:469, in safe_cast_to_index(array)
459 emit_user_level_warning(
460 (
461 "`pandas.Index` does not support the `float16` dtype."
(...)
465 category=DeprecationWarning,
466 )
467 kwargs["dtype"] = "float64"
--> 469 index = pd.Index(np.asarray(array), **kwargs)
471 return _maybe_cast_to_cftimeindex(index)
File ~/miniforge3/envs/hydromtcp/lib/python3.11/site-packages/xarray/core/indexing.py:474, in ExplicitlyIndexedNDArrayMixin.__array__(self, dtype)
471 def __array__(self, dtype: np.typing.DTypeLike = None) -> np.ndarray:
472 # This is necessary because we apply the indexing key in self.get_duck_array()
473 # Note this is the base class for all lazy indexing classes
--> 474 return np.asarray(self.get_duck_array(), dtype=dtype)
File ~/miniforge3/envs/hydromtcp/lib/python3.11/site-packages/xarray/core/indexing.py:560, in LazilyIndexedArray.get_duck_array(self)
555 # self.array[self.key] is now a numpy array when
556 # self.array is a BackendArray subclass
557 # and self.key is BasicIndexer((slice(None, None, None),))
558 # so we need the explicit check for ExplicitlyIndexed
559 if isinstance(array, ExplicitlyIndexed):
--> 560 array = array.get_duck_array()
561 return _wrap_numpy_scalars(array)
File ~/miniforge3/envs/hydromtcp/lib/python3.11/site-packages/xarray/coding/variables.py:74, in _ElementwiseFunctionArray.get_duck_array(self)
73 def get_duck_array(self):
---> 74 return self.func(self.array.get_duck_array())
File ~/miniforge3/envs/hydromtcp/lib/python3.11/site-packages/xarray/coding/times.py:321, in decode_cf_datetime(num_dates, units, calendar, use_cftime)
319 dates = _decode_datetime_with_pandas(flat_num_dates, units, calendar)
320 except (KeyError, OutOfBoundsDatetime, OutOfBoundsTimedelta, OverflowError):
--> 321 dates = _decode_datetime_with_cftime(
322 flat_num_dates.astype(float), units, calendar
323 )
325 if (
326 dates[np.nanargmin(num_dates)].year < 1678
327 or dates[np.nanargmax(num_dates)].year >= 2262
328 ):
329 if _is_standard_calendar(calendar):
File ~/miniforge3/envs/hydromtcp/lib/python3.11/site-packages/xarray/coding/times.py:237, in _decode_datetime_with_cftime(num_dates, units, calendar)
234 raise ModuleNotFoundError("No module named 'cftime'")
235 if num_dates.size > 0:
236 return np.asarray(
--> 237 cftime.num2date(num_dates, units, calendar, only_use_cftime_datetimes=True)
238 )
239 else:
240 return np.array([], dtype=object)
File src/cftime/_cftime.pyx:617, in cftime._cftime.num2date()
File src/cftime/_cftime.pyx:414, in cftime._cftime.cast_to_int()
OverflowError: time values outside range of 64 bit signed integers
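For completeness, this is the kind of check I plan to run next to narrow it down (a rough sketch using the same fs / s3_store objects as above, not a fix): compare two consecutive raw reads of the time array, and open the dataset with decode_times=False so the undecoded integers are visible without triggering the error.

import numpy as np
import xarray as xr
import zarr

# Do two consecutive reads of the same array from S3 actually agree?
first = np.asarray(zarr.open(s3_store).time)
second = np.asarray(zarr.open(s3_store).time)
print("reads identical:", np.array_equal(first, second))

# Skip datetime decoding entirely, so the OverflowError cannot trigger and the
# raw int64 day offsets can be inspected directly.
ds_raw = xr.open_zarr(s3_store, consolidated=True, decode_times=False)
print(ds_raw.time.values.min(), ds_raw.time.values.max())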