I have two zarr stores on S3 representing the same data, chunked differently. The dataset is straightforward: a single variable and 3 coordinate dimensions (XYT), with just 4 directory objects. Both stores are consolidated. xarray.open_zarr takes about 1 second on one of them and ~4 seconds on the other (using fsspec and s3fs 0.5.1; 30-50 seconds with s3fs 0.4.2, which is the version installed on the AWS Pangeo Hub base notebook image!). Yet the two stores are not very dissimilar in number of objects and chunks.
- swe_run_a-ts.zarr: ~1 sec latency. 697 files; 680 chunks, 41.40 MB per chunk
- swe_run_a-geo.zarr: ~4 sec latency. 927 files; 457 chunks, 54.75 MB per chunk
Store #2 (geo) has about a third more files than #1 (ts) (927 vs. 697) and chunks about a third larger (54.75 vs. 41.40 MB), though fewer of them (457 vs. 680); yet its latency is 3 to 5 times higher (and up to 50 times with s3fs 0.4.2!). Also, at 931 objects (927 + 4), #2 is still below the 1000-object pagination threshold pointed out elsewhere [1], which involves implicit directory listing [2].
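To make the comparison concrete, here is a quick back-of-the-envelope calculation that uses only the numbers quoted in this post. The dictionary names are just for illustration, and reading "chunks" as the data variable's chunks only (so that the remaining files are coordinate chunks and metadata) is an assumption.

```python
# Figures copied from the two store summaries and the wall times in this post.
ts = {"files": 697, "chunks": 680, "wall_s": 1.15}
geo = {"files": 927, "chunks": 457, "wall_s": 3.92}

# Ratio of geo to ts for each quantity.
ratios = {key: geo[key] / ts[key] for key in ts}

# Assumption: "chunks" counts only the data variable's chunks, so the
# remainder is coordinate chunks plus metadata objects.
other_files = {
    "ts": ts["files"] - ts["chunks"],
    "geo": geo["files"] - geo["chunks"],
}

for key, value in ratios.items():
    print(f"{key}: geo/ts = {value:.2f}")
print("files that are not data-variable chunks:", other_files)
```

If that assumption holds, the two stores differ far more in the number of non-data files than in the total file count, which may be worth looking into.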
I can live with a 4 sec latency, but it’s still mystifying why it gets so much worse for #2 relative to #1. Can anyone illuminate this? How can I tweak the chunking in #2 to keep its latency much closer to #1?
Here's minimal code that replicates the issue:
import fsspec
import xarray as xr

%%time
ts_ds = xr.open_zarr(
    store=fsspec.get_mapper("s3://snowmodel/swe_run_a-ts.zarr", anon=True),
    consolidated=True
)
CPU times: user 338 ms, sys: 20.1 ms, total: 358 ms
Wall time: 1.15 s

%%time
geo_ds = xr.open_zarr(
    store=fsspec.get_mapper("s3://snowmodel/swe_run_a-geo.zarr", anon=True),
    consolidated=True
)
CPU times: user 1.2 s, sys: 34.8 ms, total: 1.23 s
Wall time: 3.92 s
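One way to dig further might be to read the consolidated metadata directly and count how many chunks each array is split into. The sketch below assumes zarr v2 consolidated metadata, i.e. a `.zmetadata` key holding JSON with a `metadata` mapping whose `*/.zarray` entries carry `shape` and `chunks`. The helper name `chunk_counts` is hypothetical, not part of zarr or xarray.

```python
import json
import math
from typing import Mapping


def chunk_counts(store: Mapping[str, bytes]) -> dict:
    """Return {array_name: number_of_chunks} from zarr v2 consolidated metadata."""
    consolidated = json.loads(store[".zmetadata"])
    counts = {}
    for key, meta in consolidated["metadata"].items():
        # Only array metadata entries describe shape and chunking.
        if not key.endswith(".zarray"):
            continue
        # "swe/.zarray" -> "swe"; a root-level ".zarray" maps to "".
        name = key.rsplit("/", 1)[0] if "/" in key else ""
        n_chunks = 1
        for size, chunk in zip(meta["shape"], meta["chunks"]):
            n_chunks *= math.ceil(size / chunk)
        counts[name] = n_chunks
    return counts
```

With the stores above this would be called as `chunk_counts(fsspec.get_mapper("s3://snowmodel/swe_run_a-geo.zarr", anon=True))`. If a coordinate array turns out to be split into many small chunks, that could be relevant, since xarray reads dimension coordinates when it opens a dataset; whether that explains the gap here is unverified.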
- [1] https://github.com/dask/s3fs/issues/279
- [2] https://github.com/dask/s3fs/issues/285 (it looks like this issue has already been addressed)
Package and system versions ...
- s3fs: 0.5.1
- fsspec: 0.8.4
xarray.show_versions()
INSTALLED VERSIONS
commit: None
python: 3.9.0 | packaged by conda-forge | (default, Oct 14 2020, 22:59:50)
[GCC 7.5.0]
python-bits: 64
OS: Linux
OS-release: 5.3.0-7648-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8
libhdf5: None
libnetcdf: None
xarray: 0.16.2
pandas: 1.1.4
numpy: 1.19.4
scipy: None
netCDF4: None
pydap: None
h5netcdf: None
h5py: None
Nio: None
zarr: 2.6.1
cftime: None
nc_time_axis: None
PseudoNetCDF: None
rasterio: None
cfgrib: None
iris: None
bottleneck: None
dask: None
distributed: None
matplotlib: 3.3.3
cartopy: None
seaborn: None
numbagg: None
pint: None
setuptools: 49.6.0.post20201009
pip: 20.3.1
conda: None
pytest: None
IPython: 7.19.0
sphinx: None