I have two zarr stores on S3 representing the same data chunked differently. The dataset is straightforward, with a single variable and 3 coordinate dimensions (XYT), with just 4 directory objects. Both stores are consolidated.
xarray.open_zarr takes about 1 second on one of them and about 4 seconds on the other (using fsspec and s3fs 0.5.1; with s3fs 0.4.2, the version installed on the AWS Pangeo Hub base notebook image, it takes 30-50 seconds!). Yet the two stores are not very dissimilar in number of objects and chunks:
1. swe_run_a-ts.zarr: ~1 sec latency. 697 files; 680 chunks at 41.40 MB per chunk
2. swe_run_a-geo.zarr: ~4 sec latency. 927 files; 457 chunks at 54.75 MB per chunk
The file and chunk counts of #2 and #1 differ by less than 50%, yet the latency of #2 is 3x to 5x higher (and up to 50x with s3fs 0.4.2!). Also, at 931 objects (927+4), #2 is still below the 1000-object pagination threshold issue pointed out elsewhere, involving implicit directory listing.
I can live with a 4 sec latency, but it’s still mystifying why it gets so much worse for #2 relative to #1. Can anyone illuminate this? How can I tweak the chunking in #2 to keep its latency much closer to #1?
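One thing that might be worth checking is how many chunks each individual array (the data variable and each coordinate) is split into, since the totals above do not say how the files are distributed. As far as I understand, xarray reads dimension coordinates when it opens a dataset, so a coordinate split into many small chunks could add requests. The sketch below is only an assumption-laden helper, not part of xarray or zarr: `chunks_per_array` is a hypothetical name, and it relies on the Zarr v2 consolidated layout where `.zmetadata` is JSON with a `metadata` dict holding each array's `.zarray` entry.

```python
import json
from math import ceil, prod
from typing import Dict, Mapping


def chunks_per_array(store: Mapping[str, bytes]) -> Dict[str, int]:
    """Count chunks per array using only the consolidated metadata.

    `store` is any key/value mapping of the Zarr store, for example the
    object returned by fsspec.get_mapper(...). Hypothetical helper.
    """
    consolidated = json.loads(store[".zmetadata"])["metadata"]
    counts: Dict[str, int] = {}
    for key, meta in consolidated.items():
        if not key.endswith(".zarray"):
            continue  # skip .zgroup and .zattrs entries
        # "swe/.zarray" -> "swe"; a root-level ".zarray" -> ""
        name = key.rsplit("/", 1)[0] if "/" in key else ""
        # Number of chunks = product over dimensions of ceil(shape / chunk)
        counts[name] = prod(
            ceil(s / c) for s, c in zip(meta["shape"], meta["chunks"])
        )
    return counts


# Possible usage against one of the stores in this post (needs network):
#   import fsspec
#   store = fsspec.get_mapper("s3://snowmodel/swe_run_a-geo.zarr", anon=True)
#   print(chunks_per_array(store))
```

Comparing the per-array counts for the two stores would show whether the extra files in #2 belong to the main variable or to one of the coordinates.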
Here’s minimal code replicating the issue:
import fsspec
import xarray as xr

%%time
ts_ds = xr.open_zarr(
    store=fsspec.get_mapper("s3://snowmodel/swe_run_a-ts.zarr", anon=True),
    consolidated=True
)

CPU times: user 338 ms, sys: 20.1 ms, total: 358 ms
Wall time: 1.15 s

%%time
geo_ds = xr.open_zarr(
    store=fsspec.get_mapper("s3://snowmodel/swe_run_a-geo.zarr", anon=True),
    consolidated=True
)

CPU times: user 1.2 s, sys: 34.8 ms, total: 1.23 s
Wall time: 3.92 s
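In both timings above, CPU time is roughly a third of wall time, so most of each call seems to be spent waiting rather than computing. To make that split explicit outside a notebook, a small helper along these lines could be used; `timed` is a hypothetical name, and the "wait" figure is only wall time minus process CPU time, which is a rough proxy for time spent on I/O.

```python
import time
from typing import Any, Callable, Dict, Tuple


def timed(fn: Callable[[], Any]) -> Tuple[Any, Dict[str, float]]:
    """Run fn() and report wall time, process CPU time, and the difference."""
    wall_start = time.perf_counter()
    cpu_start = time.process_time()
    result = fn()
    cpu = time.process_time() - cpu_start
    wall = time.perf_counter() - wall_start
    # Wall time not accounted for by CPU work (clamped at zero, since CPU
    # time summed over threads can exceed wall time).
    return result, {"wall_s": wall, "cpu_s": cpu, "wait_s": max(wall - cpu, 0.0)}


# Possible usage with the stores from this post (needs network):
#   import fsspec, xarray as xr
#   mapper = fsspec.get_mapper("s3://snowmodel/swe_run_a-geo.zarr", anon=True)
#   geo_ds, t = timed(lambda: xr.open_zarr(store=mapper, consolidated=True))
#   print(t)
```

Timing a bare `mapper[".zmetadata"]` fetch the same way, next to the full `open_zarr` call, would indicate how much of the latency comes from the consolidated metadata request versus what xarray does afterwards.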
-  https://github.com/dask/s3fs/issues/279
-  https://github.com/dask/s3fs/issues/285. But it looks like this issue has already been addressed
Package and system versions ...
- s3fs: 0.5.1
- fsspec: 0.8.4
commit: None
python: 3.9.0 | packaged by conda-forge | (default, Oct 14 2020, 22:59:50) [GCC 7.5.0]
python-bits: 64
OS: Linux
OS-release: 5.3.0-7648-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8
libhdf5: None
libnetcdf: None
xarray: 0.16.2
pandas: 1.1.4
numpy: 1.19.4
scipy: None
netCDF4: None
pydap: None
h5netcdf: None
h5py: None
Nio: None
zarr: 2.6.1
cftime: None
nc_time_axis: None
PseudoNetCDF: None
rasterio: None
cfgrib: None
iris: None
bottleneck: None
dask: None
distributed: None
matplotlib: 3.3.3
cartopy: None
seaborn: None
numbagg: None
pint: None
setuptools: 49.6.0.post20201009
pip: 20.3.1
conda: None
pytest: None
IPython: 7.19.0
sphinx: None