Puzzling S3 xarray.open_zarr latency

I have two zarr stores on S3 representing the same data chunked differently. The dataset is straightforward: a single variable with 3 coordinate dimensions (XYT) and just 4 directory objects. Both stores are consolidated. xarray.open_zarr takes about 1 second on one of them and ~4 seconds on the other (using fsspec & s3fs 0.5.1; 30-50 seconds with s3fs 0.4.2, the version installed on the AWS Pangeo Hub base notebook image!). Yet the two stores are fairly similar in number of objects and chunks.

  1. swe_run_a-ts.zarr: ~ 1 sec latency. 697 files; 680 chunks & 41.40 MB per chunk
  2. swe_run_a-geo.zarr: ~ 4 sec latency. 927 files; 457 chunks & 54.75 MB per chunk

#2 has about 33% more files than #1 (927 vs. 697) and actually fewer chunks, yet the increase in latency is 3-5x (and up to 50x with s3fs 0.4.2!). Also, at 931 objects (927 + 4), #2 is still below the 1000-object pagination threshold pointed out elsewhere [1], which involves implicit directory listing [2].

I can live with a 4 sec latency, but it’s still mystifying why it gets so much worse for #2 relative to #1. Can anyone illuminate this? How can I tweak the chunking in #2 to keep its latency much closer to #1?

Here’s minimal code replicating the issue:

import fsspec
import xarray as xr

%%time
ts_ds = xr.open_zarr(
  store=fsspec.get_mapper("s3://snowmodel/swe_run_a-ts.zarr", anon=True),
  consolidated=True
)

CPU times: user 338 ms, sys: 20.1 ms, total: 358 ms
Wall time: 1.15 s

%%time
geo_ds = xr.open_zarr(
  store=fsspec.get_mapper("s3://snowmodel/swe_run_a-geo.zarr", anon=True),
  consolidated=True
)

CPU times: user 1.2 s, sys: 34.8 ms, total: 1.23 s
Wall time: 3.92 s
Package and system versions ...
  • s3fs: 0.5.1
  • fsspec: 0.8.4

xarray.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.9.0 | packaged by conda-forge | (default, Oct 14 2020, 22:59:50)
[GCC 7.5.0]
python-bits: 64
OS: Linux
OS-release: 5.3.0-7648-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8
libhdf5: None
libnetcdf: None

xarray: 0.16.2
pandas: 1.1.4
numpy: 1.19.4
scipy: None
netCDF4: None
pydap: None
h5netcdf: None
h5py: None
Nio: None
zarr: 2.6.1
cftime: None
nc_time_axis: None
PseudoNetCDF: None
rasterio: None
cfgrib: None
iris: None
bottleneck: None
dask: None
distributed: None
matplotlib: 3.3.3
cartopy: None
seaborn: None
numbagg: None
pint: None
setuptools: 49.6.0.post20201009
pip: 20.3.1
conda: None
pytest: None
IPython: 7.19.0
sphinx: None

Thanks for this interesting reproducible example @emiliom!

xarray.open_zarr takes about 1 second on one of them and ~ 4 seconds on the other (using fsspec & s3fs 0.5.1; 30-50 seconds with s3fs 0.4.2

It is actually fsspec.get_mapper() behind the scenes that accounts for most of the speed difference between s3fs 0.4.2 and 0.5.1: newer versions use async I/O, which can lead to a ~10x speedup when opening zarr. This is described in this post: Understanding Async - #4 by martindurant

Can anyone illuminate this? How can I tweak the chunking in #2 to keep its latency much closer to #1?

It would be great to get @rsignell's or @martindurant's insight on this. A key difference I notice without digging into details is your "object to chunk ratio" in each case (~1 for #1 vs. ~2 for #2). Perhaps get_mapper() is not just reading the metadata?

I’ll also note that a single timing of a read from S3 can sometimes be misleading. I often observe 2x variability for the same code (this is running in the same data center, as confirmed by aws s3api get-bucket-location --bucket snowmodel from aws-uswest2.pangeo.io). I ran your code a few times and observed the following timings for reading geo_ds (same versions):

CPU times: user 769 ms, sys: 20.8 ms, total: 790 ms
Wall time: 2.28 s 

CPU times: user 929 ms, sys: 69.6 ms, total: 999 ms
Wall time: 1.35 s

CPU times: user 942 ms, sys: 78.4 ms, total: 1.02 s
Wall time: 1.53 s

CPU times: user 722 ms, sys: 36.5 ms, total: 759 ms
Wall time: 1.16 s
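
For anyone reproducing this, one quick way to get a feel for that spread is to repeat the open in a loop and print each wall time. A minimal sketch (same public store as above; note that fsspec caches filesystem instances, so later iterations may be slightly warmer than a truly cold open):

import time

import fsspec
import xarray as xr

# Repeat the open a few times to see the spread in wall time.
url = "s3://snowmodel/swe_run_a-geo.zarr"
for _ in range(5):
    t0 = time.perf_counter()
    xr.open_zarr(fsspec.get_mapper(url, anon=True), consolidated=True)
    print(f"{time.perf_counter() - t0:.2f} s")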

Xarray loads coordinates eagerly.
If you do the following, you will see the calls that s3fs is actually making:

import s3fs
s3fs.core.setup_logging("DEBUG")

You will see that the first call required 9 files to be downloaded, but the second version required MANY more: hundreds of keys under “swe_run_a-geo.zarr/time”.

How did you create this data? Ideally, coordinates should not be chunked at all!
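
For example, when building a store segment by segment with xarray, one way to keep a coordinate in a single chunk is to set its chunks explicitly in encoding on the first write, assuming the final length along the append dimension is known up front. A minimal sketch with synthetic data (the variable name, sizes, and store path are made up for illustration):

import numpy as np
import pandas as pd
import xarray as xr

# Total number of time steps across all segments: assumed known in advance.
times = pd.date_range("2020-01-01", periods=100)
n_total = len(times)

def make_segment(t):
    # Build a small synthetic dataset covering the time steps in t.
    return xr.Dataset(
        {"swe": (("time", "y", "x"), np.zeros((len(t), 4, 4)))},
        coords={"time": t, "y": np.arange(4), "x": np.arange(4)},
    )

store = "example-unchunked-coords.zarr"  # local path, just for the sketch

# First segment: force the time coordinate into a single chunk via encoding.
make_segment(times[:50]).to_zarr(
    store,
    mode="w",
    encoding={"time": {"chunks": (n_total,)}},
)

# Later segments appended along time keep filling that single time chunk.
make_segment(times[50:]).to_zarr(store, mode="a", append_dim="time")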


Thanks @scottyhq! Good to know the underlying reason for the speedup. Thanks also for pointing out that relevant post.

I’ll also note that a single timing of a read from S3 can sometimes be misleading.

Right. I did perform a couple of runs before I reported those wall times; variability was ~ 50%.

Xarray loads coordinates eagerly.
If you do the following, you will see the calls that s3fs is actually making:

import s3fs
s3fs.core.setup_logging("DEBUG")

You will see that the first call required 9 files to be downloaded, but the second version required MANY more: hundreds of keys under “swe_run_a-geo.zarr/time”.

How did you create this data? Ideally, coordinates should not be chunked at all!

Thanks for that insight, @martindurant! I hadn’t thought about the impact of having chunked coordinates.

The zarr dataset was created by reading from a single large binary file in segments and writing directly to Zarr on S3, appending after writing the first segment. So I can see how the coordinates would be created in a broken-up way. However, this scheme is one I had seen in other examples of reading a large dataset from another format and writing out to Zarr in segments (e.g., this one from @rsignell). How can that be avoided? Alternatively, can an existing Zarr dataset be modified in place on S3 so the coordinates are aggregated?

FYI, the code used to create the Zarr datasets is here; see especially cells 10-18.

BTW, when I tried s3fs.core.setup_logging("DEBUG") with s3fs 0.5.1 I got this error:

AttributeError: module 's3fs.core' has no attribute 'setup_logging'
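
If setup_logging isn't available in your installed s3fs, the standard logging module should give the same effect, assuming s3fs logs under a logger named "s3fs" (worth verifying against your version):

import logging

# Assumes the logger name is "s3fs"; adjust if your version logs under
# something else (e.g. "s3fs.core").
logging.basicConfig()
logging.getLogger("s3fs").setLevel(logging.DEBUG)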

That explains a lot, I hadn’t considered that pattern.

Yes, it would be possible to consolidate the data for each coordinate into a single chunk, but there is no convenience function to do this for you right now; you'd have to interact directly with the zarr API. This could be a reasonable feature request, probably to xarray.

A convenience function might also fit in the rechunker package. I'll try to make a feature request to one of those soon.

@emiliom, thanks for bringing this up, and thanks to @martindurant for realizing that my zarr-append pattern for building up a large dataset was responsible for the chunking of the time coordinate and the slowdown.

ping @cgentemann @aimeeb because the MUR data has this same issue for the same reason, I believe.

It’s not such a big deal opening these with the latest async capabilities, but it definitely would be nice to have a function to unchunk the coordinates of a zarr dataset. Seems like that should be straightforward!


The main challenge here is that Zarr does not support in-place mutation of chunk shape. You can only create a NEW array with different chunks and then copy the coordinate data into it. So fixing these datasets would involve these steps (a rough sketch follows the list):

  • create a new array for each coord you want to rechunk within the parent group
  • copy data from old to new coord
  • remove old coord
  • rename new coord to old coord
  • (optional) reconsolidate metadata
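
A rough sketch of those steps with the zarr API, collapsing the copy/rename into delete-and-recreate since the values are held in memory (untested against these particular stores; the store URL, coordinate name, and write access are assumptions):

import fsspec
import zarr

store = fsspec.get_mapper("s3://snowmodel/swe_run_a-geo.zarr")
group = zarr.open_group(store, mode="a")

name = "time"
old = group[name]
values = old[:]                  # pull the chunked coordinate into memory
attrs = dict(old.attrs)          # keep attributes like _ARRAY_DIMENSIONS
compressor = old.compressor

del group[name]                  # remove the old, finely chunked array

new = group.create_dataset(
    name,
    data=values,
    chunks=values.shape,         # a single chunk covering the whole coordinate
    dtype=values.dtype,
    compressor=compressor,
)
new.attrs.update(attrs)

# (optional) rewrite the consolidated metadata so that
# xr.open_zarr(..., consolidated=True) sees the new array
zarr.consolidate_metadata(store)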

Thanks @rabernat. Would this functionality be a good fit as a new feature in rechunker? Then again, it looks straightforward enough that creating dedicated new functionality may be overkill.


Half a year later… are there any known updates or implementations for such a convenience function yet?