Slow append to existing s3 zarr store using xarray

I have a large dataset that has already been written, publicly available here: https://hackathon-o.s3-ext.jc.rl.ac.uk/sim-data/dev/v5/glm.n2560_RAL3p3/um.PT1H.hp_z10.zarr/.

I would like to append two new fields to this with:

import xarray as xr

# hporog and hpland are pre-existing DataArrays (orography and land fraction)
ds_static = xr.Dataset()
ds_static['orog'] = hporog.copy()
ds_static['sftlf'] = hpland.copy()
ds_static.to_zarr(zarr_store, mode='a')  # append the new variables to the existing store

zarr_store is an S3-like object store, and I'm writing to it by setting up an s3fs.S3FileSystem object - see here: wcrp_hackathon/scripts/process_um_data/um_process_tasks.py at 4838df8a93ba8dde17ecd6bacd8ef394bd7ddb50 · markmuetz/wcrp_hackathon · GitHub
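
For reference, the store setup looks roughly like this (the endpoint, bucket prefix, and credential handling below are illustrative placeholders, not the exact values from the linked script):

import s3fs

# Illustrative sketch only: the real endpoint, bucket and credentials are
# configured in the linked um_process_tasks.py.
fs = s3fs.S3FileSystem(
    anon=False,
    client_kwargs={'endpoint_url': 'https://object-store.example.ac.uk'},
)
zarr_store = s3fs.S3Map(
    root='my-bucket/sim-data/dev/v5/glm.n2560_RAL3p3/um.PT1H.hp_z10.zarr',
    s3=fs,
    check=False,
)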

However, this is absurdly slow (30 min+), and times out in my testing before any writes happen. Writing to a new store takes ~30s. I suspect it has to scan the large existing dataset first, which takes a very long time. Any help speeding this up would be much appreciated.

Try wrapping zarr_store in a LoggingStore to understand what sort of calls to the store are happening.
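
Something along these lines, as a rough sketch (this assumes a zarr-python version that provides zarr.storage.LoggingStore and that zarr_store is a store object it can wrap; adjust for your setup):

import zarr

# Wrap the existing store so every call made against it (listings, reads, writes)
# gets logged; that should show whether the append is reading lots of existing
# chunks before it ever starts writing.
logged_store = zarr.storage.LoggingStore(zarr_store, log_level='DEBUG')

ds_static.to_zarr(logged_store, mode='a')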

Thanks @rabernat - I’ll give this a shot. Presumably I’m looking for lots of accesses of existing data…

One method we've used for cases such as this (when all other dimensions align and we're simply adding a new parameter) is to upload the data straight to the cloud bucket using Rclone, to the appropriate location within the store, and then run the zarr.convenience.consolidate_metadata function, which works fairly quickly even for remote S3 buckets.
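
Roughly like this (the Rclone remote name and bucket paths here are illustrative; adapt them to your store):

# First copy the new variables' chunks and metadata straight into the store, e.g.:
#   rclone copy ./static_fields.zarr remote:my-bucket/um.PT1H.hp_z10.zarr
# Then rebuild the consolidated metadata so readers pick up the new variables:
import s3fs
import zarr

fs = s3fs.S3FileSystem(anon=False)  # endpoint/credentials as appropriate for your bucket
store = s3fs.S3Map(root='my-bucket/um.PT1H.hp_z10.zarr', s3=fs, check=False)
zarr.convenience.consolidate_metadata(store)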

Just to clarify…this behavior:

this is absurdly slow (30 min+), and times out in my testing before any writes happen. Writing to a new store takes ~30s.

is not normal. It should take ~30s to do the append. Something is wrong. There should be no need to resort to Rclone or other such workarounds.