Slow append to existing s3 zarr store using xarray

I have a large dataset that has already been written, publicly available here: https://hackathon-o.s3-ext.jc.rl.ac.uk/sim-data/dev/v5/glm.n2560_RAL3p3/um.PT1H.hp_z10.zarr/.

I would like to append two new fields to this with:

import xarray as xr

# hporog and hpland are pre-existing DataArrays (orography and land fraction)
ds_static = xr.Dataset()
ds_static['orog'] = hporog.copy()
ds_static['sftlf'] = hpland.copy()
ds_static.to_zarr(zarr_store, mode='a')  # append the new variables to the existing store

zarr_store is an S3-like object store, and I'm writing to it by setting up an s3fs.S3FileSystem object - see here: wcrp_hackathon/scripts/process_um_data/um_process_tasks.py at 4838df8a93ba8dde17ecd6bacd8ef394bd7ddb50 · markmuetz/wcrp_hackathon · GitHub
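
For reference, the store setup looks roughly like this (the endpoint, bucket prefix, and credential handling below are illustrative placeholders, not the exact values from the linked script):

import s3fs

# Illustrative sketch only: the real endpoint, bucket and credentials are
# configured in the linked um_process_tasks.py.
fs = s3fs.S3FileSystem(
    anon=False,
    client_kwargs={'endpoint_url': 'https://object-store.example.ac.uk'},
)
zarr_store = s3fs.S3Map(
    root='my-bucket/sim-data/dev/v5/glm.n2560_RAL3p3/um.PT1H.hp_z10.zarr',
    s3=fs,
    check=False,
)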

However, this is absurdly slow (30 min+), and times out in my testing before any writes happen. Writing to a new store takes ~30s. I suspect it has to scan the large existing dataset first, which takes a very long time. Any help speeding this up would be much appreciated.

Try wrapping zarr_store in a LoggingStore to understand what sort of calls to the store are happening.
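
Something along these lines, as a rough sketch (this assumes a zarr-python version that provides zarr.storage.LoggingStore and that zarr_store is a store object it can wrap; adjust for your setup):

import zarr

# Wrap the existing store so every call made against it (listings, reads, writes)
# gets logged; that should show whether the append is reading lots of existing
# chunks before it ever starts writing.
logged_store = zarr.storage.LoggingStore(zarr_store, log_level='DEBUG')

ds_static.to_zarr(logged_store, mode='a')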

Thanks @rabernat - I’ll give this a shot. Presumably I’m looking for lots of accesses of existing data…

One method we've used for cases such as this (when all other dimensions align and we're simply adding a new parameter) is to upload the data straight to the cloud bucket using Rclone, to the appropriate location within the store, and then run the zarr.convenience.consolidate_metadata function, which works fairly quickly even for remote S3 buckets.
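
Roughly like this (the Rclone remote name and bucket paths here are illustrative; adapt them to your store):

# First copy the new variables' chunks and metadata straight into the store, e.g.:
#   rclone copy ./static_fields.zarr remote:my-bucket/um.PT1H.hp_z10.zarr
# Then rebuild the consolidated metadata so readers pick up the new variables:
import s3fs
import zarr

fs = s3fs.S3FileSystem(anon=False)  # endpoint/credentials as appropriate for your bucket
store = s3fs.S3Map(root='my-bucket/um.PT1H.hp_z10.zarr', s3=fs, check=False)
zarr.convenience.consolidate_metadata(store)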

Just to clarify…this behavior:

this is absurdly slow (30 min+), and times out in my testing before any writes happen. Writing to a new store takes ~30s.

is not normal. It should take ~30s to do the append. Something is wrong. There should be no need to resort to Rclone or other such workarounds.