Benchmark Notebook for Xarray to Zarr on S3

I built an I/O benchmark notebook to see how fast we can read/write chunked xarray datasets to Zarr on S3 from EC2 in 2025. I would love feedback, suggestions, or alternate examples from the experts here!

I am interested in datasets that fit in memory, so my example does not use Dask Distributed or Coiled and just runs on a single host VM (but it could be fun to scale it up).

I looked around for the latest state of the art and decided to compare the Python store backend of Zarr-Python 3 with the Rust-based obstore. I tried each with and without gzip compression.
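Roughly, the two write paths look like this (a sketch with a placeholder bucket and keys; credentials come from the environment, zarr.storage.ObjectStore is still experimental in zarr-python 3.x, and the gzip encoding key is my assumption for the v3 format):

```python
import numpy as np
import xarray as xr
import zarr
from obstore.store import S3Store

ds = xr.Dataset(
    {"price": (("time", "asset"), np.random.rand(100_000, 500).astype("float32"))},
    coords={"time": np.arange(100_000), "asset": np.arange(500)},
)
encoding = {"price": {"chunks": (10_000, 500)}}  # chunk shape under test

# 1) Python/fsspec store backend
ds.to_zarr(
    "s3://my-bucket/bench/python.zarr",
    mode="w",
    zarr_format=3,
    encoding=encoding,
    storage_options={"anon": False},
)

# 2) Rust obstore wrapped as a zarr store (experimental wrapper)
s3 = S3Store("my-bucket", prefix="bench/obstore.zarr", region="us-east-1")
ds.to_zarr(zarr.storage.ObjectStore(s3), mode="w", zarr_format=3, encoding=encoding)

# Optional gzip (my assumption for the zarr v3 encoding key in recent xarray):
# encoding = {"price": {"chunks": (10_000, 500),
#                       "compressors": (zarr.codecs.GzipCodec(level=5),)}}
```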

I saved the notebook with the timing results in the gist above. As you would expect, the fastest write time comes from a moderate number of chunks (not too many, not too few). For both reads and writes, the more chunks you have, the bigger the difference obstore makes compared with the Python store, but the minimum time for a given dataset size still favours fairly large chunks because of the per-chunk overhead in Python. Could free-threaded Python eventually do just as well as obstore (I am using Python 3.12 with the GIL)?
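For context, the timing loop is roughly this shape (a sketch, not the exact notebook code; `store` can be either of the stores above):

```python
import time
import xarray as xr

def time_roundtrip(ds, store, chunks):
    """Write ds to store with the given chunk shape, then read it all back."""
    encoding = {name: {"chunks": chunks} for name in ds.data_vars}

    t0 = time.perf_counter()
    ds.to_zarr(store, mode="w", zarr_format=3, encoding=encoding)
    t_write = time.perf_counter() - t0

    t0 = time.perf_counter()
    xr.open_zarr(store, consolidated=False).load()  # pull everything into memory
    t_read = time.perf_counter() - t0
    return t_write, t_read
```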

A side note: my benchmark example is not a standard geo-grid with numeric coordinates (though I think the I/O benchmark is valuable regardless of your coordinate datatype). I am interested in supporting a string/char coordinate for asset_ids. This produces some fairly scary-looking warning messages (I want to be able to read the data in JavaScript - it seems to work okay with https://zarrita.dev/):

UnstableSpecificationWarning: The data type (FixedLengthUTF32(length=1, endianness='little')) does not have a Zarr V3 specification. That means that the representation of arrays saved with this data type may change without warning in a future version of Zarr Python. Arrays stored with this data type may be unreadable by other Zarr libraries. Use this data type at your own risk! Check https://github.com/zarr-developers/zarr-extensions/tree/main/data-types for the status of data type specifications for Zarr V3.
  v3_unstable_dtype_warning(self)
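One workaround I'm experimenting with (just a sketch, and the ticker labels are placeholders): keep the asset axis as integer codes, which are a spec'd Zarr v3 dtype, and stash the string labels in attrs, where they end up as plain JSON that is easy to read from JavaScript.

```python
import numpy as np
import xarray as xr

asset_ids = np.array(["AAPL", "MSFT", "GOOG"])  # placeholder labels
data = np.random.rand(10, asset_ids.size)

# Integer codes avoid the unstable fixed-length UTF32 dtype entirely.
ds = xr.Dataset(
    {"price": (("time", "asset"), data)},
    coords={"time": np.arange(10), "asset": np.arange(asset_ids.size, dtype="int32")},
)
# The labels are written as an attribute of the asset array (plain JSON in zarr.json).
ds["asset"].attrs["labels"] = asset_ids.tolist()
```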

Any advice on how to handle string/char data better in zarr is most welcome!

For anyone who uses tabular data frames to store data that could be stored as a dense array, you can see the difference in memory footprint at the bottom of the notebook - the MultiIndex of timestamp and asset_id alone is almost 5 GB for the tabular data structure that I am planning to replace with xarray/zarr.
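The comparison itself is simple to reproduce (a sketch with placeholder sizes; the ~5 GB figure comes from my real data, not this toy example):

```python
import numpy as np
import pandas as pd
import xarray as xr

times = pd.date_range("2020-01-01", periods=100_000, freq="min")
assets = [f"A{i:04d}" for i in range(100)]

# Tabular: a long frame indexed by (timestamp, asset_id)
idx = pd.MultiIndex.from_product([times, assets], names=["time", "asset"])
df = pd.DataFrame({"price": np.zeros(len(idx), dtype="float32")}, index=idx)
print("MultiIndex:", df.index.memory_usage(deep=True) / 1e6, "MB")

# Dense: one 2-D array plus two small coordinate vectors
ds = xr.Dataset(
    {"price": (("time", "asset"), np.zeros((len(times), len(assets)), dtype="float32"))},
    coords={"time": times, "asset": assets},
)
print("xarray dataset:", ds.nbytes / 1e6, "MB")
```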

Thanks for sharing. A couple quick notes:


Just a note that I was able to run this on the Pangeo-EOSC JupyterHub, hitting their S3-compatible MinIO storage; I just needed to add an endpoint: Jupyter Notebook Viewer
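For reference, the change was roughly this (a sketch; the endpoint URL is a placeholder and credentials come from the environment):

```python
import xarray as xr

ds = xr.Dataset({"x": ("i", [1.0, 2.0, 3.0])})  # stand-in dataset

# fsspec/s3fs path: point the S3 client at the MinIO endpoint
storage_options = {"client_kwargs": {"endpoint_url": "https://minio.example.org"}}
ds.to_zarr("s3://my-bucket/bench.zarr", mode="w", storage_options=storage_options)

# obstore path: the S3Store config also takes an endpoint (my assumption of the kwarg name)
# from obstore.store import S3Store
# s3 = S3Store("my-bucket", prefix="bench.zarr", endpoint="https://minio.example.org")
```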


Is there an existing benchmark with a standard configuration to test against?