Benchmark Notebook for Xarray to Zarr on S3

I built an I/O benchmark notebook to see how fast we can read/write chunked xarray datasets to Zarr on S3 from EC2 in 2025. I would love feedback, suggestions, or alternate examples from the experts here!

I am interested in datasets that fit in memory, so my example does not use Dask Distributed or Coiled and just runs on a single host VM (but it could be fun to scale it up).

I looked around for the latest state of the art and decided to compare the Python store backend of Zarr-Python 3 with the Rust-based obstore. I tried each with and without gzip compression.
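Roughly, the two write paths look like this (a sketch with a placeholder bucket and keys; credentials come from the environment, zarr.storage.ObjectStore is still experimental in zarr-python 3.x, and the gzip encoding key is my assumption for the v3 format):

```python
import numpy as np
import xarray as xr
import zarr
from obstore.store import S3Store

ds = xr.Dataset(
    {"price": (("time", "asset"), np.random.rand(100_000, 500).astype("float32"))},
    coords={"time": np.arange(100_000), "asset": np.arange(500)},
)
encoding = {"price": {"chunks": (10_000, 500)}}  # chunk shape under test

# 1) Python/fsspec store backend
ds.to_zarr(
    "s3://my-bucket/bench/python.zarr",
    mode="w",
    zarr_format=3,
    encoding=encoding,
    storage_options={"anon": False},
)

# 2) Rust obstore wrapped as a zarr store (experimental wrapper)
s3 = S3Store("my-bucket", prefix="bench/obstore.zarr", region="us-east-1")
ds.to_zarr(zarr.storage.ObjectStore(s3), mode="w", zarr_format=3, encoding=encoding)

# Optional gzip (my assumption for the zarr v3 encoding key in recent xarray):
# encoding = {"price": {"chunks": (10_000, 500),
#                       "compressors": (zarr.codecs.GzipCodec(level=5),)}}
```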

I saved the notebook with the timing results in the gist above. As you would expect, the fastest write time comes from a moderate number of chunks (not too many, not too few). For both reads and writes, the more chunks you have, the bigger the difference obstore makes compared with the Python store, but the minimum time for a given dataset size still favours fairly large chunks because of the per-chunk overhead in Python. Could free-threaded Python eventually do just as well as obstore (I am using Python 3.12 with the GIL)?
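For context, the timing loop is roughly this shape (a sketch, not the exact notebook code; `store` can be either of the stores above):

```python
import time
import xarray as xr

def time_roundtrip(ds, store, chunks):
    """Write ds to store with the given chunk shape, then read it all back."""
    encoding = {name: {"chunks": chunks} for name in ds.data_vars}

    t0 = time.perf_counter()
    ds.to_zarr(store, mode="w", zarr_format=3, encoding=encoding)
    t_write = time.perf_counter() - t0

    t0 = time.perf_counter()
    xr.open_zarr(store, consolidated=False).load()  # pull everything into memory
    t_read = time.perf_counter() - t0
    return t_write, t_read
```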

A side note: my benchmark example is not a standard geo-grid with numeric coordinates (though I think the I/O benchmark is valuable regardless of your coordinate datatype). I am interested in supporting a string/char coordinate for asset_ids. This produces some fairly scary-looking warning messages (I want to be able to read the data in JavaScript - it seems to work okay with https://zarrita.dev/):

UnstableSpecificationWarning: The data type (FixedLengthUTF32(length=1, endianness='little')) does not have a Zarr V3 specification. That means that the representation of arrays saved with this data type may change without warning in a future version of Zarr Python. Arrays stored with this data type may be unreadable by other Zarr libraries. Use this data type at your own risk! Check https://github.com/zarr-developers/zarr-extensions/tree/main/data-types for the status of data type specifications for Zarr V3.
  v3_unstable_dtype_warning(self)
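One workaround I'm experimenting with (just a sketch, and the ticker labels are placeholders): keep the asset axis as integer codes, which are a spec'd Zarr v3 dtype, and stash the string labels in attrs, where they end up as plain JSON that is easy to read from JavaScript.

```python
import numpy as np
import xarray as xr

asset_ids = np.array(["AAPL", "MSFT", "GOOG"])  # placeholder labels
data = np.random.rand(10, asset_ids.size)

# Integer codes avoid the unstable fixed-length UTF32 dtype entirely.
ds = xr.Dataset(
    {"price": (("time", "asset"), data)},
    coords={"time": np.arange(10), "asset": np.arange(asset_ids.size, dtype="int32")},
)
# The labels are written as an attribute of the asset array (plain JSON in zarr.json).
ds["asset"].attrs["labels"] = asset_ids.tolist()
```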

Any advice on how to handle string/char data better in zarr is most welcome!

For anyone who uses tabular data frames to store data that could be stored as a dense array, you can see the difference in memory footprint at the bottom of the notebook - the MultiIndex of timestamp and asset_id alone is almost 5 GB for the tabular data structure that I am planning to replace with xarray/zarr.
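The comparison itself is simple to reproduce (a sketch with placeholder sizes; the ~5 GB figure comes from my real data, not this toy example):

```python
import numpy as np
import pandas as pd
import xarray as xr

times = pd.date_range("2020-01-01", periods=100_000, freq="min")
assets = [f"A{i:04d}" for i in range(100)]

# Tabular: a long frame indexed by (timestamp, asset_id)
idx = pd.MultiIndex.from_product([times, assets], names=["time", "asset"])
df = pd.DataFrame({"price": np.zeros(len(idx), dtype="float32")}, index=idx)
print("MultiIndex:", df.index.memory_usage(deep=True) / 1e6, "MB")

# Dense: one 2-D array plus two small coordinate vectors
ds = xr.Dataset(
    {"price": (("time", "asset"), np.zeros((len(times), len(assets)), dtype="float32"))},
    coords={"time": times, "asset": assets},
)
print("xarray dataset:", ds.nbytes / 1e6, "MB")
```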

Thanks for sharing. A couple quick notes:


Just a note that I was able to run this on the Pangeo-EOSC JupyterHub, hitting their S3-compatible MinIO storage; I just needed to add an endpoint: Jupyter Notebook Viewer
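For reference, the change was roughly this (a sketch; the endpoint URL is a placeholder and credentials come from the environment):

```python
import xarray as xr

ds = xr.Dataset({"x": ("i", [1.0, 2.0, 3.0])})  # stand-in dataset

# fsspec/s3fs path: point the S3 client at the MinIO endpoint
storage_options = {"client_kwargs": {"endpoint_url": "https://minio.example.org"}}
ds.to_zarr("s3://my-bucket/bench.zarr", mode="w", storage_options=storage_options)

# obstore path: the S3Store config also takes an endpoint (my assumption of the kwarg name)
# from obstore.store import S3Store
# s3 = S3Store("my-bucket", prefix="bench.zarr", endpoint="https://minio.example.org")
```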


Is there an existing benchmark with a standard configuration to test against?