Best practice for reading zarr from s3

Specifically on async: s3fs filesystem instances are always async internally. The asynchronous argument only specifies whether you will be calling the instance from async def code - for ordinary synchronous use you do not need to set it.

When combined with zarr, multiple chunks of a given variable can be requested concurrently, so you do not pay the request latency many times over; it does not, however, improve your maximum bandwidth. Furthermore, zarr does not currently support concurrent reads across different variables. It would be nice!

Using xarray, as opposed to zarr directly, gives you indexing by coordinate labels, at the cost of eagerly loading the coordinate arrays up front. You are not using coordinate-based selection, so perhaps it is not useful to you.
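For reference, this is the kind of label-based selection that xarray buys you and plain zarr cannot do; the dataset here is a small in-memory example, not your data:

```python
import numpy as np
import xarray as xr

# Invented example dataset; yours would come from open_zarr instead.
ds = xr.Dataset(
    {"temp": (("time", "x"), np.arange(12.0).reshape(3, 4))},
    coords={"time": [10, 20, 30], "x": [0.0, 0.5, 1.0, 1.5]},
)

# Select by coordinate value rather than integer position; this is what
# eagerly loading the coordinate arrays enables.
point = ds["temp"].sel(time=20, x=1.0)
```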

You are already using dask; this is assumed by open_zarr, which is not the same default as open_dataset(engine="zarr"), where variables stay as lazy non-dask arrays unless you pass chunks.

However, it is strange that the isel() call is taking so long, and we would be interested in knowing why. There is no data loading happening at that point, but you are constructing a dask graph. Maybe passing chunks=None helps you, in which case dask has some optimisation work to do.