High-Level Guidance for Zarr chunking

Hello Pangeo Community,

We would like to develop high-level guidance to help Zarr providers and users determine a chunk size and shape. This guidance should recognize that the best chunk size and shape likely depend on the analytical use case and on the expected libraries and infrastructure.

Using the calculations of @jbusecke, I made an attempt to outline the steps in this notebook: https://github.com/developmentseed/cloud-optimized-geospatial-formats-guide/blob/feat/determine-chunking/zarr/determine-chunk-shape.ipynb. I welcome comments here or in the draft PR: Add determine-chunk-shape notebook by abarciauskas-bgse · Pull Request #31 · developmentseed/cloud-optimized-geospatial-formats-guide · GitHub

Thanks in advance.


Never forget the classics! Chunking Data: Choosing Shapes : Unidata Developer's Blog

There’s even Python code

cc @rsignell


That is a classic, but it’s useful to note that in that blog post Russ Rew assumed that one would want extracting the entire time series at a specific grid cell to take the same amount of time as extracting the entire spatial domain at a specific time step.

In the dynamic chunk determination algorithm implemented in the PR by @jbusecke cited above, you specify this ratio yourself.

In the USGS HyTEST program when we want to support both use cases, we have been picking this ratio to be much bigger, for example 20: if it takes 1 second to load a map, it takes 20 seconds to load a time series.
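To make the ratio concrete, here is a minimal sketch of how it could feed into a chunk shape. This is not the algorithm from the PR, just a closed-form calculation in the spirit of the Russ Rew post. It assumes a (time, y, x) array, that read time is roughly proportional to the number of chunks touched, and that spatial chunks keep the aspect ratio of the domain. The function names and the `target_elements` parameter are made up for illustration.

```python
import math


def chunk_shape_for_ratio(shape, target_elements, ratio=1.0):
    """Pick a (time, y, x) chunk shape for a hypothetical 3D array.

    ratio = (chunks read for a full time series at one grid cell)
            / (chunks read for a full map at one time step).
    ratio=1 corresponds to the equal-cost assumption in the blog post.
    """
    nt, ny, nx = shape
    total_chunks = (nt * ny * nx) / target_elements
    # With k chunks along each spatial axis, a map touches k**2 chunks and
    # a time series touches ratio * k**2, so total_chunks = ratio * k**4.
    k = max(1.0, math.sqrt(math.sqrt(total_chunks / ratio)))
    # Round to whole elements and clamp each chunk length to [1, axis length].
    ct = min(nt, max(1, round(nt / (ratio * k * k))))
    cy = min(ny, max(1, round(ny / k)))
    cx = min(nx, max(1, round(nx / k)))
    return ct, cy, cx


def chunks_read(shape, chunks):
    """Count chunks touched by a full time series and by a full map."""
    nt, ny, nx = shape
    ct, cy, cx = chunks
    series = math.ceil(nt / ct)
    one_map = math.ceil(ny / cy) * math.ceil(nx / cx)
    return series, one_map


shape = (1600, 400, 400)
chunks = chunk_shape_for_ratio(shape, target_elements=1_000_000, ratio=16)
print(chunks)                      # (25, 200, 200)
print(chunks_read(shape, chunks))  # (64, 4)
```

In this toy case a time series touches 16 times as many chunks as a map. For real array shapes the rounding and clamping mean the achieved ratio only approximates the requested one, so it is worth checking with `chunks_read` before committing to a shape.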

The rationale is that map users expect fast loads, since that’s traditionally the way most data is written, whether from sensors or models. But maybe slowing them down from something like 200 ms to 1 s load times is acceptable.

Time series users, on the other hand, are used to very poor performance (like 30 minutes, or failing), so for them, speeding up to 20 s load times seems like a miracle!
