High-Level Guidance for Zarr chunking

aimeeb · August 23, 2023, 2:05am

Hello Pangeo Community,

We would like to develop high-level guidance to help Zarr providers and users determine a chunk size and shape. This guidance should recognize that the right chunk size and shape likely depend on the analytical use case and on the expected libraries and infrastructure.

Using the calculations of @jbusecke, I made an attempt to outline the steps in this notebook: https://github.com/developmentseed/cloud-optimized-geospatial-formats-guide/blob/feat/determine-chunking/zarr/determine-chunk-shape.ipynb. I welcome comments here or in the draft PR: Add determine-chunk-shape notebook by abarciauskas-bgse · Pull Request #31 · developmentseed/cloud-optimized-geospatial-formats-guide · GitHub

Thanks in advance.

dcherian · August 23, 2023, 4:29am

Never forget the classics! Chunking Data: Choosing Shapes : Unidata Developer's Blog

There’s even Python code

cc @rsignell

rsignell · August 28, 2023, 8:36pm

That is a classic, but it’s useful to note that in that blog post Russ Rew assumed that extracting the entire time series at a specific grid cell should take the same amount of time as extracting the entire spatial domain at a specific time step.

In the dynamic chunk determination algorithm implemented in the PR by @jbusecke cited above, you specify this ratio.

In the USGS HyTEST program when we want to support both use cases, we have been picking this ratio to be much bigger, for example 20: if it takes 1 second to load a map, it takes 20 seconds to load a time series.

The rationale is that map users expect fast loads, since map-at-a-time is traditionally the way most data is written, whether from sensors or models. But maybe slowing them down from something like 200 ms to 1 s load times is acceptable.

Time series users, on the other hand, are used to very poor performance (like 30 minutes, or failing), so for them, getting down to 20 s load times seems like a miracle!
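
To make the ratio idea concrete, here is a minimal sketch of how one might derive a chunk shape from it. This is not the algorithm from the PR or from the Unidata blog post; it assumes a 3D (time, y, x) array, assumes read time is roughly proportional to the number of chunks touched, and keeps the spatial aspect ratio of the grid. The function names, the example grid, and the one-million-element chunk budget are all made up for illustration.

```python
import math


def chunk_shape_for_ratio(shape, chunk_elems, ratio):
    """Hypothetical helper: pick (ct, cy, cx) for a (time, y, x) array.

    ratio = (chunks touched by one full time series read)
          / (chunks touched by one full map read).
    Assumes cost is proportional to the number of chunks touched.
    """
    nt, ny, nx = shape
    total = nt * ny * nx
    # m is the number of chunks one map read touches. From
    #   nt / ct = ratio * m   and   ct * cy * cx = chunk_elems
    # it follows that m = sqrt(total / (ratio * chunk_elems)).
    m = math.sqrt(total / (ratio * chunk_elems))
    ct = nt / (ratio * m)
    s = math.sqrt(m)  # split y and x by the same factor
    ideal = (ct, ny / s, nx / s)
    # Round to whole elements and clamp to the array bounds.
    return tuple(max(1, min(n, round(c))) for n, c in zip(shape, ideal))


def chunks_touched(shape, chunks):
    """Return (time series chunk count, map chunk count)."""
    nt, ny, nx = shape
    ct, cy, cx = chunks
    return math.ceil(nt / ct), math.ceil(ny / cy) * math.ceil(nx / cx)


# Illustrative grid: one year of hourly data on a 0.25 degree global grid.
shape = (8760, 720, 1440)
chunks = chunk_shape_for_ratio(shape, chunk_elems=1_000_000, ratio=20)
ts_chunks, map_chunks = chunks_touched(shape, chunks)
print(chunks, ts_chunks, map_chunks, ts_chunks / map_chunks)
```

Rounding to whole chunk lengths means the achieved ratio only approximates the requested one, so it is worth checking with something like chunks_touched before committing to a rechunk. Setting ratio=1 corresponds to the equal-cost assumption described above.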

briannapagan · April 19, 2024, 4:46pm

@aimeeb are there new links to your work above? Interested to review it!

aimeeb · April 19, 2024, 6:24pm

Ah yes, the repo moved.

So the PR still exists but it is stale: Add determine-chunk-shape notebook by abarciauskas-bgse · Pull Request #31 · cloudnativegeo/cloud-optimized-geospatial-formats-guide · GitHub

Also here is a link to the notebook for easier readability: cloud-optimized-geospatial-formats-guide/zarr/determine-chunk-shape.ipynb at feat/determine-chunking · cloudnativegeo/cloud-optimized-geospatial-formats-guide · GitHub

Thank you for the nudge and I welcome any feedback. I will see if we can get this merged.
