I am fairly new to working with zarr and I am trying to understand how to pick an optimal chunking scheme for my use case of calculating a climatology.
I have been watching this fantastic talk by Deepak Cherian from the Pangeo Showcase, and I understand that the recommendation is to create chunks that are large in space and small in time (e.g. 1 chunk = 180 latitude × 360 longitude × 1 time).
To me, it seems like the opposite shape (e.g. 1 chunk = 1 latitude × 1 longitude × all of time) would be faster, because for a single latitude/longitude location you can calculate the climatology for every day of the year by reading just a single chunk. I think Deepak’s talk explains this by saying that anything where you access multiple values per chunk (multiple groups per block, in the talk’s phrasing) is slow, but I am having a hard time understanding why that is.
Is anyone willing to 1) confirm that I am understanding how to chunk for a climatology correctly or 2) point out where I am misunderstanding optimal chunk shape? Thanks so much in advance!
Examples, if helpful, in the form of two cloud SST datasets:
OISST - large in space, small in time. What I understand to be the recommended chunking scheme for a climatology: https://ncsa.osn.xsede.org/Pangeo/pangeo-forge/noaa_oisst/v2.1-avhrr.zarr
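For concreteness, here is a minimal sketch of the computation I have in mind and the two chunk layouts, using the OISST store above (the dimension/variable names and the open call are my assumptions and may need adjusting):

```python
import xarray as xr

# Open the cloud Zarr store lazily as dask arrays (reading over HTTP needs
# fsspec + aiohttp installed); chunks={} keeps the on-disk chunking.
ds = xr.open_dataset(
    "https://ncsa.osn.xsede.org/Pangeo/pangeo-forge/noaa_oisst/v2.1-avhrr.zarr",
    engine="zarr",
    chunks={},
)

# Layout 1 (the talk's recommendation): large in space, small in time.
spatial_chunks = ds.chunk({"time": 1, "lat": -1, "lon": -1})

# Layout 2 (what intuitively seems faster to me): small in space, all of time.
pointwise_chunks = ds.chunk({"time": -1, "lat": 1, "lon": 1})

# Note: .chunk() only rechunks the lazy dask arrays; reads from object storage
# still follow the chunks actually stored on disk.

# The climatology itself: average over all years for each day of the year.
clim = ds.sst.groupby("time.dayofyear").mean("time")
```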
The point of flox is to use strategies that work for whatever chunking is present. But you are correct: in general, it’s better to have bigger chunk sizes in time.
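As a rough sketch of what that looks like in practice (the argument values here are illustrative, and with flox installed a recent xarray will dispatch `ds.groupby(...).mean()` to flox automatically):

```python
import flox.xarray
import xarray as xr

ds = xr.open_dataset(
    "https://ncsa.osn.xsede.org/Pangeo/pangeo-forge/noaa_oisst/v2.1-avhrr.zarr",
    engine="zarr",
    chunks={},
)

# method controls the grouping strategy: "map-reduce" is the generic one,
# "cohorts" exploits the repeating day-of-year pattern when each chunk holds
# several timesteps. Which one wins depends on the chunking you actually have.
clim = flox.xarray.xarray_reduce(
    ds.sst,
    ds.time.dt.dayofyear,
    func="mean",
    method="cohorts",
)
```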
Rachel, it would be really interesting to see some benchmarks of the different approaches in terms of how many dask tasks they produce and how long they take to run on a medium-sized dataset (not MUR!).
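Something like this little helper could make that comparison concrete (purely a sketch; `result` would be any lazy, dask-backed DataArray produced by one of the approaches):

```python
import time


def profile(result, label):
    """Print task count, chunk sizes, and wall time for a lazy DataArray."""
    graph = result.__dask_graph__()
    print(f"{label}: {len(graph)} tasks, chunks={result.chunks}")
    t0 = time.perf_counter()
    result.compute()
    print(f"{label}: {time.perf_counter() - t0:.1f} s wall time")


# e.g. profile(ds.sst.groupby("time.dayofyear").mean("time"), "groupby map-reduce")
```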
Wow, thanks so much to you both! @rabernat, I went through your whole notebook this morning and I feel much more comfortable with .map_blocks(), and @dcherian, that new flox docs page looks super informative.
I’ll read over the new flox page next, and then it sounds like some benchmarking will be my next step. One of my takeaways here is that instead of trying to reason through how each chunking/compute approach would behave, I should look at the chunk sizes and the number of tasks in the dask graph. Through your examples I’m starting to get comfortable doing that.
I’ll be sure to re-post my benchmarking in case it becomes helpful for a set of crowdsourced notes. Thanks again!
FYI, this is how we (https://oikolab.com) chunk our ERA5 data: large in time, small in space. Most of our users are looking for multiple years of data for a specific location or small region (e.g. a state in the US or India), so this allows us to fetch only a small data volume for each API request.
A drawback with this is that constantly appending new data or replacing existing data becomes a hassle, and if the ‘time’ chunk is too large, fetching even a single day’s worth of data for a location is not terribly efficient. We’ve iterated quite a bit to find a chunk size that works reasonably well for the average use case.
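For what it’s worth, here is roughly what that write/append pattern looks like with xarray’s Zarr writer; the store path, variable name, and chunk sizes below are made up for illustration:

```python
import numpy as np
import pandas as pd
import xarray as xr


def make_ds(times):
    """Tiny synthetic stand-in for an ERA5-style dataset."""
    lat = np.arange(-90, 90, 10.0)
    lon = np.arange(0, 360, 10.0)
    data = np.random.rand(len(times), lat.size, lon.size).astype("float32")
    return xr.Dataset(
        {"t2m": (("time", "latitude", "longitude"), data)},
        coords={"time": times, "latitude": lat, "longitude": lon},
    )


store = "era5_demo.zarr"  # hypothetical store path

# Initial write: chunks that are large in time, small in space.
make_ds(pd.date_range("2020-01-01", periods=240, freq="h")).chunk(
    {"time": -1, "latitude": 6, "longitude": 6}
).to_zarr(store, mode="w")

# Appending the next batch along time. New data has to line up with the
# existing chunking, which is part of why frequent updates become a hassle.
# (Overwriting an existing slab instead would use to_zarr(..., region=...).)
make_ds(pd.date_range("2020-01-11", periods=24, freq="h")).to_zarr(
    store, append_dim="time"
)
```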