I am fairly new to working with zarr and I am trying to understand how to pick an optimal chunking scheme for my use case of calculating a climatology.
I have been watching this fantastic talk by Deepak Cherian from the Pangeo Showcase, and I understand that the recommendation is to create chunks that are large in space and small in time (e.g. 1 chunk = 180 latitude × 360 longitude × 1 time).
To me, it seems like the opposite shape (e.g. 1 chunk = 1 latitude × 1 longitude × all of time) would be faster, because for a single latitude/longitude location you can calculate the climatology for every day of the year by reading just a single chunk. I think Deepak’s talk explains this by saying that anything where you access multiple values per chunk (multiple groups per block, in the talk’s phrasing) is slow, but I am having a hard time understanding why that is.
Is anyone willing to 1) confirm that I am understanding how to chunk for a climatology correctly or 2) point out where I am misunderstanding optimal chunk shape? Thanks so much in advance!
Examples, if helpful, in the form of two cloud SST datasets:
OISST - large in space, small in time. What I understand to be the recommended chunking scheme for a climatology: https://ncsa.osn.xsede.org/Pangeo/pangeo-forge/noaa_oisst/v2.1-avhrr.zarr
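For concreteness, here is a minimal sketch of the computation I have in mind and the two chunk layouts, using the OISST store above (the dimension/variable names and the open call are my assumptions and may need adjusting):

```python
import xarray as xr

# Open the cloud Zarr store lazily as dask arrays (reading over HTTP needs
# fsspec + aiohttp installed); chunks={} keeps the on-disk chunking.
ds = xr.open_dataset(
    "https://ncsa.osn.xsede.org/Pangeo/pangeo-forge/noaa_oisst/v2.1-avhrr.zarr",
    engine="zarr",
    chunks={},
)

# Layout 1 (the talk's recommendation): large in space, small in time.
spatial_chunks = ds.chunk({"time": 1, "lat": -1, "lon": -1})

# Layout 2 (what intuitively seems faster to me): small in space, all of time.
pointwise_chunks = ds.chunk({"time": -1, "lat": 1, "lon": 1})

# Note: .chunk() only rechunks the lazy dask arrays; reads from object storage
# still follow the chunks actually stored on disk.

# The climatology itself: average over all years for each day of the year.
clim = ds.sst.groupby("time.dayofyear").mean("time")
```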
The point of flox is to use strategies that work for whatever chunking is present. But you are correct: in general, it’s better to have bigger chunk sizes in time.
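As a rough sketch of what that looks like in practice (the argument values here are illustrative, and with flox installed a recent xarray will dispatch `ds.groupby(...).mean()` to flox automatically):

```python
import flox.xarray
import xarray as xr

ds = xr.open_dataset(
    "https://ncsa.osn.xsede.org/Pangeo/pangeo-forge/noaa_oisst/v2.1-avhrr.zarr",
    engine="zarr",
    chunks={},
)

# method controls the grouping strategy: "map-reduce" is the generic one,
# "cohorts" exploits the repeating day-of-year pattern when each chunk holds
# several timesteps. Which one wins depends on the chunking you actually have.
clim = flox.xarray.xarray_reduce(
    ds.sst,
    ds.time.dt.dayofyear,
    func="mean",
    method="cohorts",
)
```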
Rachel, it would be really interesting to see some benchmarks of the different approaches in terms of how many dask tasks they produce and how long they take to run on a medium-sized dataset (not MUR!).
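Something like this little helper could make that comparison concrete (purely a sketch; `result` would be any lazy, dask-backed DataArray produced by one of the approaches):

```python
import time


def profile(result, label):
    """Print task count, chunk sizes, and wall time for a lazy DataArray."""
    graph = result.__dask_graph__()
    print(f"{label}: {len(graph)} tasks, chunks={result.chunks}")
    t0 = time.perf_counter()
    result.compute()
    print(f"{label}: {time.perf_counter() - t0:.1f} s wall time")


# e.g. profile(ds.sst.groupby("time.dayofyear").mean("time"), "groupby map-reduce")
```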
Wow, thanks so much to you both! @rabernat, I went through your whole notebook this morning and I feel much more comfortable with .map_blocks(), and @dcherian, that new flox docs page looks super informative.
I’ll read over the new flox page next, and then it sounds like some benchmarking will be my next step. One of my takeaways here is that instead of trying to reason through how each chunking/compute approach would behave, I should look at the chunk sizes and the number of tasks in the dask graph. Through your examples I’m starting to get comfortable doing that.
I’ll be sure to re-post my benchmarking in case it becomes helpful for a set of crowdsourced notes. Thanks again!
FYI, this is how we (https://oikolab.com) chunk our ERA5 data: large in time, small in space. Most of our users are looking for multiple years of data for a specific location or small region (e.g. a state in the US or India), so this allows us to fetch only a small data volume for each API request.
A drawback with this is that constantly appending new data or replacing existing data becomes a hassle, and if the ‘time’ chunk is too large, fetching even a single day’s worth of data for a location is not terribly efficient. We’ve iterated quite a bit to find a chunk size that works reasonably well for the average use case.
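For what it’s worth, here is roughly what that write/append pattern looks like with xarray’s Zarr writer; the store path, variable name, and chunk sizes below are made up for illustration:

```python
import numpy as np
import pandas as pd
import xarray as xr


def make_ds(times):
    """Tiny synthetic stand-in for an ERA5-style dataset."""
    lat = np.arange(-90, 90, 10.0)
    lon = np.arange(0, 360, 10.0)
    data = np.random.rand(len(times), lat.size, lon.size).astype("float32")
    return xr.Dataset(
        {"t2m": (("time", "latitude", "longitude"), data)},
        coords={"time": times, "latitude": lat, "longitude": lon},
    )


store = "era5_demo.zarr"  # hypothetical store path

# Initial write: chunks that are large in time, small in space.
make_ds(pd.date_range("2020-01-01", periods=240, freq="h")).chunk(
    {"time": -1, "latitude": 6, "longitude": 6}
).to_zarr(store, mode="w")

# Appending the next batch along time. New data has to line up with the
# existing chunking, which is part of why frequent updates become a hassle.
# (Overwriting an existing slab instead would use to_zarr(..., region=...).)
make_ds(pd.date_range("2020-01-11", periods=24, freq="h")).to_zarr(
    store, append_dim="time"
)
```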