Ah no wonder, this is bad for a map-reduce groupby. Because there is one element per group (i.e. one data point in each hour) per chunk, the blockwise reduction does nothing (input = output). Then we stitch 4 chunks together (memory use is now at least 4x the chunksize) and reduce again (back to 1x chunksize), and we keep repeating these steps until the end of the tree.
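To make the failure mode concrete, here's a minimal sketch of that chunking pattern (synthetic data and illustrative names; assumes dask and flox are installed so the groupby takes the map-reduce path described above):

```python
import numpy as np
import pandas as pd
import xarray as xr

# Hourly data: a time chunk of 24 covers each hour-of-day exactly once,
# so the per-chunk (blockwise) reduction emits 24 group results per chunk,
# i.e. exactly as many elements as it received (input == output).
time = pd.date_range("2000-01-01", periods=365 * 24, freq="h")
ds = xr.Dataset(
    {"temp": ("time", np.random.rand(time.size))},
    coords={"time": time},
).chunk({"time": 24})

# With the default "map-reduce" strategy, the tree reduction then concatenates
# 4 of these unreduced chunks at a time before combining, so peak memory is
# roughly 4x the chunk size at every level of the tree.
hourly_mean = ds.groupby("time.hour").mean()
```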
You could try “split-reduce”, the standard xarray approach, which would split each chunk into 24 new chunks and run the reduction forward from there. That is probably too large an increase in the number of tasks to work well.
I would call .chunk({"time": 6}) (a 4x reduction in chunksize) and then use method="cohorts", so we get some effective reductions early on in the graph. Obviously it would be better if the Zarr dataset were chunked that way to begin with.
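Something like this (a sketch that calls flox's xarray_reduce directly so the method can be passed explicitly; names carried over from the sketch above):

```python
import flox.xarray

# Rechunk so each chunk covers only 6 of the 24 hourly groups, then let
# "cohorts" reduce each subset of groups on the chunks where it actually lives.
rechunked = ds.chunk({"time": 6})
hourly_mean = flox.xarray.xarray_reduce(
    rechunked, rechunked.time.dt.hour, func="mean", method="cohorts"
)
```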
Basically, for time grouping where the groups are periodic with period T, you want chunksize C > T and “map-reduce”, or C < T and “cohorts”. If C ~ T then it’s just bad memory-wise (can we call C/T the flocking number?).
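As a purely illustrative rule of thumb (choose_method and the 2x / 0.5x cutoffs are made up here, not a flox API):

```python
def choose_method(chunksize: float, period: float) -> str:
    """Pick a groupby strategy from the chunksize/period ratio (the "flocking number")."""
    ratio = chunksize / period
    if ratio >= 2:       # chunks span multiple full periods: the blockwise step reduces well
        return "map-reduce"
    if ratio <= 0.5:     # chunks see only a few groups each: cohorts keeps them separate
        return "cohorts"
    return "rechunk first"  # C ~ T: neither strategy saves memory
```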