Optimizing climatology calculation with Xarray and Dask

I am currently crunching through 20 years of random hourly data, chunked roughly like the initial dataset, on my laptop (16 GB RAM) with “map-reduce”, and it works! It’s slow but memory is stable. What’s interesting is that groupby_nanmean-chunk (the first row under Progress) isn’t fused with random_sample, yet that initial blockwise reduction is still executed often enough for memory to stay stable.
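
Roughly, the kind of thing I’m running looks like this (a minimal sketch: the shapes, chunk sizes, and the use of flox for the grouped reduction are illustrative, not the exact setup):

```python
import pandas as pd
import xarray as xr
import dask.array as da
from flox.xarray import xarray_reduce

# 20 years of synthetic hourly data, chunked along time roughly like the
# original dataset (about a month of hours per chunk here).
time = pd.date_range("2000-01-01", "2019-12-31 23:00", freq="H")
ds = xr.Dataset(
    {
        "var": (
            ("time", "lat", "lon"),
            da.random.random((time.size, 180, 360), chunks=(744, 180, 360)),
        )
    },
    coords={"time": time},
)

# Day-of-year climatology with the map-reduce strategy: each chunk is
# reduced blockwise first (the groupby_nanmean-chunk tasks), then the
# per-chunk partial results are combined in a tree reduction.
clim = xarray_reduce(ds, ds.time.dt.dayofyear, func="nanmean", method="map-reduce")
clim.compute()
```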



So I still think there’s something about the I/O that’s making things worse in the cloud. Maybe setting inline_array=True will help?
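
Something like this is what I have in mind (the store path is a placeholder, and this assumes an xarray recent enough to accept inline_array in open_dataset and pass it through to dask):

```python
import xarray as xr

ds = xr.open_dataset(
    "gs://my-bucket/my-dataset.zarr",  # placeholder path
    engine="zarr",
    chunks={},          # keep the on-disk chunking
    inline_array=True,  # inline array creation into each chunk's task
                        # instead of one root task every chunk depends on
)
```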

These are my software versions:

pandas: 1.3.5
xarray: main
numpy : 1.21.5
dask  : 2022.3.0

This is how I set up distributed:

from dask.distributed import Client

# Set up a local cluster. By default Client() creates one worker per core;
# here I pin it to 4 single-threaded workers with 3 GiB of memory each
# (so 12 GiB total on a 16 GB laptop).
client = Client(memory_limit="3 GiB", threads_per_worker=1, n_workers=4)
client.cluster

This mimics what I would do with netCDF files in an HPC setting.


EDIT: This did end up finishing, though my laptop went to sleep in the middle.
