I am currently crunching through 20 years of random hourly data, chunked somewhat like the initial dataset, on my laptop (16 GB RAM) with “map-reduce”, and it works! It’s slow, but memory is stable. What’s interesting is that groupby_nanmean-chunk (the first row under Progress) isn’t fused with random_sample, yet that initial blockwise reduction is still executed frequently enough for memory to stay stable.
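For concreteness, the computation looks roughly like this — a minimal sketch only: the grid shape, chunk sizes, and the dayofyear grouping are illustrative, and I’m assuming the reduction goes through flox (that’s where the groupby_nanmean task names come from):

import pandas as pd
import xarray as xr
import dask.array as da
import flox.xarray

# ~20 years of hourly timestamps
time = pd.date_range("2000-01-01", "2019-12-31 23:00", freq="H")

# Random data, chunked along time (shape and chunk sizes are made up here)
ds = xr.Dataset(
    {
        "var": (
            ("time", "y", "x"),
            da.random.random((time.size, 100, 100), chunks=(24 * 30, 100, 100)),
        )
    },
    coords={"time": time},
)

# "map-reduce": reduce within each chunk first (blockwise), then combine
# the per-chunk intermediates, so peak memory stays roughly constant.
result = flox.xarray.xarray_reduce(
    ds, ds["time"].dt.dayofyear, func="nanmean", method="map-reduce"
)
result.compute()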
So I still think there’s something about the I/O that’s making things worse in the cloud. Maybe setting inline_array=True will help?
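If the cloud data is Zarr, inline_array can be passed through xr.open_dataset / xr.open_zarr in recent xarray (it’s forwarded to dask.array.from_array); it embeds the backing array in each task rather than referencing it through a separate graph key, which can change how the read tasks fuse with the blockwise reduction. The store path below is hypothetical:

import xarray as xr

ds = xr.open_dataset(
    "gs://some-bucket/hourly.zarr",  # hypothetical store
    engine="zarr",
    chunks={},            # keep the on-disk chunking
    inline_array=True,    # inline the array in the dask graph; may help fusion
)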
These are my software versions:
pandas: 1.3.5
xarray: main
numpy : 1.21.5
dask : 2022.3.0
This is how I set up distributed:
from dask.distributed import Client
# Set up a local cluster. By default this creates 1 worker per core;
# here I explicitly ask for 4 single-threaded workers with a 3 GiB
# memory limit each (12 GiB total on a 16 GB laptop).
client = Client(memory_limit="3 GiB", threads_per_worker=1, n_workers=4)
client.cluster
so this mimics what I would do with netCDF in an HPC setting.
EDIT: This did end up finishing, though my laptop went to sleep in the middle.