I am currently crunching through 20 years of random hourly data, chunked somewhat like the initial dataset, on my laptop (16 GB RAM) with “map-reduce”, and it works! It’s slow, but memory is stable. What’s interesting is that groupby_nanmean-chunk (the first row under Progress) isn’t fused with random_sample, yet that initial blockwise reduction is still executed frequently enough for memory to stay stable.
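For concreteness, the computation looks roughly like this — a minimal sketch only: the grid shape, chunk sizes, and the dayofyear grouping are illustrative, and I’m assuming the reduction goes through flox (that’s where the groupby_nanmean task names come from):

import pandas as pd
import xarray as xr
import dask.array as da
import flox.xarray

# ~20 years of hourly timestamps
time = pd.date_range("2000-01-01", "2019-12-31 23:00", freq="H")

# Random data, chunked along time (shape and chunk sizes are made up here)
ds = xr.Dataset(
    {
        "var": (
            ("time", "y", "x"),
            da.random.random((time.size, 100, 100), chunks=(24 * 30, 100, 100)),
        )
    },
    coords={"time": time},
)

# "map-reduce": reduce within each chunk first (blockwise), then combine
# the per-chunk intermediates, so peak memory stays roughly constant.
result = flox.xarray.xarray_reduce(
    ds, ds["time"].dt.dayofyear, func="nanmean", method="map-reduce"
)
result.compute()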
So I still think there’s something about the I/O that’s making things worse in the cloud. Maybe setting inline_array=True will help?
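If the cloud data is Zarr, inline_array can be passed through xr.open_dataset / xr.open_zarr in recent xarray (it’s forwarded to dask.array.from_array); it embeds the backing array in each task rather than referencing it through a separate graph key, which can change how the read tasks fuse with the blockwise reduction. The store path below is hypothetical:

import xarray as xr

ds = xr.open_dataset(
    "gs://some-bucket/hourly.zarr",  # hypothetical store
    engine="zarr",
    chunks={},            # keep the on-disk chunking
    inline_array=True,    # inline the array in the dask graph; may help fusion
)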
These are my software versions:
pandas: 1.3.5
xarray: main
numpy : 1.21.5
dask : 2022.3.0
This is how I set up distributed:
from dask.distributed import Client
# Set up a local cluster. By default this creates 1 worker per core;
# here I explicitly ask for 4 single-threaded workers with a 3 GiB
# memory limit each (12 GiB total on a 16 GB laptop).
client = Client(memory_limit="3 GiB", threads_per_worker=1, n_workers=4)
client.cluster
so this mimics what I would do with netCDF in an HPC setting.
EDIT: This did end up finishing, though my laptop went to sleep in the middle.