Not sure if this is the right forum for this question, but I’m giving it a try:
My biggest problem with xarray/dask, as a scientist (not a computer scientist), is that the time saved by writing minimal code to solve a science problem is often lost trying to figure out how to scale it so it can actually run. Here is an example:
I want to generate a map of the maximum daily air temperature anomaly at each location from daily ERA5 data. The wonderfully simple code for this:
import xarray as xr

nyears = 73
path = "/data/axel0/era5/ncar/daily/inst/*/*.nc"
ds = xr.open_mfdataset(path, chunks={"time": 366 * nyears, "latitude": 200, "longitude": 200}, parallel=True)
# anomaly of each day relative to its day-of-year climatology
anom = ds.groupby("time.dayofyear") - ds.groupby("time.dayofyear").mean(dim="time")
# maximum anomaly over the whole record at each grid point
maxv = anom.VAR_2T.max(dim="time")
There are 73 files (one per year) in that path, each containing a single variable VAR_2T (plus coordinates).
I get an error of “unable to allocate 102 GB” for an array of shape (26267, 721, 1440)…, even though each year amounts to only about 1.5 GB of uncompressed data (366 × 721 × 1440 × 4 bytes).
OK, so this apparently won’t fit on my machine with 96 GB and 16 cores (though I don’t really understand why, because I thought dask would handle this by smartly splitting the problem into workable chunks?). The next step, reducing the problem to only the files from 2010–2022, sort of works but is very slow (it runs overnight).
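For reference, my back-of-envelope arithmetic (these are my own numbers worked out from the shapes above, not anything dask reports), which also makes me notice that each chunk I asked for is itself several GB:

# rough size arithmetic for the float32 VAR_2T array (my numbers, not dask's)
nbytes = 4                          # float32
full = 26267 * 721 * 1440 * nbytes  # whole record: ~102 GiB, matching the error
year = 366 * 721 * 1440 * nbytes    # one year: ~1.5 GB uncompressed
chunk = 26267 * 200 * 200 * nbytes  # one chunk as specified above: ~4.2 GB
print(f"{full / 2**30:.0f} GiB total, {year / 1e9:.1f} GB per year, {chunk / 1e9:.1f} GB per chunk")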
So I thought I’d try to speed this up a bit by setting up a local cluster and running it in parallel:
from dask.distributed import Client
client = Client(processes=False, memory_limit="10GB", n_workers=8)
(Note: I also tried a distributed cluster on the local machine, roughly as sketched below, but that fails completely.)
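Roughly what that attempt looked like, reconstructed from memory (so the exact parameters here are illustrative):

from dask.distributed import LocalCluster, Client

# separate worker processes instead of threads (parameters from memory, illustrative)
cluster = LocalCluster(n_workers=8, threads_per_worker=2, memory_limit="10GB")
client = Client(cluster)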
Either way, no luck. I get a lot of chatter, and I can watch the entertaining (but, to me, inscrutable) dask status monitor, but eventually some of the workers die (something about losing their connection), I suppose because the machine gets overloaded. (Note that I’ve played around with chunk settings, as sketched below, and the resulting chunks seem to be different from what I am setting, but I don’t seem to be able to get much improvement.)
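For concreteness, one of the chunking variations I have tried (the tile sizes are illustrative, not something I am confident in): keep the full time axis in each chunk but use much smaller spatial tiles, aiming for chunks around 100 MB rather than several GB:

# same open_mfdataset call, but targeting ~95 MB chunks instead of ~4 GB ones
ds = xr.open_mfdataset(
    path,
    chunks={"time": -1, "latitude": 30, "longitude": 30},  # -1 = whole time axis in one chunk
    parallel=True,
)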
While this is a big problem, it certainly doesn’t seem too big for my system, and I could break it down into smaller pieces myself (with or without xarray; see the sketch below), but the question is: should I really have to? Isn’t this what xarray was (also) made for? It would be helpful if there were some guidance on “sizing” a problem, what to expect, and maybe a set of recipes of things to try to make it fit. Maybe this exists, or maybe I’m asking too much?
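For concreteness, the kind of manual piecewise fallback I have in mind (a sketch only; the band size is arbitrary):

# loop over latitude bands, reduce each one, then stitch the results back together
ds = xr.open_mfdataset(path, parallel=True)
pieces = []
for lat0 in range(0, ds.sizes["latitude"], 100):  # 100-row latitude bands
    sub = ds.isel(latitude=slice(lat0, lat0 + 100))
    anom = sub.groupby("time.dayofyear") - sub.groupby("time.dayofyear").mean(dim="time")
    pieces.append(anom.VAR_2T.max(dim="time").compute())
maxv = xr.concat(pieces, dim="latitude")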
(This is using Python 3.10, dask 2022.12.1, and xarray 2022.12.0.)
Best
Axel