Thanks Ryan, this is very helpful! Though I think I have come away with my own rule of thumb: if Ryan needs to work through a number of iterations to make this work and finds it challenging, then what chance do I have? (Better to resort to coding this in manageable chunks if I need to get something done, rather than experimenting.) There seem to be just way too many permutations of cluster configuration, chunking, etc. to warrant the effort, unless it’s for learning purposes. But since I’ve invested the time and I’m curious, I might as well try to get a little further:
My data on disk appears to be chunked by month along the time dimension, with no chunking on the lat/lon dimensions. Since I wrote the daily-average files myself, I guess I could change that.
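If I ever rewrite those files, I believe the on-disk chunking can be set through the netCDF encoding, something like the sketch below (untested; `ds_daily` stands in for whatever dataset I’d be writing, and the chunk sizes are placeholders, not tuned values):

```python
# Untested sketch: choose on-disk (HDF5) chunk sizes when writing the daily files.
# "ds_daily" and the chunk sizes are placeholders.
ds_daily.to_netcdf(
    "daily_avg_rechunked.nc",
    engine="h5netcdf",
    encoding={"VAR_2T": {"chunksizes": (30, 100, 100)}},  # (time, lat, lon)
)
```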
I did find one problem that makes a huge difference: a subset of the files had an additional variable, utc_date.
ds = xr.open_mfdataset(path, parallel=True, engine='h5netcdf', drop_variables='utc_date')
now loads the entire 73-year dataset fairly quickly.
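As an aside, a quick scan along these lines can identify which files carry the stray variable (a sketch; the glob pattern stands in for my real paths):

```python
# Sketch: print the files that contain the extra "utc_date" variable.
# The glob pattern is a placeholder, not my actual directory layout.
import glob
import xarray as xr

for f in sorted(glob.glob("/path/to/era5/*.nc")):
    with xr.open_dataset(f, engine="h5netcdf") as single:
        if "utc_date" in single.variables:
            print(f)
```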
Your “benchmark” command

m = ds.mean(dim="time")
m.compute()
finishes in about 2 minutes on 8 workers with 10 GB each. No rechunking, just letting xarray/Dask do whatever it thinks best. That’s pretty cool.
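For context, the cluster was a local dask.distributed setup roughly like this (a sketch of the configuration, not my exact launch code):

```python
# Sketch: a local Dask cluster matching "8 workers with 10 GB each".
from dask.distributed import Client, LocalCluster

cluster = LocalCluster(n_workers=8, memory_limit="10GB")
client = Client(cluster)
```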
Next, the full daily climatology:

ds.groupby('time.dayofyear').mean(dim='time').to_netcdf('daily_avg.nc')

This runs for about 25 minutes and looks almost finished, but then it locks up the system. OK, on to flox.
import flox.xarray

method = "cohorts"
tpm = flox.xarray.xarray_reduce(ds.VAR_2T, ds.time.dt.dayofyear, func="mean", method=method)
tpm.to_netcdf("daily_avg_flox.nc")
This finishes in 8 minutes. The flox documentation at https://flox.readthedocs.io/en/latest/user-stories/climatology.html has some material on optimizing this further, but this is good enough for me.
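The gist of those docs, as I read them, is that "cohorts" pays off when the time chunking lines up with how the groups repeat, so rechunking along time first might help, e.g. (the chunk size below is just a guess on my part, not a recommendation from the docs):

```python
# Sketch: rechunk along time before the flox reduction so dayofyear groups
# recur in predictable chunks, which is what "cohorts" exploits.
# The size is a placeholder, not a tuned value.
ds = ds.chunk({"time": 4 * 31})
```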
Now I am stuck on how to calculate my anomalies, though. When I do
ds = xr.open_mfdataset(path, parallel=True, engine='h5netcdf', drop_variables='utc_date')
tpm = xr.open_dataset('daily_avg_flox.nc')
anom = ds.groupby("time.dayofyear") - tpm
executing this cell in my notebook just shows the kernel as busy, but the Dask dashboard doesn’t show any tasks or any progress. Load on the system goes way up, and total memory usage climbs to the maximum available on the system (96 GB). It appears to be computing this without Dask’s lazy evaluation? Totally confused now.
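One idea I may try next, in case the groupby arithmetic itself is what blows up: broadcast the climatology onto the time axis by vectorized indexing with dayofyear, then subtract directly, which should stay an ordinary lazy Dask operation (an untested sketch; names follow the snippets above):

```python
# Untested sketch: avoid groupby arithmetic by expanding the climatology
# along time via vectorized indexing, then subtracting directly.
clim = xr.open_dataset("daily_avg_flox.nc")["VAR_2T"]
clim_on_time = clim.sel(dayofyear=ds["time"].dt.dayofyear)  # now has a time dimension
anom = ds["VAR_2T"] - clim_on_time
anom.to_netcdf("anom.nc")
```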