Update on the rechunked version
Rechunking Configuration
import os

import gcsfs
import zarr
from rechunker import rechunk

url = f"{os.environ['SCRATCH_BUCKET']}/ERA5_HiRes_Hourly.zarr"
zg = zarr.open_group(url)  # source group to be rechunked

# fresh GCS filesystem handle with listing caches disabled,
# so repeated runs don't see stale directory listings
fs = gcsfs.GCSFileSystem(skip_instance_cache=True, use_listings_cache=False)

temp_path = f"{os.environ['SCRATCH_BUCKET']}/ERA5_HiRes_Hourly_rechunk/temp.zarr"
target_path = f"{os.environ['SCRATCH_BUCKET']}/ERA5_HiRes_Hourly_rechunk/target.zarr"
temp_store = zarr.storage.FSStore(temp_path)
target_store = zarr.storage.FSStore(target_path)

target_chunks = {
    'tp': (7305, 103, 10),  # (time, latitude, longitude)
    'time': None,       # None = leave this array's chunks unchanged
    'longitude': None,
    'latitude': None,
}
max_mem = '8GB'  # per-worker memory budget for intermediate chunks

r = rechunk(zg, target_chunks, max_mem, target_store, temp_store=temp_store)
# executed on a 20-worker dask cluster w/ 40 GB memory each
r.execute()
This takes a long time (like an hour), but it is pretty stable and reliable.
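One optional follow-up: the new store has no consolidated metadata (which is why I pass consolidated=False when opening it below). If you want faster opens, you can consolidate it yourself; a one-liner, assuming the rechunk completed successfully:

# write consolidated metadata into the target store
zarr.consolidate_metadata(target_store)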
Now do flox groupby with map-reduce
I open the data and specify even longer chunks in time, so that the time axis is completely contiguous (a single chunk).
import xarray as xr

target_path = f"{os.environ['SCRATCH_BUCKET']}/ERA5_HiRes_Hourly_rechunk/target.zarr"
dsr = xr.open_dataset(
    target_path, engine="zarr", consolidated=False,
    chunks={'time': -1, 'latitude': 103, 'longitude': 10},  # time=-1: one chunk spanning the whole axis
)
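Before running the groupby, it's worth confirming that the time axis really is a single chunk, since that layout is what makes the map-reduce path work well. A quick check:

# chunks is a per-dimension tuple of chunk sizes for the dask-backed array;
# the time entry should be a single value spanning the whole axis
print(dsr.tp.chunks)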
The groupby now executes flawlessly with map-reduce.
import flox.xarray

method = "map-reduce"
tpmr = flox.xarray.xarray_reduce(
    dsr.tp, dsr.time.dt.hour,
    func="mean",
    method=method,
)
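Note that xarray_reduce just builds the dask graph; the heavy lifting happens when the result is computed. For example:

# triggers the map-reduce computation; the result has 24 entries,
# one mean precipitation field per hour of the day
hourly_means = tpmr.compute()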
I used the same cluster configuration with 40 GB of memory per worker, but memory usage never rose above about 10 GB per worker, meaning I could have gotten by with much cheaper workers.
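For reference, the cluster itself was set up along these lines. This is only a sketch using dask-gateway (what Pangeo Cloud provides); the available options and their names, e.g. worker_memory, depend on your deployment:

from dask_gateway import Gateway

gateway = Gateway()
options = gateway.cluster_options()
options.worker_memory = 40   # GB per worker -- option name is deployment-specific
cluster = gateway.new_cluster(options)
cluster.scale(20)            # 20 workers, matching the rechunking run
client = cluster.get_client()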
This is what my dask task stream looked like. It took about 4 minutes to process 678 GB of data.
So I think this illustrates two key takeaways:
- Rechunking is really, really helpful for solving these performance problems. That was the outcome of the epic discussion in "Best practices to go from 1000s of netcdf files to analyses on a HPC cluster?", which led to the creation of the rechunker package.
- @dcherian’s flox package performs flawlessly when given properly chunked data.
Neither of these packages existed two years ago. I am tempted to say we have nailed it. However, I think there is more work to do to make rechunker faster and more memory efficient.
@fmaussion - you should try rechunking the data on your HPC system and see how it goes.