Xarray MemoryError with groupby workloads

Julius makes a key observation: if the quantile is indeed calculated on a per-year basis, then this is a much simpler problem.

First, construct a dummy dataset

I’ve assumed that a month of daily data is in a single netCDF file:

import dask.array
import numpy as np
import xarray as xr

ds = xr.Dataset(
    {
        "pr": (
            ("time", "lat", "lon"),
            dask.array.random.random(
                (730487, 180, 360),
                chunks=(30, -1, -1),  # assuming one month per file
            ),
        )
    },
    coords={"time": xr.date_range("0001-01-01", "2000-12-31", freq="D")},
)
ds
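
As a quick sanity check (not in the original post), you can confirm the chunk layout and total size before going any further:

# each chunk holds one month: 30 × 180 × 360 float64 values, roughly 15 MB
print(ds.pr.data.chunksize)            # (30, 180, 360)
# the full array is far too large for memory, which is why chunking matters
print(f"{ds.pr.nbytes / 1e9:.0f} GB")  # ~379 GB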

Now rechunk to a yearly frequency

This is the key observation. Since you’re operating on a year of data at a time, you don’t need to rechunk to time=-1; instead, you can rechunk so that each year of data sits in a single chunk. This rechunking is required because quantile cannot be computed blockwise: each group’s full time axis has to live in a single chunk.

# rechunk to yearly frequency
from xarray.groupers import TimeResampler

ds = ds.chunk({"time": TimeResampler(freq="AS")})
ds

^ EDIT: Use rechunking with a TimeResampler object.
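
If you’re on an older xarray without TimeResampler, a hedged alternative (not from the original post) is to count the days in each calendar year and pass those counts as explicit chunk sizes:

import numpy as np

# one chunk per calendar year, sized by that year's number of days;
# np.unique returns years in sorted order, matching the monotonic time axis
_, days_per_year = np.unique(ds.time.dt.year.values, return_counts=True)
ds = ds.chunk({"time": tuple(days_per_year)})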

Now do your groupby.map

# mask values below the threshold (keeps values >= 1, sets the rest to NaN)
ds = ds.where(ds >= 1)
# per year: keep only values above that year's 95th percentile, then sum over time
result = ds.groupby("time.year").map(
    lambda group: group.where(group > group.quantile(q=0.95)).sum("time")
)
result
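
For reference (my sketch, not part of the original answer), the same per-year computation can be expressed as a resample over annual bins; for full calendar years the grouping is identical, the only difference being that the output is indexed by a datetime coordinate instead of an integer year:

# equivalent formulation with resample; "AS" matches the rechunk frequency above
alt = ds.resample(time="AS").map(
    lambda group: group.where(group > group.quantile(q=0.95)).sum("time")
)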

This works quite well!

result.compute()
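
To watch progress while it runs (an optional addition, not from the original post), you can wrap the computation in dask’s local progress bar:

from dask.diagnostics import ProgressBar

# run the graph on the local scheduler, printing a progress bar as tasks finish
with ProgressBar():
    computed = result.compute()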
