Julius makes a key observation: if the quantile is indeed calculated on a per-year basis, then this is a much simpler problem.
First construct a dummy dataset
I’ve assumed that a month of daily data is in a single netCDF file:
import dask.array
import numpy as np
import xarray as xr
ds = xr.Dataset(
    {
        "pr": (
            ("time", "lat", "lon"),
            dask.array.random.random(
                (730487, 180, 360),
                chunks=(30, -1, -1),  # assuming one month per file
            ),
        )
    },
    coords={"time": xr.date_range("0001-01-01", "2000-12-31", freq="D")},
)
ds
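For scale, this dummy array is roughly 379 GB split into month-sized chunks. As an optional sanity check of the layout before rechunking (a sketch, assuming the ds built above):

# ~47 billion float64 values, roughly 379 GB in total
print(f"{ds.pr.nbytes / 1e9:.0f} GB")

# 30-day chunks over the full spatial domain, one per "file"
print(ds.pr.data.chunksize)   # (30, 180, 360)
print(ds.pr.data.numblocks)   # number of chunks along (time, lat, lon)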
Now rechunk to a yearly frequency
This is the key observation. Since you're operating on one year of data at a time, you don't need to rechunk to time=-1; instead you can rechunk so that each year of data is in a single chunk. This rechunking is required because quantile needs all the values it reduces over to be in a single chunk.
# rechunk to yearly frequency
from xarray.groupers import TimeResampler
ds = ds.chunk({"time": TimeResampler(freq="AS")})
ds
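If your xarray version doesn't have TimeResampler yet, you can get the same layout by passing explicit per-year chunk sizes instead. A minimal sketch, assuming the ds built above:

# Fallback: count the days in each year and chunk time accordingly,
# so each chunk still holds exactly one calendar year
days_per_year = ds.time.groupby("time.year").count().values
ds = ds.chunk({"time": tuple(days_per_year)})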
Now do your groupby-map
# keep only values >= 1 (e.g. wet days, if pr is in mm/day)
ds = ds.where(ds >= 1)

# per year: mask out values not exceeding that year's 95th percentile,
# then sum what remains over time
result = ds.groupby("time.year").map(
    lambda group: group.where(group > group.quantile(q=0.95)).sum("time")
)
result
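result is still lazy at this point; each year collapses to a single (lat, lon) field, so you end up with one map per year:

# one spatial field per year: dims (year, lat, lon), shape (2000, 180, 360)
print(result.pr.dims)
print(result.pr.shape)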
This works quite well!
result.compute()
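On a single machine it can be handy to watch the computation as it runs; dask's local ProgressBar works for this (purely optional):

# show a progress bar while dask executes the graph locally
from dask.diagnostics import ProgressBar

with ProgressBar():
    computed = result.compute()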