I am working with piControl data from CMIP6, handling 2000 years of daily data. While performing some analysis, I encountered a memory error and am unable to write the result to a NetCDF file. The error message is as follows:
“MemoryError: Unable to allocate 176. GiB for an array with shape (730485, 180, 360) and data type float32”
To clarify, after the final calculation, there is only one value per year. Therefore, the final output shape would be (2000, 180, 360).
Any assistance or suggestions to resolve this issue would be greatly appreciated.
I suspect that's because at some point during your calculations you accidentally load the entire dataset into memory.
To verify, try computing a single variable of the output (something like result["variable"].compute()). If this succeeds, you know that something goes wrong when writing to disk. However, if it fails (as I suspect), the issue is with the way the computation is implemented.
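For example, a minimal sketch of the kind of check I mean (result and "variable" are placeholders for whatever your objects are actually called):

# force one output variable to compute without writing anything to disk
single = result["variable"].compute()
# if this line is reached, the computation itself fits in memory
print(single.shape, single.nbytes / 1e9, "GB")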
Either way, it would be great if you could post a (trimmed down) version of the code, as otherwise all we can do is guess.
The calculation runs in about 30 seconds up to the point where the yearly_r95ptot variable is created; the memory error only occurs when I try to write the variable to a NetCDF file.
Right now it seems like you are generating very large arrays due to broadcasting in the group > pn call?
I recommend inspecting each group manually, and I bet that group.where(group > pn) has the shape of the original array. In that case you would be trying to load size(wet_days) * n_years values into memory.
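For example, something like this to look at the first group (a sketch on my side; it assumes wet_days is your daily data and that pn carries a year dimension from the per-year quantile):

# pull out the first yearly group and look at the shapes involved
year, group = next(iter(wet_days.groupby("time.year")))
print(group.shape)                    # one year of daily data
print((group > pn).shape)             # broadcasts against the year dimension of pn
print(group.where(group > pn).shape)  # same inflated shape, n_years times the group size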
@chaithra could you provide the particular model, grid, and other facets of the CMIP output you are working with? This would help us look for the same data in the cloud and test this.
Perhaps you could just print the attributes of the dataset object you get from your files here?
This is the key observation. Since you're operating on a year of data at a time, you don't need to rechunk to time=-1; instead you can rechunk so that each year of data is in a single chunk. This rechunking is required for computing the quantile.
# rechunk so that each chunk holds exactly one year of daily data
from xarray.groupers import TimeResampler

ds = ds.chunk({"time": TimeResampler(freq="AS")})
ds  # inspect the new chunking
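With one year per chunk, the yearly threshold can then be computed without ever rechunking to time=-1. A rough sketch (assuming the precipitation variable is called pr and using 1 mm/day as an illustrative wet-day cutoff):

# keep only wet days; 1 mm/day is just an illustrative cutoff
wet_days = ds["pr"].where(ds["pr"] >= 1.0)
# each yearly group now lives in a single chunk, so the per-year quantile is cheap
pn = wet_days.groupby("time.year").quantile(0.95, dim="time")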
I think it's worth asking: what could we do at the API level to make things “just work” more automatically? What @chaithra wants to do is pretty standard, yet the tricks required to make it succeed are obscure and only obvious to experts.
I would like to get your suggestions on improving the performance of my calculations.
To my understanding, your code uses yearly 95th percentile values as a threshold. I want to compute the climatological 95p values instead of the yearly values. Please see the attached snapshot. Unfortunately, the process is significantly slower than expected.
Your screenshot with chunksize=(182621, 180, 360) is concerning. That's only 4 chunks in the whole array, which limits the amount of parallelism you can use (with dask at least). Where is this chunksize coming from? How many days, months, or years of data are stored in a single netCDF file?
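In case it helps, two hedged sketches (the file pattern, variable name, and chunk sizes below are illustrative, not taken from your setup): you can request smaller time chunks when opening the files rather than inheriting one huge chunk per file, and for a climatological quantile, which reduces over the whole time axis, you can chunk in space instead so each piece stays modest.

import xarray as xr

# request smaller time chunks at open time instead of inheriting one chunk per file
ds = xr.open_mfdataset("pr_day_*.nc", chunks={"time": 365})

# a climatological quantile reduces over the whole record, so time must be a
# single chunk; chunking in space keeps each piece around a few hundred MB
# (the rechunk itself is the expensive step here)
wet_days = ds["pr"].where(ds["pr"] >= 1.0)
clim_p95 = (
    wet_days
    .chunk({"time": -1, "lat": 10, "lon": 10})
    .quantile(0.95, dim="time")
)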
When does setting chunks get materialized? It confuses me that this gets set at read time or as a property of an object (I can't figure out how to do this properly for a multi-file input where I want different chunk sizes in a Zarr store I create). I would expect it to be a property set at write time, and that doesn't seem to be the case. Is this described somewhere in the examples? (I'm struggling with the documentation about what the right thing to do is. Thanks!)