Xarray MemoryError: Unable to allocate memory

Hi,

I am working with piControl data from CMIP6 and handling 2000 years of daily data. While performing some analysis, I encountered memory error and am unable to write the result to a NetCDF file. The error message I received is as follows:

“MemoryError: Unable to allocate 176. GiB for an array with shape (730485, 180, 360) and data type float32”

To clarify, after the final calculation, there is only one value per year. Therefore, the final output shape would be (2000, 180, 360).

Any assistance or suggestions to resolve this issue would be greatly appreciated.

Thank you.

I suspect that’s because at some point during your calculations you accidentally load the entire data into memory.

To verify, try computing a single variable of the output shape (something like result["variable"].compute()). If this succeeds, you know that something goes wrong when writing to disk. However, if it fails (as I suspect), then that means that the issue is with the way you implemented your computation.

Either way, it would be great if you could post a (trimmed down) version of the code, as otherwise all we can do is guess.

Please find the snippet of the code I used

The calculation completes in 30 seconds until the yearly_r95ptot variable is calculated. The memory error occurs when I try to write the variable to a NetCDF file.

How much memory does your computer have?

The server has a total memory of around 251 GB.

I think you need to calculate pn=... inside the lambda (along the lines of

... lambda group: group.where(group > group.quantile(...)).sum(...)

Right now it seems like you generate very large arrays due to broadcasting in the group > pn call?
I recommend inspecting each group manually, and I bet that group.where(group > pn) has the shape of the original array. In that case you would be trying to load size(wet_days) * n_years into memory.

@chaithra could you provide the particular model, grid, and other facets from the CMIP output your are working with? This would help us to look for the same data in the cloud and test this.

Perhaps you could just print the datasets attributes from your file datasets object here?

Julius makes a key observation. If the quantile is indeed calculated on a per year basis, then this is a much simpler problem.

First construct a dummy dataset

I’ve assumed that a month of daily data is in a single netCDF file:

import dask.array
import numpy as np
import xarray as xr

ds = xr.Dataset(
    {
        "pr": (
            ("time", "lat", "lon"),
            dask.array.random.random(
                (730487, 180, 360),
                chunks=(30, -1, -1),  # assuming one month per file
            ),
        )
    },
    coords={"time": xr.date_range("0001-01-01", "2000-12-31", freq="D")},
)
ds

Now rechunk to a yearly frequency

This is the key observation. Since you’re operating on a year of data at a time, you don’t need to rechunk to time=-1 instead you can rechunk so that a year of data is in a single chunk. This rechunking is required for computing the quantile.

# rechunk to yearly frequency
chunks = tuple(
    xr.ones_like(ds.time, dtype=np.int32).groupby("time.year").count().data.tolist()
)
ds = ds.chunk({"time": chunks})
ds

We do not have nice syntax for this yet, but it would be a great contribution.

Now do your groupby.apply

ds = ds.where(ds >= 1)
result = ds.groupby("time.year").apply(
    lambda group: group.where(group > group.quantile(q=0.95)).sum("time")
)
result

This works quite well!

result.compute()

dask

3 Likes

Model details: IPSL-CM6A-LR, piControl, r1i1p1f1.
I have regridded the original grid into a 1x1 grid.

It’s working. Thank you! @dcherian
The detailed explanation has been immensely helpful for my learning process.

Amazing answer Deepak!

I think it’s worth asking–what could we do at the API level to make things “just work” more automatically? What @chaithra wants to do is pretty standard. Yet the tricks required to make it succeed are obscure and only obvious to experts.

Hi,

I would like to get your suggestions on improving the performance of my calculations.

To my understanding, your code uses yearly 95th percentile values as a threshold. I want to compute the climatological 95p values instead of the yearly values. Please see the attached snapshot. Unfortunately, the process is significantly slower than expected.

Could you suggest any methods to speed up this process? I would greatly appreciate any guidance or suggestions you could provide.

Thank you very much for your time and assistance.

What is “expected” here?

Your screenshot with chunksize=(182621,180, 360) is concerning. That’s only 4 chunks in the whole array, which limits the amount of parallelism you can use (with dask at least). Where is this chunksize coming from? How many days or months or years data are stored in a single netCDF file?

That screenshot was taken before applying chunks. Just to clarify, my chunk size is (365, 180, 360) as in your example.

A single NetCDF file from this model contains 500 years of daily data.

A single NetCDF file from this model contains 500 years of daily data.

Right. This is probably the issue. Specify chunks={"time": 365} in open_mfdataset. You should see increased parallelism after that.

Yes, it made the code work really fast. Thanks!

However, writing the result to netcdf seems a bit slow.

Try writing to Zarr instead!