Processing large (too large for memory) xarray datasets, and writing to netcdf

I am developing a spatially distributed hydrological model which can report output on an hourly time-step. The output grids can be large in terms of number of rows and columns; i.e., 5000 rows x 5000 columns is not unusual. So for each hourly time-step I need to write (or store in memory) a grid of 5000 x 5000 (dtype=float32). You can imagine this becomes an enormous amount of storage (in memory = not possible, or on disk) if the model is run for example for 20 years: 20 * 365 * 24 = 175,200 grids of 5000 rows x 5000 columns. This becomes even worse if I have 20 or more variables to report. Storing it in memory is not possible cause it already blows before one month is processed. Also, on top of the hourly reporting, I do like to aggregate the hourly grids to daily, monthly, and annual grids, and write this to netcdf as well.

First I thought I write each variable as a xr.Dataset to a netcdf file for each hourly time-step, and then open them again with xr.open_mfdataset after the files for one month have been written to disk, and then do some aggregation to days, months with it. However, this all results in out-of-memory issues. Also the writing of large datasets to netcdf takes a very long time.

Does anyone have any advice / suggestions on how-to do this in a smart / proper way? That would be much appreciated.

Cheers,
Wilco

2 Likes

I recommend you try to write to a single Zarr store, appending each timestep as you go.

Thanks. That works indeed way better.

1 Like

Awesome! It would be great if you could post a quick code snippet showing what you did! That way others can learn from your solution.

I have inserted the main snipped of the code that appends to zarr below. Although the appending works, it becomes very slow and almost unworkable after appending a few hundred files, especially considering there’s a few thousand files more to append. I guess the solution is therefore not optimal.

t = pd.date_range(pd.Timestamp(2001,1,1,0,0,0), pd.Timestamp(2001,1,31,23,0,0), freq=‘1H’)

j = 0
for tt in t:

if j == 0:
    dd.to_zarr(tempdir, append_dim='time', mode='a', encoding=encoding)
else:
    dd.to_zarr(tempdir, append_dim='time', mode='a')

j+=1
1 Like

I think there is more that can be done to optimize. For example, are you setting the chunks explicitly? That is usually the biggest performance variable with Zarr.

Ideally the time to append should remain constant as your dataset grows. It would be useful to see a snakeviz profile of a single .append call for a large j value. If you can make that profile and then share it here, we can try to help you optimize.

An alternative approach, if you know the total number of timesteps in advance, would be to initialize the whole Zarr store at the beginning with a lazy “template” dataset, i.e.

ds_template.to_zarr(tempdir, compute=False)

and then use the region argument to write, i.e. something like

for j in range(ntimesteps):
    dd.to_zarr(tempdir, region={'time': slice(j, j+1)}

This might be faster because it won’t actually open the Zarr store with Xarray each time. That’s what I think is happening with append.

This seems to be an interesting approach I’d like to test. However, I am slightly confused here as path and tempdir are not the same directory? Also, what does ds_template look like / contain?

I have tried the code below, but it gives me the error shown at the end which doesn’t make sense to me as they do have dimensions in common.

Vectors containing x- and y-coordinates for center of cells

model_x_grid = np.arange(xll + cellRes/2, xll + cellRes/2 + cols * cellRes, cellRes)
model_y_grid = np.arange(yll + cellRes/2, yll + cellRes/2 + rows * cellRes, cellRes)

tic = time.perf_counter()

Some date / time settings

times_units = ‘hours since 0001-01-01 00:00:00.0’
times_calendar = ‘proleptic_gregorian’

t = pd.date_range(pd.Timestamp(2001,1,1,0,0,0), pd.Timestamp(2001,12,31,23,0,0), freq=‘1H’)
dates = [datetime(2001,1,1,0,0,0)+ n*timedelta(hours=1) for n in range(len(t))]
dates = date2num(dates, units=times_units, calendar=times_calendar)

tic = time.perf_counter()

create dummy data array with the dimensions time, y, x

data = xr.DataArray(clone, dims=(‘y’, ‘x’), coords={‘x’: model_x_grid, ‘y’: np.flip(model_y_grid)})
data = data.expand_dims({‘time’: dates})

convert into template dataset

ds_template = xr.Dataset(dict(prec=data, tavg=data))

write empty template to zarr

ds_template.to_zarr(tempdir, compute=False)

for j in range(len(dates)):

tt = dates[j]
# Create dataArray for precipitation
data = xr.DataArray(clone, dims=('y', 'x'), coords={'x': model_x_grid, 'y': np.flip(model_y_grid)})
data = data.expand_dims({'time': [tt]})
data.attrs['long_name'] = 'precipitation amount (mm)'
data.attrs['standard_name'] = 'precipitation_amount'
data.attrs['units'] = 'kg m-2'
data.name = 'prec'
# Create dataArray for temperature
data1 = xr.DataArray(clone, dims=('y', 'x'), coords={'x': model_x_grid, 'y': np.flip(model_y_grid)})
data1 = data1.expand_dims({'time': [tt]})
data1.attrs['long_name'] = 'temperature (circ C)'
data1.attrs['standard_name'] = 'temperature'
data1.attrs['units'] = 'degrees Celsius'
data1.name = 'tavg'
# Combine the two into a dataset
dd = xr.Dataset(dict(prec=data, tavg=data1))
dd.attrs['Conventions'] = 'CF-1.8'
# Write to zarr using region and slicing
dd.to_zarr(tempdir, region={'time': slice(j, j+1)})

toc = time.perf_counter()
print(‘Run finished in %.2f minutes’ %((toc-tic)/60.))

ValueError: when setting region explicitly in to_zarr(), all variables in the dataset to write must have at least one dimension in common with the region’s dimensions [‘time’], but that is not the case for some variables here. To drop these variables from this dataset before exporting to zarr, write: .drop([‘x’, ‘y’])

1 Like

Totally a typo.:man_facepalming:Fixed in my comment above.

Thanks. But how about that ValueError I am getting? Don’t see what’s going wrong here as the dimension time is present in both. Is this related to how ds_template is created?

The problem is that there are other variables you are trying to write that don’t have time as a dimension: the coordinate variables x and y. You don’t want to re-write these every time you write a new timestep. Xarray is telling you how to solve it! Call

dd = dd.drop(['x', 'y'])

before calling dd.to_zarr().

Another thing to note: make sure ds_template uses dask-backed arrays for all of the data variables (with your desired chunk structure). Otherwise you will end up writing a lot of unnecessary data.

1 Like

Hi! I was reading through this because I am trying to do something similar.
I am trying to write an xarray dataset to a zarr file.
The dataset looks as follows:

<xarray.Dataset> Size: 55MB
Dimensions:      (x: 3710, y: 3710, time: 1)
Coordinates:
    spatial_ref  int64 8B 0
  * x            (x) float64 30kB -122.4 -122.4 -122.4 ... -121.4 -121.4 -121.4
  * y            (y) float64 30kB 41.34 41.34 41.34 41.34 ... 40.34 40.34 40.34
  * time         (time) datetime64[ns] 8B 2021-01-01

And I am initialising the template as follows:

times_without_first = precip["time"].values[1:] # Because 1st date of input data is skipped, as runoff needs to be initialised

    # Create a template dataset
    template = xr.Dataset(
        {
            "runoff": (
                ['x', 'y'],
                np.empty((len(smax.y), len(smax.x))),
            )
        },
        coords={"time": times_without_first, "y": smax.y, "x": smax.x},
    )
    template["runoff"][:] = np.nan 


    # Save template to Zarr
    zarr_path = save_dir / "tmp_snow_FH" / "runoff_daily_2021_TEST.zarr"
    template.to_zarr(zarr_path, compute=False, consolidated=True)

Now, when trying to execute:

        runoff = runoff.drop_vars(['spatial_ref', 'x', 'y'])
        runoff.to_zarr(zarr_path, region={"time": slice(t, t+1)})

I also get the ValueError:

ValueError: when setting region explicitly in to_zarr(), all variables in the dataset to write must have at least one dimension in common with the region's dimensions ['time'], but that is not the case for some variables here. To drop these variables from this dataset before exporting to zarr, write: .drop_vars(['runoff'])

However, I don’t wish to drop ‘runoff’ because that is the variable that I am trying to write to the zarr file. Would someone have any suggestions on how to fix this?

Thanks in advance! : )

Write it after you create your template and then drop it before writing the regions.