How to save out-of-memory Xarray file to .csv

I am trying to save a large Xarray dataset to CSV; however, I have been struggling with memory issues and dataset formatting. Since my data is quite large, I have been using Dask and opening my netCDF files with:

xarray_data_ssp126 = xr.open_mfdataset('filename*.nc', combine='by_coords', parallel=True)

I do not want to use every single point in the data above, so below I import my desired lat/lon points and convert them from .csv (gem_df) to Xarray (gem):

gem = gem_df.to_xarray()

gem = gem.assign_coords(lat = ("lat", gem_df.lat), lon = ("lon", gem_df.lon))
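For anyone reproducing this, here is a runnable toy version of that step; the lat/lon values (and the tiny gem_df stand-in) are made up:

```python
import pandas as pd

# Hypothetical stand-in for the CSV of desired points (values are made up).
gem_df = pd.DataFrame({"lat": [40.0, 41.5], "lon": [-105.0, -104.2]})

# Convert the DataFrame to an Xarray dataset (dims: "index").
gem = gem_df.to_xarray()

# Promote lat/lon from data variables to coordinates, each with its own dim.
gem = gem.assign_coords(lat=("lat", gem_df.lat), lon=("lon", gem_df.lon))
```

Note that this gives lat and lon *separate* dimensions, which matters for the memory discussion later in the thread.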

I combined the gem dataset and the xarray_data_ssp126 dataset using nearest-neighbor interpolation:

data = xarray_data_ssp126.interp(lat = gem.lat, lon = gem.lon, method = 'nearest')

I then wanted to combine the ‘data’ dataset with gem to have some of my other data variables back.

data = data.combine_first(gem)

data

Which leaves me with the exact data I want (yay!); however, I have been struggling to find a way to export this data due to memory issues. I have read that you have previously suggested exporting to Zarr, which unfortunately did not work for me. I tried to reduce the data from all of the United States to just state-level data, which did not work either. The data is daily, and I am trying to save it in chunks of 25 years.
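As a sketch of the 25-year chunking I have in mind (using a tiny made-up dataset, a hypothetical sfcWind variable, and 2-year windows so it runs quickly):

```python
import numpy as np
import pandas as pd
import xarray as xr

# Tiny made-up daily dataset standing in for the real data.
time = pd.date_range("2015-01-01", periods=4 * 365, freq="D")
ds = xr.Dataset(
    {"sfcWind": (("time", "point"), np.random.rand(time.size, 3))},
    coords={"time": time, "point": [0, 1, 2]},
)

# One CSV per 2-year window (25-year windows in the real case),
# so only one window is materialized in memory at a time.
for start in range(2015, 2019, 2):
    window = ds.sel(time=slice(f"{start}-01-01", f"{start + 1}-12-31"))
    window.to_dataframe().to_csv(f"sfcWind_{start}_{start + 1}.csv")
```

Each `.to_dataframe()` call only loads the selected time window, not the whole dataset.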

I have been able to do this previously when ignoring the coordinates (not sure if this is the right terminology) by using set_coords(("lat", "lon")) instead of the assign_coords call listed above, and by selecting with the nearest-neighbor method instead of interpolating. By "ignoring the coordinates" I mean that when I did not assign the coordinates for gem, my data variables just said "index" instead of time, lat, lon.

I was wondering if you have any advice on how to go about saving these to csv?


Thanks for posting this interesting question.

I’m not convinced that this is actually the exact data you want. Based on your description, it looks like you want a dataset with 1121 different points in it (and 31411 timesteps); i.e. a 2D array. However, here you have a 3D array (31411 x 1121 x 1121); a subsample of your original datacube. This is 1121 times bigger than what you want, which could be why you are running out of memory.

You should probably be using Xarray’s “vectorized indexing” here:

For example, something like

lat_index = xr.DataArray(gem_df.lat, dims="point")
lon_index = xr.DataArray(gem_df.lon, dims="point")
data = xarray_data_ssp126.sel(lon=lon_index, lat=lat_index, method="nearest")

I’m still not certain that will work properly with Dask. But at least it should be the right data structure.
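To illustrate with a throwaway dataset (all names and values made up), the vectorized result is 2D rather than 3D:

```python
import numpy as np
import xarray as xr

# Made-up (time, lat, lon) cube standing in for the model output.
ds = xr.Dataset(
    {"sfcWind": (("time", "lat", "lon"), np.random.rand(5, 4, 6))},
    coords={
        "time": np.arange(5),
        "lat": np.linspace(30, 45, 4),
        "lon": np.linspace(-110, -90, 6),
    },
)

# Both indexers share the "point" dimension, so .sel pulls out one
# time series per point instead of forming a lat x lon cross product.
lat_index = xr.DataArray([31.0, 44.0], dims="point")
lon_index = xr.DataArray([-109.0, -91.0], dims="point")
out = ds.sel(lat=lat_index, lon=lon_index, method="nearest")

print(out["sfcWind"].dims)  # ('time', 'point')
```

The result has one row of values per point per timestep, which is the 2D structure described above.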

Try this out and report back here!



Hi!

First of all, thank you so much for replying!

I was able to get the code to work as you suggested. To clarify, I am looking for a dataframe that has the time, the surface wind speed value, and the coordinates (lat, lon). So yes, a 2D array, I think! I got it working by implementing what you suggested:

lat_index = xr.DataArray(gem_df.lat, dims="point")
lon_index = xr.DataArray(gem_df.lon, dims="point")
data = xarray_data_ssp126.interp(lon=lon_index, lat=lat_index, method="nearest")
data = data.combine_first(gem)

Which results in a dataset that has both an index and a point dimension (screenshot omitted).

I was wondering: could I just replace 'point' with 'index', since they have the same length? And is it okay to have that many dimensions?

I am still facing the same issues when saving, even when trying to switch to a Dask dataframe. I am trying to save in chunks of 25 years, so instead of 31411 time steps I just want 9131, but that does not work either.

Unable to save to .csv [more like unable to make it past to_dataframe()]:
MemoryError: Unable to allocate 3.53 GiB for an array with shape (3652, 360, 720) and data type float32

and unable to save to a Dask dataframe:
MemoryError: Unable to allocate 102. PiB for an array with shape (14419187630906411,) and data type datetime64[ns]

Hi all! I ended up being able to transfer my data to Dask after using the .sel method instead of the .interp method. Once in Dask, I was able to save to .csv. Not sure why this only worked with .sel instead of .interp, but I am happy it did 🙂
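For future readers, a toy end-to-end version of the pipeline that worked (made-up data, variable, and file name; one guess as to why .interp struggled is that it builds an interpolation computation, while .sel is a plain index lookup):

```python
import numpy as np
import pandas as pd
import xarray as xr

# Made-up stand-in for the model output.
time = pd.date_range("2015-01-01", periods=10, freq="D")
ds = xr.Dataset(
    {"sfcWind": (("time", "lat", "lon"), np.random.rand(10, 3, 3))},
    coords={"time": time, "lat": [30.0, 35.0, 40.0], "lon": [-100.0, -95.0, -90.0]},
)

# Vectorized nearest-neighbor selection onto the desired points.
lat_index = xr.DataArray([31.0, 39.0], dims="point")
lon_index = xr.DataArray([-99.0, -91.0], dims="point")
points = ds.sel(lat=lat_index, lon=lon_index, method="nearest")

# Flatten to one row per (time, point) and write out.
df = points.to_dataframe()
df.to_csv("sfcWind_points.csv")
```

With Dask-backed data, the same point-wise result can instead go through points.to_dask_dataframe() and then Dask's .to_csv("out-*.csv"), which writes one file per partition without loading everything at once.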
