Writing large Xarray datasets to NetCDF

Hi all, I’m working on creating a set of forcing files using HYCOM outputs for a regional ocean model simulation (using ROMS) and am running into some difficulties when I try to create monthly climatology inputs. I’m using xarray’s to_netcdf function for a large dataset (4-D arrays like temperature and salinity are ~4 GB each), and am receiving an error that reads “NetCDF: One or more variable sizes violate format constraints”. I can write the smaller data arrays (e.g. time and 3-D arrays) to the NetCDF file without issues; the error only occurs when I try to write the 4-D arrays. I’ve also played around with some of the other to_netcdf options without success.

ROMS only reads in NetCDF files, so I’m not sure that advice from previous posts about packaging the data using zarr instead would be beneficial. One thought that I had (which could be totally wrong) was that I really only need the data from these arrays for the 5 outermost grid cells where I’m nudging my model to the boundary conditions. Is it possible to then set all other, unneeded interior points that don’t influence the model nudging to zero and then compress the array before writing to avoid this issue? Do others typically just write smaller NetCDF files (i.e., daily) instead, or are there some other tools that can be used to write out large arrays? Thanks for any help, and sorry if this is a question that’s better suited to an Xarray forum instead.

It sounds like you’re running into some limitations based on the format (e.g. NETCDF4, NETCDF3_CLASSIC, etc.) you’re writing to. If I recall correctly, xarray determines a default based upon the dependencies you have installed, but this can also be specified with arguments in to_netcdf as long as you have the necessary dependencies.

I suspect you don’t have the necessary dependencies for the NETCDF4 formats and therefore xarray is defaulting to an engine and format using one of the NETCDF3 formats, which have file-size limitations.
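
For a rough sense of why the 4-D arrays would trip this error while the smaller ones don't, here's a back-of-the-envelope check. The limits are the approximate per-variable limits for fixed-size variables as I understand them from the netCDF documentation (record variables along an unlimited dimension are limited per record instead). The grid shape is made up, so substitute your own dimensions.

```python
import math

# Approximate per-variable size limits (bytes) for fixed-size variables.
# NETCDF4 (HDF5-based) has no comparable per-variable limit.
LIMITS = {
    "NETCDF3_CLASSIC": 2**31 - 4,  # roughly 2 GiB
    "NETCDF3_64BIT": 2**32 - 4,    # roughly 4 GiB
}


def variable_nbytes(shape, itemsize=8):
    """Size in bytes of one array with this shape (itemsize=8 for float64)."""
    return math.prod(shape) * itemsize


def formats_too_small(shape, itemsize=8):
    """Names of the netCDF-3 formats whose limit this variable would exceed."""
    nbytes = variable_nbytes(shape, itemsize)
    return [name for name, limit in LIMITS.items() if nbytes > limit]


# Hypothetical climatology shape: (month, s_rho, eta_rho, xi_rho)
shape = (12, 40, 1000, 1000)
print(variable_nbytes(shape) / 1e9, "GB")
print("exceeds limits of:", formats_too_small(shape))
```

With these assumed dimensions a float64 variable is about 3.8 GB, which is too big for NETCDF3_CLASSIC but would still fit in 64-bit offset. Writing as float32 (itemsize=4) halves the size.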

Do you have the “netCDF4” dependency installed? If not, I’d try installing with that.

Thanks so much for your reply! I had already imported netCDF4, but just tried also adding the h5netcdf package (and a few others just in case?). Unfortunately I’m still getting the same error. Below is just one example, using the NETCDF4 format.

# Write dataset to climatology file
import netCDF4
import h5netcdf
import cftime
import Cython
import setuptools
import h5py

wclm_ds.to_netcdf(write_clm_file, mode='a', format='NETCDF4', engine='netcdf4')

Is there some other package that I’m missing that could be making xarray default to an engine that’s unable to write this?

One more thought.

It looks like you have the mode set to append “a” rather than write “w”. Could the file you’re appending to be in a netCDF 3 format?

The “NETCDF4” format shouldn’t have those size constraints.
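
If you want to check what the existing file actually is before appending, `ncdump -k` on the command line will tell you. As a rough Python-only alternative, here's a sketch that looks at the first few bytes of the file. The function name and the returned labels are just ones I made up, and it assumes the HDF5 signature sits at the very start of the file (the usual case).

```python
def sniff_netcdf_format(path):
    """Guess the on-disk netCDF flavour from the file's leading bytes."""
    with open(path, "rb") as f:
        head = f.read(8)
    # netCDF-3 files start with b"CDF" followed by a version byte.
    if head.startswith(b"CDF\x01"):
        return "classic (netCDF-3)"
    if head.startswith(b"CDF\x02"):
        return "64-bit offset (netCDF-3)"
    if head.startswith(b"CDF\x05"):
        return "64-bit data (CDF-5)"
    # netCDF-4 files are HDF5 files and carry the HDF5 signature.
    if head == b"\x89HDF\r\n\x1a\n":
        return "HDF5-based (NETCDF4 or NETCDF4_CLASSIC)"
    return "unknown"
```

Calling `sniff_netcdf_format(write_clm_file)` on the file you're appending to should show whether it's one of the netCDF-3 flavours with the size constraints.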

Hopefully this helps 🤞.

Ah! Yes, it looks like I did initially create the NetCDF file that write_clm_file is based on using format='NETCDF3_CLASSIC'. I’m trying to rework this now, but unfortunately have run into another issue.
I’ve just recreated two versions of my grid file using the commands below:

nc = netCDF4.Dataset(filename, 'w', format='NETCDF4_CLASSIC')
nc = netCDF4.Dataset(filename, 'w', format='NETCDF4')

However, now when I try and append something else to a copy of this file (using either the NETCDF4 or NETCDF4_CLASSIC version):

# Write dataset to nudging layer file
import netCDF4

wds.to_netcdf(write_nudge_file, mode='a', format='NETCDF4_CLASSIC', engine='netcdf4')

I get an error message like that shown below (this is just a snippet of what I get back):

File /qfs/people/hins978/mambaforge/envs/mamba_env1/lib/python3.10/site-packages/xarray/backends/file_manager.py:198, in CachingFileManager.acquire_context(self, needs_lock)
    195 @contextlib.contextmanager
    196 def acquire_context(self, needs_lock=True):
    197     """Context manager for acquiring a file."""
--> 198     file, cached = self._acquire_with_cache_info(needs_lock)
    199     try:
    200         yield file

File /qfs/people/hins978/mambaforge/envs/mamba_env1/lib/python3.10/site-packages/xarray/backends/file_manager.py:216, in CachingFileManager._acquire_with_cache_info(self, needs_lock)
    214     kwargs = kwargs.copy()
    215     kwargs["mode"] = self._mode
--> 216 file = self._opener(*self._args, **kwargs)
    217 if self._mode == "w":
    218     # ensure file doesn't get overridden when opened again
    219     self._mode = "a"

File src/netCDF4/_netCDF4.pyx:2353, in netCDF4._netCDF4.Dataset.__init__()

File src/netCDF4/_netCDF4.pyx:1963, in netCDF4._netCDF4._ensure_nc_success()

OSError: [Errno -101] NetCDF: HDF error: b'/people/hins978/model_runs/run_v11/CAroms_sponge_v1_nc4classic.nc'

I don’t think that my file is corrupted since I can open it just fine using xarray, I used the nc.close command when I finished creating it, and I can see all the variables using ncdump. Is there another way of checking whether or not it’s still accessible when appending data to it? Thanks again for the help!

Hm, I think that generally indicates that the file read is failing for some reason (a corrupt file being one, but there might be others as well - in this case maybe a file lock?).

It looks like you might be mixing use of the netCDF4 package directly and as a backend for xarray. This might be causing the issue. I’d try to stick with one if possible.

If you’re looking for full xarray examples, the IO section of the user guide has a few: Reading and writing files

Also, I should have pointed you to this sooner, but I think in general you might get more (and more knowledgeable) eyes on your usage questions in the GitHub discussions for xarray: pydata/xarray · Discussions · GitHub. Hopefully someone will chime in if this is not the case.

Thanks so much, I was finally able to get it to work! My error was in first opening the same NetCDF file that I was trying to append to, which I assume is what caused some kind of file-lock issue. I also realized that after appending to my grid NetCDF file, the format had changed from NETCDF4 or NETCDF4_CLASSIC to 64-bit offset. I changed this using ncks with its --fl_fmt option, and then my code below worked well.

# Write dataset to climatology file
import netCDF4
import h5netcdf

wclm_ds.to_netcdf(write_clm_file, mode='a', format='NETCDF4', engine='netcdf4')

I really appreciate your help in figuring this out, this will save me a ton of time! Thanks as well for the links to other examples and forums.

Glad you got it working!