Code hangs while saving dataset to disk using .to_netcdf()

Hi all,

I am trying to save some datasets (~20 GB in size) to a directory on the cluster.

ds11.to_netcdf('/data/keeling/a/sudhansu/i/file.nc')

The code keeps running even after the ~20 GB file has been created on disk.

If I interrupt the code I can read the saved file, but the code does not stop even after hours.

I have enough disk space to save the file.

Any help with this will be much appreciated.

Hey @sudhansu-s-rath!

Would writing your dataset to Zarr be an option?


import xarray as xr
from dask.distributed import Client

# start a local dask cluster
client = Client(n_workers=8)
client

# ... ds11 = xr.open_dataset(..., chunks={})

ds11.to_zarr('dataset_name.zarr', consolidated=True)
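One way to tell a stuck write apart from one that is still computing is to build the write lazily and watch a progress bar (a sketch, assuming the client above is running and ds11 is dask-backed):

from dask.distributed import progress

# build the task graph for the write without executing it
delayed_write = ds11.to_zarr('dataset_name.zarr', consolidated=True, compute=False)

# submit it to the cluster and display a progress bar while it runs
future = client.compute(delayed_write)
progress(future)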

I tried both the NetCDF and Zarr formats, and also tried saving with a Dask client. It's still the same issue: the code keeps running and never stops. I saved precipitation data with the same procedure earlier (~117 GB for each model file), so I guess the problem is with the 3D pressure-level variables.
If anyone wants to reproduce the error:

I am trying to save MIROC6_historical from gs://cmip6/CMIP6/CMIP/MIROC/MIROC6/historical/r1i1p1f1/6hrLev/ta/gn/v20191114/
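For reference, one way to read this store lazily into ds1 is something like the following (a sketch; the anonymous-access token and the consolidated flag are assumptions, not part of my original code):

import gcsfs
import xarray as xr

# open the CMIP6 Zarr store on Google Cloud without downloading it
fs = gcsfs.GCSFileSystem(token='anon')
store = fs.get_mapper('gs://cmip6/CMIP6/CMIP/MIROC/MIROC6/historical/r1i1p1f1/6hrLev/ta/gn/v20191114/')
ds1 = xr.open_zarr(store, consolidated=True)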

After reading this store as ds1, I interpolate to get a single pressure level:

import numpy as np
import geocat.comp as gc

ta = ds1.ta
ps = ds1.ps
hyam = ds1.a
hybm = ds1.b
p0 = ds1.p0

new_levels = np.array([85000])

ta_850 = gc.interpolation.interp_hybrid_to_pressure(
    ta, ps, hyam, hybm, p0,
    new_levels=new_levels, lev_dim=None, method='linear',
    extrapolate=False, variable=None, t_bot=None, phi_sfc=None,
)

Now I want to save the dataset to my local cluster.
ta_850.to_zarr('/path/miroc6.zarr', consolidated=True)
or
ta_850.to_netcdf('/path/miroc6.nc')
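Depending on the xarray version, to_zarr may only be defined on Dataset, so a safer form of the Zarr write is something like this (a sketch; the variable name 'ta' is just a placeholder):

# wrap the DataArray in a Dataset before writing to Zarr
ta_850.to_dataset(name='ta').to_zarr('/path/miroc6.zarr', consolidated=True)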

Any help in solving this will be appreciated. Thanks in advance.

Are you sure this is a problem with saving to disk? Your code basically does three things:

  • Reads data from google cloud
  • Does the interpolation (gc.interpolation.interp_hybrid_to_pressure)
  • Writes it to local disk

Because you’re using Dask and operating lazily, all three of these things happen at once when you call to_zarr or to_netcdf. For netcdf, it may allocate disk space for the file at the beginning, but it may still be computing the actual data for a long time after that initial step.

To understand what’s going wrong, you might want to break it down into individual steps. For example:

  1. Try just reading data and not saving it. E.g. call ta.mean().compute(). If this is slow, it means that your network connection doesn’t have enough bandwidth to pull all of this data down to your local machine efficiently. (Btw, how big is the dataset in question?)
  2. If that works okay, then bring in the interpolation, but without saving, e.g. ta_850.mean().compute(). Perhaps this interpolation routine is very slow?
  3. Finally, if that works okay, move on to saving.

I’m guessing you will find the problem before hitting step 3.
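A minimal sketch of those steps, reusing the names from the post above (ta and ta_850) and simply timing each stage:

import time

# 1. pure read: pull the data over the network and reduce it
t0 = time.time()
ta.mean().compute()
print('read only:', time.time() - t0, 'seconds')

# 2. read + interpolation, still without writing anything to disk
t0 = time.time()
ta_850.mean().compute()
print('read + interpolation:', time.time() - t0, 'seconds')

# 3. only if both of the above finish in reasonable time, move on to the actual write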

4 Likes

Thank you, Ryan.
After your suggestion, I figured out that the interpolation is what takes so much time. I put the same steps into a script and submitted it as a job on my campus cluster. It took 20 hours to complete for one parameter (ta) with 94964 time stamps. But when I run the same thing for ua and va, it is still running after more than 48 hours. I am not sure what exactly is taking so much longer here than it did for ta.
PS: For ta, ua, and va the model and time period are exactly the same.

Sounds like you should submit an issue to the GeoCAT GitHub:

This looks potentially relevant - you should show your failing example there.

Thank you @TomNicholas @rabernat

I have posted this on the GeoCAT GitHub. Hoping to get a solution.

1 Like

You could try using XGCM instead:

https://xgcm.readthedocs.io/en/latest/transform.html#id1
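A rough sketch of what that could look like for this dataset (the hybrid-coordinate names lev, a, b, p0, and ps follow the post above; the single-axis grid setup is an assumption, not something from this thread):

import numpy as np
import xgcm

# full 3D pressure field from the hybrid coefficients: p = a * p0 + b * ps
p = ds1.a * ds1.p0 + ds1.b * ds1.ps

# a grid with one vertical axis whose cell centers sit on the model's lev dimension
grid = xgcm.Grid(ds1, coords={'Z': {'center': 'lev'}}, periodic=False)

# linearly interpolate ta onto the 850 hPa (85000 Pa) surface
ta_850 = grid.transform(ds1.ta, 'Z', np.array([85000.0]), target_data=p, method='linear')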