Code hangs while saving dataset to disk using .to_netcdf()

Hi all,

I am trying to save some datasets (~20 GB in size) to a directory on the cluster.

ds11.to_netcdf('/data/keeling/a/sudhansu/i/file.nc')

The code keeps running even after the ~20 GB file has been created on disk.

If I interrupt the code I can read the saved file, but the code does not stop even after hours.

I have enough disk space to save the file.

Any help with this will be much appreciated.

Hey @sudhansu-s-rath!

Would writing your dataset to Zarr be an option?


import xarray as xr
from dask.distributed import Client

# start a local dask cluster
client = Client(n_workers=8)
client

# ... ds11 = xr.open_dataset(..., chunks={})

ds11.to_zarr('dataset_name.zarr', consolidated=True)
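One way to tell a stuck write apart from one that is still computing is to build the write lazily and watch a progress bar (a sketch, assuming the client above is running and ds11 is dask-backed):

from dask.distributed import progress

# build the task graph for the write without executing it
delayed_write = ds11.to_zarr('dataset_name.zarr', consolidated=True, compute=False)

# submit it to the cluster and display a progress bar while it runs
future = client.compute(delayed_write)
progress(future)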

I tried both the NetCDF and Zarr formats, and also tried saving with a Dask client. It's still the same issue: the code keeps running and never stops. I saved precipitation data with the same procedure earlier (~117 GB for each model file), so I guess the problem is with the 3D pressure-level variables.
If anyone wants to reproduce the error:

I am trying to save MIROC6_historical from gs://cmip6/CMIP6/CMIP/MIROC/MIROC6/historical/r1i1p1f1/6hrLev/ta/gn/v20191114/
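For reference, one way to read this store lazily into ds1 is something like the following (a sketch; the anonymous-access token and the consolidated flag are assumptions, not part of my original code):

import gcsfs
import xarray as xr

# open the CMIP6 Zarr store on Google Cloud without downloading it
fs = gcsfs.GCSFileSystem(token='anon')
store = fs.get_mapper('gs://cmip6/CMIP6/CMIP/MIROC/MIROC6/historical/r1i1p1f1/6hrLev/ta/gn/v20191114/')
ds1 = xr.open_zarr(store, consolidated=True)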

After reading this store as ds1, I interpolate to get a single pressure level:

import numpy as np
import geocat.comp as gc

ta = ds1.ta
ps = ds1.ps
hyam = ds1.a
hybm = ds1.b
p0 = ds1.p0

new_levels = np.array([85000])

ta_850 = gc.interpolation.interp_hybrid_to_pressure(
    ta, ps, hyam, hybm, p0,
    new_levels=new_levels, lev_dim=None, method='linear',
    extrapolate=False, variable=None, t_bot=None, phi_sfc=None,
)

Now I want to save the dataset to my local cluster.
ta_850.to_zarr('/path/miroc6.zarr', consolidated=True)
or
ta_850.to_netcdf('/path/miroc6.nc')
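Depending on the xarray version, to_zarr may only be defined on Dataset, so a safer form of the Zarr write is something like this (a sketch; the variable name 'ta' is just a placeholder):

# wrap the DataArray in a Dataset before writing to Zarr
ta_850.to_dataset(name='ta').to_zarr('/path/miroc6.zarr', consolidated=True)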

Any help in solving this will be appreciated. Thanks in advance.

Are you sure this is a problem with saving to disk? Your code basically does three things:

  • Reads data from google cloud
  • Does the interpolation (gc.interpolation.interp_hybrid_to_pressure)
  • Writes it to local disk

Because you’re using Dask and operating lazily, all three of these things happen at once when you call to_zarr or to_netcdf. For netcdf, it may allocate disk space for the file at the beginning, but it may still be computing the actual data for a long time after that initial step.

To understand what’s going wrong, you might want to break it down into individual steps. For example:

  1. Try just reading data and not saving it. E.g. call ta.mean().compute(). If this is slow, it means that your network connection doesn’t have enough bandwidth to pull all of this data down to your local machine efficiently. (Btw, how big is the dataset in question?)
  2. If that works okay, then bring in the interpolation, but without saving, e.g. ta_850.mean().compute(). Perhaps this interpolation routine is very slow?
  3. Finally, if that works okay, move on to saving.

I’m guessing you will find the problem before hitting step 3.
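A minimal sketch of those steps, reusing the names from the post above (ta and ta_850) and simply timing each stage:

import time

# 1. pure read: pull the data over the network and reduce it
t0 = time.time()
ta.mean().compute()
print('read only:', time.time() - t0, 'seconds')

# 2. read + interpolation, still without writing anything to disk
t0 = time.time()
ta_850.mean().compute()
print('read + interpolation:', time.time() - t0, 'seconds')

# 3. only if both of the above finish in reasonable time, move on to the actual write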

4 Likes

Thank you, Ryan.
After your suggestion, I figured out that the interpolation is what takes so much time. I put the same steps into a script and submitted it as a job on my campus cluster. It took 20 hours to complete for one parameter (ta) with 94964 time stamps. But when I run the same thing for ua and va, it is still running after more than 48 hours. I am not sure what exactly is taking so much longer here than it did for ta.
PS: For ta, ua, and va the model and time period are exactly the same.

Sounds like you should submit an issue to the GeoCAT GitHub:

This looks potentially relevant - you should show your failing example there.

Thank you @TomNicholas @rabernat

I have posted this on the GeoCAT GitHub. Hoping to get a solution.

1 Like

You could try using XGCM instead:

https://xgcm.readthedocs.io/en/latest/transform.html#id1
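A rough sketch of what that could look like for this dataset (the hybrid-coordinate names lev, a, b, p0, and ps follow the post above; the single-axis grid setup is an assumption, not something from this thread):

import numpy as np
import xgcm

# full 3D pressure field from the hybrid coefficients: p = a * p0 + b * ps
p = ds1.a * ds1.p0 + ds1.b * ds1.ps

# a grid with one vertical axis whose cell centers sit on the model's lev dimension
grid = xgcm.Grid(ds1, coords={'Z': {'center': 'lev'}}, periodic=False)

# linearly interpolate ta onto the 850 hPa (85000 Pa) surface
ta_850 = grid.transform(ds1.ta, 'Z', np.array([85000.0]), target_data=p, method='linear')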