Code hangs while saving dataset to disk using .to_netcdf()

Are you sure this is a problem with saving to disk? Your code basically does three things:

  • Reads data from Google Cloud
  • Does the interpolation (gc.interpolation.interp_hybrid_to_pressure)
  • Writes it to local disk

Because you’re using Dask and operating lazily, all three of these things happen at once when you call to_zarr or to_netcdf. For netCDF, xarray may create the file and allocate disk space at the beginning, but it can still be computing the actual data for a long time after that initial step, which looks like a hang even though work is being done.
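If you want to confirm that work really is happening during the apparent hang, wrapping the save in Dask's progress bar is a quick check. This is only a sketch: ds_out and the output filename are placeholders for whatever your script actually uses.

    from dask.diagnostics import ProgressBar

    # Sketch only: ds_out / "ta_850.nc" are placeholders for your own dataset and path.
    # The read from Google Cloud, the interpolation, and the write all execute here.
    with ProgressBar():
        ds_out.to_netcdf("ta_850.nc")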

To understand what’s going wrong, you might want to break it down into individual steps. For example:

  1. Try just reading the data without saving it, e.g. call ta.mean().compute(). If this is slow, it means your network connection doesn’t have enough bandwidth to pull all of this data down to your local machine efficiently. (Btw, how big is the dataset in question?)
  2. If that works okay, bring in the interpolation, still without saving, e.g. ta_850.mean().compute(). Perhaps the interpolation routine itself is slow?
  3. Finally, if both of those work okay, move on to saving. A timing sketch of all three steps follows this list.
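For concreteness, here is a minimal sketch of the three steps, assuming ta is your lazily loaded DataArray and ta_850 the interpolated result, as in your script; the names and output filename are placeholders.

    import time

    # Step 1: read only -- exercises the network path from Google Cloud.
    t0 = time.perf_counter()
    print(ta.mean().compute())
    print(f"read only: {time.perf_counter() - t0:.1f} s")

    # Step 2: read + interpolate, still nothing written to disk.
    t0 = time.perf_counter()
    print(ta_850.mean().compute())
    print(f"read + interpolation: {time.perf_counter() - t0:.1f} s")

    # Step 3: the full pipeline, now including the write to disk.
    t0 = time.perf_counter()
    ta_850.to_netcdf("ta_850.nc")
    print(f"full pipeline: {time.perf_counter() - t0:.1f} s")

Whichever step shows the big jump in wall time is where the real bottleneck is.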

I’m guessing you will find the problem before hitting step 3.
