Code hangs while saving dataset to disk using .to_netcdf()

Are you sure this is a problem with saving to disk? Your code basically does three things:

  • Reads data from Google Cloud
  • Does the interpolation (gc.interpolation.interp_hybrid_to_pressure)
  • Writes it to local disk

Because you’re using Dask and operating lazily, all three of these things happen at once when you call to_zarr or to_netcdf. For netCDF, xarray may create the file and allocate disk space at the beginning, but it can still be computing the actual data for a long time after that initial step, which looks like a hang even though work is being done.
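If you want to confirm that work really is happening during the apparent hang, wrapping the save in Dask's progress bar is a quick check. This is only a sketch: ds_out and the output filename are placeholders for whatever your script actually uses.

    from dask.diagnostics import ProgressBar

    # Sketch only: ds_out / "ta_850.nc" are placeholders for your own dataset and path.
    # The read from Google Cloud, the interpolation, and the write all execute here.
    with ProgressBar():
        ds_out.to_netcdf("ta_850.nc")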

To understand what’s going wrong, you might want to break it down into individual steps. For example:

  1. Try just reading the data without saving it, e.g. call ta.mean().compute(). If this is slow, it means your network connection doesn’t have enough bandwidth to pull all of this data down to your local machine efficiently. (Btw, how big is the dataset in question?)
  2. If that works okay, bring in the interpolation, still without saving, e.g. ta_850.mean().compute(). Perhaps the interpolation routine itself is slow?
  3. Finally, if both of those work okay, move on to saving. A timing sketch of all three steps follows this list.
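For concreteness, here is a minimal sketch of the three steps, assuming ta is your lazily loaded DataArray and ta_850 the interpolated result, as in your script; the names and output filename are placeholders.

    import time

    # Step 1: read only -- exercises the network path from Google Cloud.
    t0 = time.perf_counter()
    print(ta.mean().compute())
    print(f"read only: {time.perf_counter() - t0:.1f} s")

    # Step 2: read + interpolate, still nothing written to disk.
    t0 = time.perf_counter()
    print(ta_850.mean().compute())
    print(f"read + interpolation: {time.perf_counter() - t0:.1f} s")

    # Step 3: the full pipeline, now including the write to disk.
    t0 = time.perf_counter()
    ta_850.to_netcdf("ta_850.nc")
    print(f"full pipeline: {time.perf_counter() - t0:.1f} s")

Whichever step shows the big jump in wall time is where the real bottleneck is.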

I’m guessing you will find the problem before hitting step 3.
