Using Dask client and running out of memory

Hi Pangeo support team,

I’ve been trying to analyze petabytes of model output. I’m working with a small portion of it (on the order of a few hundred GB to ~1 TB). If I do the calculations on the fly for plotting, I have no issues. But if I want to evaluate a calculation and save it to netCDF, I run the compute nodes (NASA NAS Rome nodes) out of memory!
Here is a short description of the sample dataset:

I have 431 files (one per day), each with size lat × lon = 2000 × 3499, for surface salinity. I use xarray and Dask to read the data and perform a very simple task: calculating the running mean. If I print one value of the output, it takes ~30 s:

%%time
ds_anomaly.isel(lat=0, lon=0, date=0).Salt.values
CPU times: user 23.1 s, sys: 6.59 s, total: 29.7 s
Wall time: 29.7 s

First method:
If I try to save it to netCDF, or call ds_anomaly.values, it runs out of memory.

I tried to use a Dask client with:
from dask.distributed import Client
client = Client(n_workers=20, threads_per_worker=2, memory_limit='6GB')
But I got errors such as the following:

minor: Can't open object

#004: H5Aint.c line 545 in H5A__open_by_name(): unable to load attribute info from object header
major: Attribute
minor: Unable to initialize object
#005: H5Oattribute.c line 494 in H5O__attr_open_by_name(): can't locate attribute: '_QuantizeBitRoundNumberOfSignificantBits'
major: Attribute
minor: Object not found
Traceback (most recent call last):
File "/home3/asaberi1/scripts/analyzing_DYOMOND/ComputeMJO_Anom_ondailymean_daskclient.py", line 53, in <module>
ar_anomaly=ds_anomaly.Salt.values
File “/home3/asaberi1/home_pyenvs/py_analysis/lib/python3.9/site-packages/xarray/core/dataarray.py”, line 738, in values
return self.variable.values
File “/home3/asaberi1/home_pyenvs/py_analysis/lib/python3.9/site-packages/xarray/core/variable.py”, line 607, in values
return _as_array_or_item(self._data)
File “/home3/asaberi1/home_pyenvs/py_analysis/lib/python3.9/site-packages/xarray/core/variable.py”, line 313, in _as_array_or_item
data = np.asarray(data)
File "/home3/asaberi1/home_pyenvs/py_analysis/lib/python3.9/site-packages/dask/array/core.py", line 1700, in __array__
x = self.compute()
File “/home3/asaberi1/home_pyenvs/py_analysis/lib/python3.9/site-packages/dask/base.py”, line 314, in compute
(result,) = compute(self, traverse=False, **kwargs)
File “/home3/asaberi1/home_pyenvs/py_analysis/lib/python3.9/site-packages/dask/base.py”, line 599, in compute
results = schedule(dsk, keys, **kwargs)
File “/home3/asaberi1/home_pyenvs/py_analysis/lib/python3.9/site-packages/distributed/client.py”, line 3186, in get
results = self.gather(packed, asynchronous=asynchronous, direct=direct)
File “/home3/asaberi1/home_pyenvs/py_analysis/lib/python3.9/site-packages/distributed/client.py”, line 2345, in gather
return self.sync(
File “/home3/asaberi1/home_pyenvs/py_analysis/lib/python3.9/site-packages/distributed/utils.py”, line 349, in sync
return sync(
File “/home3/asaberi1/home_pyenvs/py_analysis/lib/python3.9/site-packages/distributed/utils.py”, line 416, in sync
raise exc.with_traceback(tb)
File “/home3/asaberi1/home_pyenvs/py_analysis/lib/python3.9/site-packages/distributed/utils.py”, line 389, in f
result = yield future
File “/home3/asaberi1/home_pyenvs/py_analysis/lib/python3.9/site-packages/tornado/gen.py”, line 769, in run
value = future.result()
File “/home3/asaberi1/home_pyenvs/py_analysis/lib/python3.9/site-packages/distributed/client.py”, line 2208, in _gather
raise exception.with_traceback(traceback)
distributed.scheduler.KilledWorker: Attempted to run task (‘mean_chunk-overlap-mean_agg-aggregate-e632584298bee878364f934b6ef3b3cf’, 3, 0, 0) on 3 different workers, but all those workers died while running it. The last worker that attempt to run the task was tcp://127.0.0.1:37249. Inspecting worker logs is often a good next step to diagnose what went wrong. For more information see Why did my worker die? — Dask.distributed 2023.5.1 documentation.
2023-06-08 10:12:09,856 - distributed.nanny - WARNING - Restarting worker

Second method: don’t use Dask, and simply save to netCDF. A 23 GB file gets saved that is a completely NaN array. So I wonder: is there any xarray parameter related to NaN that I need to pass when saving the data? If I plot the anomalies on the fly (or animate them) they look fine, but when I save the anomaly file (if successful; sometimes it goes out of memory), it writes out NaN for all values. Again, when I calculate the mean and the anomaly with pandas I specify skipna=True and the plots look right, but something goes wrong with writing the output to netCDF.

Any help is appreciated!
Atousa

Hi Atousa,

You’re a little light on details about the system you’re trying to run this on, but from what you’ve written above, it seems you’re trying to create a LocalCluster (by just calling Client(...)) with 20 workers, each allocated 6 GB. Does the local system have > 120 GB of RAM? I’m going to guess not…

I don’t see anywhere in the above where you set your chunk sizes, unless I missed it. In general, very broadly: if your chunks are too small, your scheduler will have a lot of work to do up front before the workers even get going. You’ll notice a process kick off and system memory use climb and climb; if that runs out of memory, your workers won’t even get started. If your chunks are too large, the scheduler has an easier time of it, but you may exhaust the resources allocated to the workers, workers will die, and you won’t get anywhere fast either. (See the sketch below for a quick way to check what your chunks actually look like.)
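
For instance, a quick way to see what you’re actually asking the cluster to chew on (just a sketch, assuming a dataset ds opened via open_mfdataset with a variable called Salt, as in your example):

# Inspect the dask array backing the variable before triggering any computation
salt = ds.Salt.data
print(salt.chunksize)                                   # shape of a single chunk
print(salt.npartitions)                                 # total number of chunks/tasks
print(salt.nbytes / salt.npartitions / 1e6, "MB per chunk (uncompressed)")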

I may have underestimated your available resources, but do check the available RAM. I’d suggest starting with fewer workers (less ambitious than 20!) so that the total worker memory fits comfortably within RAM. Remember your OS and other running processes will want some too! Watch your Dask dashboard, and check the size of your chunks; you may want to adjust them.

In short: you want a good number of chunks to fit in a worker, and you want the total memory of all workers to fit comfortably in RAM. You need to find a balance between lots and lots of little tasks, which give the scheduler a headache and may cause things to fail before they even get started, and fewer larger tasks, which consume too much memory in each worker. Depending on the nature of your workload, you may also want to try just one thread per worker. I’ve found all this a bit of a black art in the past, but it helps to watch memory use as the scheduler kicks off and then watch the dashboard for workers being overtaxed.
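
For example, a more conservative starting point than 20 workers × 2 threads might look like this (just a sketch; the worker count and memory limit are placeholders you’d tune to your node):

from dask.distributed import Client

# Fewer workers, one thread each; total worker memory (4 x 16 GB = 64 GB here)
# should sit comfortably below the node's RAM once the OS and other processes
# are accounted for.
client = Client(n_workers=4, threads_per_worker=1, memory_limit='16GB')
print(client.dashboard_link)   # open this in a browser and watch worker memory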

I had a similar issue with a different problem (see here). By monitoring the Dask dashboard I could see that workers were consuming more memory than I had stated for the memory limit.

As a test you could try reducing the memory limit and monitoring the dashboard to see how much memory is actually being consumed, as Guy suggested.
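
For instance, a rough spot-check from the client side (a sketch; it assumes psutil is importable on the workers, which it normally is since distributed depends on it):

import psutil

# Ask every worker to report its resident memory, in GiB
rss = client.run(lambda: psutil.Process().memory_info().rss / 2**30)
print(rss)   # dict keyed by worker address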

Hi Guy and Chase, thanks for sharing your thoughts and comments.
To answer your question, I use NASA Ames Rome nodes with 512 GB of memory! How to Get More Memory for your PBS Job - HECC Knowledge Base (not sure if this website will load for you). So I don’t think I’m short on memory, but I’m not sure how to use it properly. I tried using the Dask dashboard, but could not yet decipher what’s happening. Below, I’ve shared a few lines of code so you get a better idea of what I’m doing, along with some errors. In terms of chunking, I found that automatic chunking (in the time dimension) was the fastest, so I’m not sure what else I could do. I also tried saving to Zarr, and that didn’t solve the problem either.

client = Client(n_workers=20, threads_per_worker=2, memory_limit='6GB')

ds = xr.open_mfdataset(df_files.path.values.tolist(), combine='nested', concat_dim='date', engine='netcdf4')

ds_runningmean = ds.rolling(date=30, center=True).construct('roll').mean('roll', skipna=True)
ds_anomaly = ds - ds_runningmean

ds_anomaly.to_netcdf('anom30d.nc')

%%% errors %%%%%%%%%
response = await comm.read(deserializers=serializers)
File “/home3/asaberi1/home_pyenvs/py_ana2/lib/python3.9/site-packages/distributed/comm/tcp.py”, line 241, in read
convert_stream_closed_error(self, e)
File “/home3/asaberi1/home_pyenvs/py_ana2/lib/python3.9/site-packages/distributed/comm/tcp.py”, line 142, in convert_stream_closed_error
raise CommClosedError(f"in {obj}: {exc.__class__.__name__}: {exc}") from exc
distributed.comm.core.CommClosedError: in <TCP (closed) local=tcp://127.0.0.1:35697 remote=tcp://127.0.0.1:49720>: ConnectionResetError: [Errno 104] Connection reset by peer
2023-06-13 06:46:32,674 - distributed.worker.memory - WARNING - Worker is at 87% memory usage. Pausing worker. Process memory: 4.90 GiB – Worker memory limit: 5.59 GiB
2023-06-13 06:46:33,189 - distributed.worker.memory - WARNING - Worker is at 77% memory usage. Resuming worker. Process memory: 4.33 GiB – Worker memory limit: 5.59 GiB
2023-06-13 06:46:36,595 - distributed.worker.memory - WARNING - Worker is at 88% memory usage. Pausing worker. Process memory: 4.94 GiB – Worker memory limit: 5.59 GiB
2023-06-13 06:46:36,680 - distributed.nanny.memory - WARNING - Worker tcp://127.0.0.1:35697 (pid=18103) exceeded 95% memory budget. Restarting…
2023-06-13 06:46:37,041 - distributed.worker.memory - WARNING - Worker is at 77% memory usage. Resuming worker. Process memory: 4.31 GiB – Worker memory limit: 5.59 GiB
2023-06-13 06:46:37,848 - distributed.nanny - WARNING - Restarting worker
2023-06-13 06:46:38,040 - distributed.worker - ERROR - failed during get data with tcp://127.0.0.1:44577 → tcp://127.0.0.1:35697
Traceback (most recent call last):
File “/home3/asaberi1/home_pyenvs/py_ana2/lib/python3.9/site-packages/distributed/comm/tcp.py”, line 317, in write
raise StreamClosedError()
tornado.iostream.StreamClosedError: Stream is closed

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
File “/home3/asaberi1/home_pyenvs/py_ana2/lib/python3.9/site-packages/distributed/worker.py”, line 1761, in get_data
compressed = await comm.write(msg, serializers=serializers)
File “/home3/asaberi1/home_pyenvs/py_ana2/lib/python3.9/site-packages/distributed/comm/tcp.py”, line 328, in write
convert_stream_closed_error(self, e)
File “/home3/asaberi1/home_pyenvs/py_ana2/lib/python3.9/site-packages/distributed/comm/tcp.py”, line 144, in convert_stream_closed_error
raise CommClosedError(f"in {obj}: {exc}") from exc
distributed.comm.core.CommClosedError: in <TCP (closed) local=tcp://127.0.0.1:44577 remote=tcp://127.0.0.1:51530>: Stream is closed
2023-06-13 06:46:38,726 - distributed.worker.memory - WARNING - Worker is at 87% memory usage. Pausing worker. Process memory: 4.91 GiB – Worker memory limit: 5.59 GiB
2023-06-13 06:46:38,779 - distributed.nanny.memory - WARNING - Worker tcp://127.0.0.1:44993 (pid=18092) exceeded 95% memory budget. Restarting…
2023-06-13 06:46:39,190 - distributed.diskutils - ERROR - Failed to remove '/var/tmp/pbs.16155203.pbspl1.nas.nasa.gov/dask-scratch-space/worker-e204c27v/storage/%28%27concatenate-803fb3ea98e9af4a0e91528d2720a1f5%27%2C%2021%2C%200%2C%200%29#33' (failed in ): [Errno 2] No such file or directory: '%28%27concatenate-803fb3ea98e9af4a0e91528d2720a1f5%27%2C%2021%2C%200%2C%200%29#33'

2023-06-13 06:46:39,210 - distributed.diskutils - ERROR - Failed to remove '/var/tmp/pbs.16155203.pbspl1.nas.nasa.gov/dask-scratch-space/worker-e204c27v/storage/%28%27concatenate-803fb3ea98e9af4a0e91528d2720a1f5%27%2C%20380%2C%200%2C%200%29#22' (failed in ): [Errno 2] No such file or directory: '%28%27concatenate-803fb3ea98e9af4a0e91528d2720a1f5%27%2C%20380%2C%200%2C%200%29#22'
2023-06-13 06:46:39,262 - distributed.nanny - WARNING - Restarting worker
2023-06-13 06:46:39,582 - distributed.nanny.memory - WARNING - Worker tcp://127.0.0.1:37337 (pid=18133) exceeded 95% memory budget. Restarting…
2023-06-13 06:46:39,620 - distributed.worker - ERROR - Worker stream died during communication: tcp://127.0.0.1:37337
Traceback (most recent call last):
File “/home3/asaberi1/home_pyenvs/py_ana2/lib/python3.9/site-packages/tornado/iostream.py”, line 861, in _read_to_buffer
bytes_read = self.read_from_fd(buf)
File “/home3/asaberi1/home_pyenvs/py_ana2/lib/python3.9/site-packages/tornado/iostream.py”, line 1116, in read_from_fd
return self.socket.recv_into(buf, len(buf))
ConnectionResetError: [Errno 104] Connection reset by peer

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
File “/home3/asaberi1/home_pyenvs/py_ana2/lib/python3.9/site-packages/distributed/worker.py”, line 2035, in gather_dep
response = await get_data_from_worker(
File “/home3/asaberi1/home_pyenvs/py_ana2/lib/python3.9/site-packages/distributed/worker.py”, line 2871, in get_data_from_worker
response = await send_recv(
File “/home3/asaberi1/home_pyenvs/py_ana2/lib/python3.9/site-packages/distributed/core.py”, line 1124, in send_recv
response = await comm.read(deserializers=deserializers)
File “/home3/asaberi1/home_pyenvs/py_ana2/lib/python3.9/site-packages/distributed/comm/tcp.py”, line 241, in read
convert_stream_closed_error(self, e)
File “/home3/asaberi1/home_pyenvs/py_ana2/lib/python3.9/site-packages/distributed/comm/tcp.py”, line 142, in convert_stream_closed_error
raise CommClosedError(f"in {obj}: {exc.__class__.__name__}: {exc}") from exc
distributed.comm.core.CommClosedError: in <TCP (closed) Ephemeral Worker->Worker for gather local=tcp://127.0.0.1:40028 remote=tcp://127.0.0.1:37337>: ConnectionResetError: [Errno 104] Connection reset by peer
2023-06-13 06:46:39,620 - distributed.worker - ERROR - Worker stream died during communication: tcp://127.0.0.1:37337
Traceback (most recent call last):
File “/home3/asaberi1/home_pyenvs/py_ana2/lib/python3.9/site-packages/distributed/comm/tcp.py”, line 225, in read
frames_nbytes = await stream.read_bytes(fmt_size)
tornado.iostream.StreamClosedError: Stream is closed

The above exception was the direct cause of the following exception:

Traceback (most recent call last):

One other issue I experienced is similar to Exporting xarray dataset to netcdf results in all nans · Issue #2765 · pydata/xarray · GitHub: even when the save succeeds, a 23 GB file full of NaNs gets written!

I also tried increasing the memory limit:
client = Client(n_workers=20, threads_per_worker=2, memory_limit='15GB')

The last lines of the error are:
Traceback (most recent call last):
File “/home3/asaberi1/home_pyenvs/py_ana2/lib/python3.9/site-packages/distributed/worker.py”, line 2035, in gather_dep
response = await get_data_from_worker(
File “/home3/asaberi1/home_pyenvs/py_ana2/lib/python3.9/site-packages/distributed/worker.py”, line 2871, in get_data_from_worker
response = await send_recv(
File “/home3/asaberi1/home_pyenvs/py_ana2/lib/python3.9/site-packages/distributed/core.py”, line 1124, in send_recv
response = await comm.read(deserializers=deserializers)
File “/home3/asaberi1/home_pyenvs/py_ana2/lib/python3.9/site-packages/distributed/comm/tcp.py”, line 241, in read
convert_stream_closed_error(self, e)
File “/home3/asaberi1/home_pyenvs/py_ana2/lib/python3.9/site-packages/distributed/comm/tcp.py”, line 144, in convert_stream_closed_error
raise CommClosedError(f"in {obj}: {exc}") from exc
distributed.comm.core.CommClosedError: in <TCP (closed) Ephemeral Worker->Worker for gather local=tcp://127.0.0.1:58720 remote=tcp://127.0.0.1:46119>: Stream is closed
2023-06-16 08:15:35,555 - distributed.nanny - WARNING - Worker process still alive after 3.199999389648438 seconds, killing
2023-06-16 08:15:35,556 - distributed.nanny - WARNING - Worker process still alive after 3.199999389648438 seconds, killing
2023-06-16 08:15:35,556 - distributed.nanny - WARNING - Worker process still alive after 3.1999995422363288 seconds, killing
2023-06-16 08:15:35,556 - distributed.nanny - WARNING - Worker process still alive after 3.1999995422363288 seconds, killing
2023-06-16 08:15:35,557 - distributed.nanny - WARNING - Worker process still alive after 3.1999995422363288 seconds, killing
2023-06-16 08:15:35,557 - distributed.nanny - WARNING - Worker process still alive after 3.199999694824219 seconds, killing
2023-06-16 08:15:35,557 - distributed.nanny - WARNING - Worker process still alive after 3.1999995422363288 seconds, killing
2023-06-16 08:15:35,557 - distributed.nanny - WARNING - Worker process still alive after 3.1999995422363288 seconds, killing

This might not be the answer you’re looking for, but an approach that helped us work with large GRIB or NetCDF datasets is to rely on command-line tools such as NCO (https://nco.sourceforge.net/), CDO (Overview - CDO - Project Management Service), and wgrib2 (Climate Prediction Center - wgrib2: grib2 utility) as much as we can, so that we don’t run into memory issues.

Hi Atousa,

Looking at your code where you use xr.open_mfdataset(), I have a few comments:

  1. I’m not too familiar with an open_mfdataset workflow and concatenating multiple files, as it seems you’re doing; I’m just scanning the docs here
  2. This leads me to deduce that date is not a dimension in your input files, and you wish to make it one when stacking all those inputs together, so that you have spatial coordinates plus each input file as a separate date? I don’t know where it will get the actual date from, so I’m guessing it will just be an integer index, and you then (try to) calculate rolling stats over 30 of them
  3. You state you chunk in the time dimension, but I don’t see you do that in the code

So what I see from your code, I think, is that you’re trying to load in a number of files, concatenating them along a new date dimension, and you’re not doing any chunking. In other words, you’re expecting your workers to load everything into memory. It’s no wonder they choke.

When you’re performing operations like this, calculating statistics over time, what you don’t want to do is chunk over time; you want all the time data to be loaded in one go. You do want to chunk over the spatial dimensions, because one worker/task doesn’t need to know what’s happening in adjacent spatial cells.

I don’t know what dimensions you have in your data files, so maybe you do need to combine nested, maybe you don’t. If lon, lat, and date are dimensions in each datafile, you should just be able to combine by coords, I think.

If you chunk over lon and lat but not time, then you should be able to find a chunk size that works well for you. You’re basically saying to each worker: “Hey, take a spatial extent that no other worker needs, but load all the dates for that area in one go, so you can efficiently calculate the rolling date statistics without having to do a bunch of IO/communication to fetch date data.”
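
As a rough sketch of what I mean (the tile size is a placeholder to tune; I’m assuming the variable is Salt and the file list is the same df_files you used):

import xarray as xr

# Chunk spatially only: each chunk is a lat/lon tile
ds = xr.open_mfdataset(
    df_files.path.values.tolist(),
    combine='nested',
    concat_dim='date',
    chunks={'lat': 250, 'lon': 250},   # placeholder tile size
)
# open_mfdataset chunks per file, so merge the per-file date chunks into one
# chunk spanning the whole time axis:
ds = ds.chunk({'date': -1})

print(ds.Salt.data.chunksize)   # e.g. (431, 250, 250) -> roughly 0.2 GB per chunk at float64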

(I’m not entirely sure what’s happening in your subsequent code example, where you increase the memory for each worker. I wonder if you haven’t closed the previously created client, or perhaps you’ve still got open file handles on the data files you were trying to read?)
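
If you are recreating the client with new settings in the same session, closing the old one (and the old dataset) first would rule that out; roughly (a sketch):

client.close()   # shut down the previous LocalCluster and its workers
ds.close()       # release any file handles still held from the earlier open_mfdataset
client = Client(n_workers=20, threads_per_worker=2, memory_limit='15GB')   # then recreate with new settings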


Hi Guy,
Thanks for writing and sharing your thoughts again.

I have 431 files that each have the full grid coordinates and have one distinct time coordinate.
So basically the dimensions are date: 431, lat: 2000, lon: 3499.

I concatenate the files in time.
As you guessed, I add the date as a coordinate (since it does not exist in the data files). I have to add the combine option since I am specifying the dimension along which the files need to be concatenated (according to the open_mfdataset documentation).
I tried the following chunking:
%%time
ds = xr.open_mfdataset(
    df_files.path.values.tolist(),
    combine='nested',
    concat_dim='date',
    chunks={'lat': -1, 'lon': 1, 'date': -1},
)
CPU times: user 3.25 s, sys: 321 ms, total: 3.57 s
Wall time: 5.99 s
ds_runningmean = ds.rolling(date=30, center=True).construct('roll').mean('roll', skipna=True)
ds_anomaly = ds - ds_runningmean

I figured this chunking sped up the calculation.
However, as soon as I try to plot a snapshot of the anomaly field:
#test plot
fig, ax = plt.subplots()
ctr=ds_anomaly.Salt[0,:,:].plot.contourf(ax=ax,cmap=cmocean.cm.delta,levels=np.arange(-0.7,0.7,0.05))
fig.colorbar(ctr)
fig.show()

I get the following errors. It eventually creates a plot, but it’s slow and inefficient.

2023-06-21 13:37:54,190 - distributed.utils_perf - WARNING - full garbage collections took 10% CPU time recently (threshold: 10%)
2023-06-21 13:37:58,421 - distributed.utils_perf - WARNING - full garbage collections took 10% CPU time recently (threshold: 10%)
2023-06-21 13:38:02,946 - distributed.utils_perf - WARNING - full garbage collections took 10% CPU time recently (threshold: 10%)
2023-06-21 13:38:09,048 - distributed.utils_perf - WARNING - full garbage collections took 10% CPU time recently (threshold: 10%)
2023-06-21 13:38:12,946 - distributed.utils_perf - WARNING - full garbage collections took 10% CPU time recently (threshold: 10%)
2023-06-21 13:39:51,420 - distributed.utils_perf - WARNING - full garbage collections took 10% CPU time recently (threshold: 10%)
Exception ignored in: <function CachingFileManager.__del__ at 0x5373ce0>
Traceback (most recent call last):
File "/home3/asaberi1/home_pyenvs/py_ana2/lib/python3.9/site-packages/xarray/backends/file_manager.py", line 249, in __del__
self.close(needs_lock=False)
File “/home3/asaberi1/home_pyenvs/py_ana2/lib/python3.9/site-packages/xarray/backends/file_manager.py”, line 233, in close
file.close()
File “src/netCDF4/_netCDF4.pyx”, line 2622, in netCDF4._netCDF4.Dataset.close
File “src/netCDF4/_netCDF4.pyx”, line 2585, in netCDF4._netCDF4.Dataset._close
File “src/netCDF4/_netCDF4.pyx”, line 2029, in netCDF4._netCDF4._ensure_nc_success
RuntimeError: NetCDF: Not a valid ID
Exception ignored in: <function CachingFileManager.__del__ at 0x51bd750>
Traceback (most recent call last):
File "/home3/asaberi1/home_pyenvs/py_ana2/lib/python3.9/site-packages/xarray/backends/file_manager.py", line 249, in __del__
Exception ignored in: <function CachingFileManager.__del__ at 0x59441c0>
Traceback (most recent call last):
File "/home3/asaberi1/home_pyenvs/py_ana2/lib/python3.9/site-packages/xarray/backends/file_manager.py", line 249, in __del__
self.close(needs_lock=False)
File "/home3/asaberi1/home_pyenvs/py_ana2/lib/python3.9/site-packages/xarray/backends/file_manager.py", line 233, in close
Exception ignored in: <function CachingFileManager.__del__ at 0x60b3da0>
Traceback (most recent call last):
File "/home3/asaberi1/home_pyenvs/py_ana2/lib/python3.9/site-packages/xarray/backends/file_manager.py", line 249, in __del__
file.close()
File “src/netCDF4/_netCDF4.pyx”, line 2622, in netCDF4._netCDF4.Dataset.close
File “src/netCDF4/_netCDF4.pyx”, line 2585, in netCDF4._netCDF4.Dataset._close
self.close(needs_lock=False)
File “/home3/asaberi1/home_pyenvs/py_ana2/lib/python3.9/site-packages/xarray/backends/file_manager.py”, line 233, in close
self.close(needs_lock=False)
File “/home3/asaberi1/home_pyenvs/py_ana2/lib/python3.9/site-packages/xarray/backends/file_manager.py”, line 233, in close
File “src/netCDF4/_netCDF4.pyx”, line 2029, in netCDF4._netCDF4._ensure_nc_success
RuntimeError: NetCDF: Not a valid ID
file.close()
File “src/netCDF4/_netCDF4.pyx”, line 2622, in netCDF4._netCDF4.Dataset.close
file.close()
File “src/netCDF4/_netCDF4.pyx”, line 2622, in netCDF4._netCDF4.Dataset.close
File “src/netCDF4/_netCDF4.pyx”, line 2585, in netCDF4._netCDF4.Dataset._close
File “src/netCDF4/_netCDF4.pyx”, line 2029, in netCDF4._netCDF4._ensure_nc_success
File “src/netCDF4/_netCDF4.pyx”, line 2585, in netCDF4._netCDF4.Dataset._close
RuntimeError: NetCDF: Not a valid ID
File “src/netCDF4/_netCDF4.pyx”, line 2029, in netCDF4._netCDF4._ensure_nc_success
RuntimeError: NetCDF: Not a valid ID
Exception ignored in: <function CachingFileManager.__del__ at 0x544b3b0>
Traceback (most recent call last):
File "/home3/asaberi1/home_pyenvs/py_ana2/lib/python3.9/site-packages/xarray/backends/file_manager.py", line 249, in __del__
self.close(needs_lock=False)
File “/home3/asaberi1/home_pyenvs/py_ana2/lib/python3.9/site-packages/xarray/backends/file_manager.py”, line 233, in close
file.close()
File “src/netCDF4/_netCDF4.pyx”, line 2622, in netCDF4._netCDF4.Dataset.close
File “src/netCDF4/_netCDF4.pyx”, line 2585, in netCDF4._netCDF4.Dataset._close
File “src/netCDF4/_netCDF4.pyx”, line 2029, in netCDF4._netCDF4._ensure_nc_success
RuntimeError: NetCDF: Not a valid ID

If I try to generate several images in a time loop, I get these errors and it stops after a couple of images are generated! I don’t really understand what “RuntimeError: NetCDF: Not a valid ID” means. Do you have any clue?