Hi Pangeo support team,
I’ve been trying to analyze petabytes of model output. For now I’m working with a small portion of it (on the order of hundreds of GB, or ~1 TB). If I do calculations on the fly for plotting, I have no issues. But if I want to evaluate a calculation and save it to netCDF, the compute nodes (NASA NAS Rome nodes) run out of memory!
Here is a brief description of the sample dataset:
I have 431 files (one per day) of surface salinity, each with size lat × lon = 2000 × 3499. I use xarray and dask to read the data and perform a very simple task: calculating the running mean. Printing a single value of the output takes ~30 s:
%%time
ds_anomaly.isel(lat=0, lon=0, date=0).Salt.values
CPU times: user 23.1 s, sys: 6.59 s, total: 29.7 s
Wall time: 29.7 s
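For scale, here is a back-of-the-envelope estimate of the full array size. It assumes the grid is 2000 × 3499 (lat × lon) and that the values are either float32 or float64; I have not checked the dtype on disk, so both are shown. `array_size_bytes` is just an illustrative helper, not part of my script.

```python
# Rough in-memory size of the salinity array.
# Assumptions: 431 days, a 2000 x 3499 grid, and a float32 or float64 dtype.
N_DAYS = 431
N_LAT, N_LON = 2000, 3499
ITEMSIZE = {"float32": 4, "float64": 8}  # bytes per value


def array_size_bytes(n_days, n_lat, n_lon, itemsize):
    """Number of bytes needed to hold the dense (date, lat, lon) array."""
    return n_days * n_lat * n_lon * itemsize


for dtype, itemsize in ITEMSIZE.items():
    size = array_size_bytes(N_DAYS, N_LAT, N_LON, itemsize)
    print(f"{dtype}: {size / 1e9:.1f} GB ({size / 2**30:.1f} GiB)")
```

The float64 case comes to about 24 GB, which is close to the size of the netCDF file that gets written in the second method described further down.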
1. First method:
If I try to save the result to netCDF, or call ds_anomaly.Salt.values, it runs out of memory.
I tried to use dask client with:
from dask.distributed import Client
client = Client(n_workers=20, threads_per_worker=2, memory_limit='6GB')
But I got errors such as the following:
minor: Can't open object
#004: H5Aint.c line 545 in H5A__open_by_name(): unable to load attribute info from object header
major: Attribute
minor: Unable to initialize object
#005: H5Oattribute.c line 494 in H5O__attr_open_by_name(): can't locate attribute: '_QuantizeBitRoundNumberOfSignificantBits'
major: Attribute
minor: Object not found
Traceback (most recent call last):
File “/home3/asaberi1/scripts/analyzing_DYOMOND/ComputeMJO_Anom_ondailymean_daskclient.py”, line 53, in <module>
ar_anomaly=ds_anomaly.Salt.values
File “/home3/asaberi1/home_pyenvs/py_analysis/lib/python3.9/site-packages/xarray/core/dataarray.py”, line 738, in values
return self.variable.values
File “/home3/asaberi1/home_pyenvs/py_analysis/lib/python3.9/site-packages/xarray/core/variable.py”, line 607, in values
return _as_array_or_item(self._data)
File “/home3/asaberi1/home_pyenvs/py_analysis/lib/python3.9/site-packages/xarray/core/variable.py”, line 313, in _as_array_or_item
data = np.asarray(data)
File “/home3/asaberi1/home_pyenvs/py_analysis/lib/python3.9/site-packages/dask/array/core.py”, line 1700, in array
x = self.compute()
File “/home3/asaberi1/home_pyenvs/py_analysis/lib/python3.9/site-packages/dask/base.py”, line 314, in compute
(result,) = compute(self, traverse=False, **kwargs)
File “/home3/asaberi1/home_pyenvs/py_analysis/lib/python3.9/site-packages/dask/base.py”, line 599, in compute
results = schedule(dsk, keys, **kwargs)
File “/home3/asaberi1/home_pyenvs/py_analysis/lib/python3.9/site-packages/distributed/client.py”, line 3186, in get
results = self.gather(packed, asynchronous=asynchronous, direct=direct)
File “/home3/asaberi1/home_pyenvs/py_analysis/lib/python3.9/site-packages/distributed/client.py”, line 2345, in gather
return self.sync(
File “/home3/asaberi1/home_pyenvs/py_analysis/lib/python3.9/site-packages/distributed/utils.py”, line 349, in sync
return sync(
File “/home3/asaberi1/home_pyenvs/py_analysis/lib/python3.9/site-packages/distributed/utils.py”, line 416, in sync
raise exc.with_traceback(tb)
File “/home3/asaberi1/home_pyenvs/py_analysis/lib/python3.9/site-packages/distributed/utils.py”, line 389, in f
result = yield future
File “/home3/asaberi1/home_pyenvs/py_analysis/lib/python3.9/site-packages/tornado/gen.py”, line 769, in run
value = future.result()
File “/home3/asaberi1/home_pyenvs/py_analysis/lib/python3.9/site-packages/distributed/client.py”, line 2208, in _gather
raise exception.with_traceback(traceback)
distributed.scheduler.KilledWorker: Attempted to run task ('mean_chunk-overlap-mean_agg-aggregate-e632584298bee878364f934b6ef3b3cf', 3, 0, 0) on 3 different workers, but all those workers died while running it. The last worker that attempt to run the task was tcp://127.0.0.1:37249. Inspecting worker logs is often a good next step to diagnose what went wrong. For more information see "Why did my worker die?" in the Dask.distributed 2023.5.1 documentation.
2023-06-08 10:12:09,856 - distributed.nanny - WARNING - Restarting worker
2. Second method: not use dask, and simply save to netCDF. A 23 GB file gets saved, but it is an entirely NaN array. So I wonder: is there any xarray parameter related to NaN that I need to pass when saving the data? If I plot the anomalies on the fly (or animate them), they look fine, but when I save the anomaly file (if it succeeds; sometimes it runs out of memory), it writes NaN for all values. Again, when I calculate the mean and the anomaly with pandas I specify skipna=True and the plots look right, but something goes wrong when writing the output to netCDF.
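To make the NaN question concrete, here is a small plain-Python sketch of how missing values can spread through a centered running mean. This is not my actual xarray code: `running_mean` and its `min_periods` argument are hypothetical stand-ins, written on the assumption that a window is only given a value when it holds at least `min_periods` valid points (as far as I understand, xarray's rolling has a similar `min_periods` option that defaults to the full window size). I don't know whether this is what happens in my case.

```python
import math


def running_mean(values, window, min_periods=None):
    """Centered running mean over a list of floats (hypothetical helper).

    NaNs are skipped inside each window, but the result is NaN whenever
    fewer than `min_periods` valid values fall inside the window.
    """
    if min_periods is None:
        min_periods = window  # strict: require a full window of valid values
    half = window // 2
    out = []
    for i in range(len(values)):
        # Positions that fall off either end of the series are ignored.
        win = [values[j] for j in range(i - half, i + half + 1)
               if 0 <= j < len(values)]
        valid = [v for v in win if not math.isnan(v)]
        if len(valid) < min_periods:
            out.append(math.nan)
        else:
            out.append(sum(valid) / len(valid))
    return out


series = [1.0, 2.0, math.nan, 4.0, 5.0, 6.0, 7.0]

strict = running_mean(series, window=3)                  # full window required
lenient = running_mean(series, window=3, min_periods=1)  # any valid value is enough

print(strict)
print(lenient)
```

With the strict setting, a single NaN in the input blanks out every window that touches it, plus the two edge points; with `min_periods=1`, every position gets a value. If the saved anomalies are all NaN, comparing these two behaviours on a small subset might help narrow down where the NaNs come from.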
Any help is appreciated!
Atousa