Thrilled to be part of the Pangeo forum; this is my first post here!
I am using a numerical model which generates HDF-format output. Since I am running an ensemble, each run produces N x M output files (where N is the number of ensemble members and M is the number of time steps at which each member dumps an output file).
As an example, my latest simulation has N = 39 and M = 47 (N x M = 1833 files in total). My aim is to generate one xarray Dataset per ensemble member (i.e., 39 netCDF files), with the freedom to concatenate those 39 files into one mega netCDF file containing all the data.
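To make the intended end product concrete, here is a minimal sketch of the target layout (the `member` dimension name and the dummy data are assumptions, just stand-ins for the real model output):

```python
import numpy as np
import xarray as xr

# Dummy stand-ins for the 39 per-member datasets, each covering 47 time steps.
member_datasets = [
    xr.Dataset(
        {"u": (("time", "x"), np.random.rand(47, 10))},
        coords={"time": np.arange(47)},
    )
    for _ in range(39)
]

# Stack the members along a new "member" dimension, then write the mega file.
ensemble = xr.concat(member_datasets, dim="member")
ensemble.to_netcdf("ensemble_all.nc")
```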
My attempts so far:
I managed to deploy a dask distributed client on my university HPC account and save a netCDF file for each of the 1833 output files, which could subsequently be concatenated to generate one zarr store per ensemble member. While this solves the problem, I believe there must be a better method that AVOIDS the intermediate step of saving a netCDF file for every output file.
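For concreteness, a rough sketch of that working-but-clunky pipeline (the directory layout and the `combine="by_coords"` choice are assumptions; `arps_read_xarray_cf` and `open_kwargs` are my own reader and its options, shown in the snippet further down):

```python
import glob
import xarray as xr

for member in range(39):
    # Placeholder paths; the actual directory layout differs.
    hdf_files = sorted(glob.glob(f"output/member_{member:02d}/*.hdf"))

    # Step 1 (the step I want to eliminate): one intermediate netCDF
    # per raw HDF output file.
    for f in hdf_files:
        arps_read_xarray_cf(f, **open_kwargs).to_netcdf(f + ".nc")

    # Step 2: combine the intermediate netCDFs into one zarr store per member.
    ds = xr.open_mfdataset(f"output/member_{member:02d}/*.nc", combine="by_coords")
    ds.to_zarr(f"member_{member:02d}.zarr")
```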
The second approach closely follows the xarray documentation provided at this link (h/t Deepak Cherian).
Here’s a minimal example (running my Python code from within a Jupyter notebook):

```python
# testing files for only one ensemble member (47 files)
tasks = [dask.delayed(arps_read_xarray_cf)(f, **open_kwargs) for f in files]

# dask performs the computations (the dashboard shows 47 tasks in memory),
# but the ipynb kernel dies
datasets = dask.compute(tasks, scheduler=client)
```
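If the compute succeeded, the follow-on step for this member would be something like the sketch below (concatenating along `time` and the output filename are assumptions):

```python
import dask
import xarray as xr

# dask.compute on a list returns a one-element tuple holding the list of results.
(datasets,) = dask.compute(tasks)

# Concatenate this member's 47 per-timestep datasets into a single Dataset.
member_ds = xr.concat(datasets, dim="time")
member_ds.to_netcdf("member_00.nc")  # placeholder filename
```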
Here is a relevant screenshot:
Since I am a new user, I cannot share more than one media file, so I’ll write down the rest of the story here: in the dask dashboard, I can see that the workers processed all the tasks, so ideally my Jupyter notebook should have returned the xarray datasets into the `datasets` variable. Unfortunately, that doesn’t happen, and the notebook shows a busy status until the kernel dies.
Has anybody faced a similar problem or found a workaround?
`dask.compute(tasks)` with the default local scheduler does indeed return the xarray datasets, but `dask.compute(tasks, scheduler=client)` fails miserably, producing the error messages shown above and eventually crashing the ipynb kernel.
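To summarize the failure mode in code, a minimal sketch (the `Client()` here is a stand-in for my HPC deployment, and I am assuming the failing call is the `scheduler=client` variant from the snippet above):

```python
import dask
from dask.distributed import Client

client = Client()  # hypothetical local cluster standing in for my HPC deployment

tasks = [dask.delayed(arps_read_xarray_cf)(f, **open_kwargs) for f in files]

# Works for me: the default local scheduler returns the datasets.
(datasets,) = dask.compute(tasks)

# Crashes for me: routing the same graph through the distributed client
# kills the notebook kernel after the workers have finished their tasks.
(datasets,) = dask.compute(tasks, scheduler=client)
```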