Hello
I’m pretty new to Dask and Xarray and want to use them to work with an 80 GB dataset. My machine has a single AMD Threadripper CPU with 32 cores and 128 GB of RAM. The dataset contains measurements of n HG impulses that I want to project onto the first two Hermite-Gauss eigenfunctions.
While the dataset itself fits into memory, I need to calculate multiple projections of it: the chunk size gets multiplied by 2 (two eigenfunctions), by 11 (displacements in time), and by 11 (displacements in phase). Because of that, each chunk grows 242 times and Dask runs out of memory. Here are the metadata of both DataArrays:
And the code I use for computation:
import numpy as np
import xarray as xr

# ds (the impulses), ef (the eigenfunctions), COMPLEX_DIM, DT and
# projections_cache_path are defined earlier in the script.


def cache_projections(cache_path=projections_cache_path):
    # signs inverted due to conjugation
    products = xr.concat(
        [
            ds.sel(ReIm="real") * ef.sel(ReIm="real")
            + ds.sel(ReIm="imag") * ef.sel(ReIm="imag"),
            ds.sel(ReIm="real") * ef.sel(ReIm="imag")
            - ds.sel(ReIm="imag") * ef.sel(ReIm="real"),
        ],
        dim=COMPLEX_DIM,
    )
    # simple integration
    projections = products.sum(dim="time") * DT
    # rechunk n as I need to bootstrap the impulses later
    projections = projections.chunk(
        {
            **dict(zip(projections.dims, np.repeat("auto", len(projections.dims)))),
            "n": -1,
        }
    )
    print("Calculating projections")
    projections.to_dataset(name="sig").to_zarr(cache_path, mode="w")
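For completeness, this is roughly how I intend to use the cached result later for the bootstrap over n; the call pattern below is just a sketch of my intent, not code from the script above:

```python
# Write the projections to the Zarr cache (sketch of intended usage).
cache_projections()

# Later: reopen lazily for bootstrapping. Because "n" was rechunked to a
# single chunk above, resampling along it stays within one chunk.
projections = xr.open_zarr(projections_cache_path)["sig"]
```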
The calculations are really slow and don’t fully utilize the CPU. I need advice on how to speed this up, as I believe the problem may be wrong chunking. To avoid running out of memory, I rechunked my original Zarr array to 1 MB chunks instead of 128 MB. I also tried increasing the number of workers in my LocalCluster to 2 and rechunking ef along the eigenfunction dimension. That gave a minor speedup, but the computation still takes over 20 hours to finish.
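For reference, this is roughly my current cluster and rechunking setup; the memory limit, threads per worker, and the dimension name "eigenfunction" are placeholders rather than the exact values from my script:

```python
from dask.distributed import Client, LocalCluster

# Sketch of the current setup (memory limit and thread count are illustrative).
cluster = LocalCluster(
    n_workers=2,            # increased from the single worker I started with
    threads_per_worker=16,  # the 32 cores split between the two workers
    memory_limit="60GB",    # roughly half of the 128 GB of RAM per worker
)
client = Client(cluster)

# Rechunk the eigenfunctions so each one is its own chunk;
# "eigenfunction" stands in for whatever the dimension is actually called.
ef = ef.chunk({"eigenfunction": 1})
```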
Is there something I’m doing wrong or that can be improved?