Hello all, I have a suite of about 300 zarr stores totaling about 40 TB that I'm looking for an efficient way to process in parallel. The stores are structured as [ensemble, time, lat, lon], where I've added the ensemble dimension to each zarr so they combine easily with xr.open_mfdataset(). This works well and gives you access to a very large, well-structured dataset, but it only seems to work for slicing and dicing and simple operations. If you try something complex, such as groupby operations, calculating percentiles, or nearly anything that uses the xClim package, it seems to overwhelm Dask. Specifically, the Dask graph gets incredibly complicated, to the point that it either takes forever to resolve or simply crashes.

I am on an HPC, so I've been able to work around this with a SLURM job array that runs a Python script, with each instance of the script handling only one or two zarrs at a time (a sketch of this pattern is below). This works amazingly well since you can leverage the HPC resources in tandem, but I wonder if there is a similar approach using dask_jobqueue.SLURMCluster that could be driven from a Jupyter notebook. Perhaps the suite of zarrs could be processed in parallel using a Dask Bag or delayed functions. Since it's on an HPC, I'd want to parallelize as much as possible, but I wonder whether processing large zarr stores in xarray inside a Dask Bag lands you right back in the same boat: very complicated Dask graphs. I don't know if there is a Dask way to wall off each item in the bag (each of which is a bunch of xr.DataArray/Dask array processing), or whether it all ultimately ends up in one master Dask graph before any processing begins.
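For context, my current working approach is roughly the script below, launched as a SLURM array job so each array task handles exactly one zarr. The paths, the variable name ("tas"), and the annual-mean computation are placeholders standing in for my actual xClim metrics:

```python
import glob
import os

import xarray as xr

# Each SLURM array task picks one store by its array index
paths = sorted(glob.glob("/scratch/ensemble_*.zarr"))  # hypothetical location
path = paths[int(os.environ["SLURM_ARRAY_TASK_ID"])]

ds = xr.open_zarr(path)
# Stand-in for the real xClim metric on this ensemble member
result = ds["tas"].groupby("time.year").mean("time").compute()
result.to_netcdf(path.replace(".zarr", "_annual.nc"))
```

Because each script instance only ever sees one zarr, the Dask graph it builds stays small, which is why this scales so well.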
I'm sure the big data folks have run into this before, so I was wondering if there is any best-practice advice. Ideally, I'd like to dedicate 32-64 CPUs to each zarr to calculate xClim metrics and then kick off a bunch of SLURM jobs (either via dask_jobqueue.SLURMCluster or SLURM scripts from the command line). Using a SLURM job array from the command line isolates the processing of each zarr, so I could continue with that, but I was wondering if there is an efficient, user-friendly way from Jupyter that might be easier for novices to use.
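To frame the question, here is the kind of Jupyter-driven pattern I have in mind: a minimal sketch assuming dask-jobqueue, where each submitted task opens one zarr and computes with the worker's local threaded scheduler, so each zarr's graph is walled off from the distributed scheduler. The resource numbers, variable name, metric, and output step are all placeholders, and I may be misreading how job_cpu interacts with cores here:

```python
import glob

import xarray as xr
from dask.distributed import Client
from dask_jobqueue import SLURMCluster


def process_one_zarr(path):
    """Process a single zarr in isolation on one worker."""
    ds = xr.open_zarr(path)
    # Stand-in for the real xClim metric on this ensemble member
    result = ds["tas"].groupby("time.year").mean("time")
    # Keep this zarr's graph local to the worker: compute with its own
    # threads instead of shipping tasks back to the distributed scheduler
    result = result.compute(scheduler="threads")
    result.to_netcdf(path.replace(".zarr", "_annual.nc"))
    return path


cluster = SLURMCluster(
    cores=1,           # one task slot per job, so one zarr per job
    processes=1,
    job_cpu=32,        # but book 32 CPUs for the threaded compute inside the task
    memory="128GB",
    walltime="04:00:00",
)
cluster.scale(jobs=10)  # ten zarrs in flight at a time
client = Client(cluster)

zarr_paths = sorted(glob.glob("/scratch/ensemble_*.zarr"))  # the ~300 stores
futures = client.map(process_one_zarr, zarr_paths)
client.gather(futures)
```

Is something like this a reasonable way to get the job-array-style isolation from a notebook, or is there a more idiomatic Dask approach?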
Thanks