Hi all, I do some heavy processing with the xclim xarray package, but I've always been stumped on using Dask cluster.adapt() to scale up or down based on HPC availability. xclim lets me parallelize over 1000+ CPUs, but I very often want to process a list of climate models. If I use cluster.adapt(), it ramps up the number of workers in the cluster, but once the currently processing model completes, all the workers are released, so you spend most of your time going in and out of the HPC queue. At its core, I'm trying to do something highly parallel (xclim) inside a serial for loop (i.e., processing one climate model at a time). Trying to do everything fully in parallel would likely explode the task graph, so there needs to be some enforced serial processing.
I was wondering if anyone has tips on how to approach this kind of workflow while still taking advantage of cluster.adapt(). Perhaps there is a way to map over a Dask DataFrame with only one partition, so the outer loop stays serial but the workers aren't released between models. A Slurm job array is a great way to process one model at a time with highly parallel xarray tasks, but I was wondering if the same thing could be done with cluster.adapt() so you could stay within a Jupyter notebook. I usually use cluster.scale() for these large jobs, but then you need a sense of how much wallclock time to request. cluster.adapt() would let the job run indefinitely and cycle workers in and out based on HPC availability, which sounds appealing.
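For reference, here is a minimal sketch of the pattern I'm describing. The SLURMCluster resource settings, the model list, and the open_model / process_model functions are placeholders rather than my real code, and it obviously only runs on a system with Slurm and dask-jobqueue installed:

```python
from dask.distributed import Client
from dask_jobqueue import SLURMCluster  # assumes dask-jobqueue is available

# Placeholder resource settings -- adjust for your HPC system.
cluster = SLURMCluster(cores=36, memory="180GB", walltime="01:00:00")

# Let Dask request between 1 and 30 Slurm jobs based on load.
# The problem: between iterations of the loop below the scheduler
# sees no pending tasks, so adapt() scales the cluster back toward
# the minimum and the next model waits in the HPC queue again.
cluster.adapt(minimum=1, maximum=30)
client = Client(cluster)

models = ["CanESM5", "MIROC6", "CESM2"]  # example model names

for model in models:            # enforced serial outer loop
    ds = open_model(model)      # hypothetical: open one model's dataset
    out = process_model(ds)     # hypothetical: the parallel xclim computation
    out.compute()               # blocks until this model finishes; workers
                                # then go idle and get culled by adapt()
```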
Thanks