Managing Parallel Dask Graphs

Hi all!

I’ve been working with @darothen and the Brightband team on ExtremeWeatherBench (EWB) for a bit over a year now. We’ve done a lot of great work, and I’ve been able to really dial in parts of the code by drawing on years of discussion and (mostly) solved edge cases across Pangeo and GitHub.

There are a few critical tasks for us as we prepare a paper on the outputs, including a few leftover bugs post-AMS, but one overarching challenge I’m personally trying to tackle for EWB is making the processing infrastructure as robust as possible. By that I mean the EWB backend that goes from datasets → evaluation results should balance memory, compute, i/o, and speed based on the machine it runs on. It’s fun to throw it at a 1.5TB, 196-core GCP behemoth, but I also want it to be as efficient as possible on a MacBook Air, HPC, etc.

With that, we’ve used joblib as the main interface for parallelizing across distinct cases, with some numba tooling in the background for derivations like MLCAPE and IVT. joblib is wonderful, but there are some limitations we’ve had to build around (e.g. joblib’s workers do not release memory between delayed calls, in an attempt to be more efficient).
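For anyone unfamiliar with the pattern, here’s a minimal sketch of what per-case fan-out with joblib looks like. `evaluate_case` is a hypothetical stand-in, not the actual EWB function, and I’m using the threading backend here just so the snippet is self-contained; in practice the default process-based `loky` backend is where the memory-retention caveat above bites, since reused worker processes may hold on to memory allocated inside a task.

```python
from joblib import Parallel, delayed

def evaluate_case(case_id):
    # stand-in for one case's subset/align/score pipeline
    return {"case": case_id, "rmse": float(case_id) * 0.1}

# Fan out over independent cases; n_jobs bounds concurrency.
# With the default "loky" backend, workers are reused across tasks,
# so per-task memory may not be returned to the OS until shutdown.
results = Parallel(n_jobs=2, backend="threading")(
    delayed(evaluate_case)(i) for i in range(4)
)
```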

Our workflow is essentially as follows:

  1. A user has a forecast dataset they can load, ideally a zarr, either locally or on a cloud store
  2. They choose a target, such as GHCNh data
  3. They use our events metadata, which includes the time and location for each case
  4. They then choose which metrics they want to evaluate with
  5. Finally, they run the evaluation and get a resulting table of metrics broken down by initialization time, lead time, forecast source, target source, etc.
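The steps above can be sketched roughly as follows. Everything here is illustrative (toy data, a toy `rmse`, plain dicts instead of the real EWB objects); it’s just to show the shape of the dataset → target → cases → metrics → table flow.

```python
def rmse(forecast, target):
    # toy metric over paired values
    n = len(forecast)
    return (sum((f - t) ** 2 for f, t in zip(forecast, target)) / n) ** 0.5

forecast = {"model_a": {"case_1": [1.0, 2.0], "case_2": [3.0, 4.0]}}  # step 1
target = {"case_1": [1.1, 1.9], "case_2": [2.5, 4.5]}                 # step 2
cases = ["case_1", "case_2"]                                          # step 3: events metadata
metrics = {"rmse": rmse}                                              # step 4

# step 5: evaluate and flatten into table rows
rows = []
for source, data in forecast.items():
    for case in cases:
        for name, fn in metrics.items():
            rows.append({"source": source, "case": case,
                         "metric": name, "value": fn(data[case], target[case])})
```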

So, the parallelization comes in at the case level: how many cases get processed simultaneously. As you can imagine, increasing the number of joblib jobs rapidly increases memory usage and clogs i/o, especially outside of a data center environment. I’m interested in opinions from the community on whether there’s any nuance or approach here that may be even more efficient. Has anyone else successfully built an approach with independent parallel dask graphs?
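To make the question concrete, here’s one shape this could take, a hedged sketch rather than anything we’ve settled on: each case becomes its own small delayed graph, and batching bounds how many graphs execute at once so memory and i/o stay within what the machine can handle. `evaluate_case` is again a hypothetical stand-in, and I’m using the threaded scheduler so the snippet runs anywhere; a `distributed` client with per-case futures (or a semaphore) would be the heavier-weight variant.

```python
import dask

@dask.delayed
def evaluate_case(case_id):
    # stand-in for one case's subset/align/score pipeline
    return case_id * 2

case_ids = list(range(6))
batch_size = 2  # at most this many case graphs execute concurrently

results = []
for i in range(0, len(case_ids), batch_size):
    batch = [evaluate_case(c) for c in case_ids[i:i + batch_size]]
    # each dask.compute call runs one bounded batch of independent graphs
    results.extend(dask.compute(*batch, scheduler="threads"))
```

The appeal over a single fused graph is that a failed or memory-hungry case only affects its own batch, and `batch_size` becomes the one knob to tune per machine.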