November 17, 2021: flox: Fast & furious GroupBy reductions with Dask at Pangeo-scale


Pangeo Showcase talk by Deepak Cherian, NCAR


I am a scientist studying ocean physics at the National Center for Atmospheric Research. I help with building and maintaining xarray; and contribute to the wider Pangeo ecosystem of software tools (such as xgcm and cf_xarray).


The “groupby” or the “split-apply-combine” paradigm is ubiquitous in scientific analysis, though it may be named differently e.g. “binning”, “histogramming”, “resampling”, “compositing”, or “climatology reductions”. Xarray implements the groupby paradigm through a “GroupBy” object. Historically the underlying algorithm is not dask-aware, and tends to fail disastrously with large Pangeo-scale distributed workflows. Here I present “flox”: a new package that explores effective strategies for groupby reductions at scale with dask. Ongoing work will plug this package in to xarray in a backwards-compatible manner, allowing the community to seamlessly benefit from significantly more efficient groupby computations.See for more.