Flox Groupby vs xhistogram

Ankur_Mahesh · August 2, 2023, 1:04am

The new flox-backed groupby operations in xarray are extremely useful for a wide variety of calculations! Thank you very much for making this tool available.

Flox + groupby can be used for a variety of different calculations. Specifically for calculating histograms along certain dimensions, is xhistogram still the recommended way to perform the calculation? I compared them in this gist: Comparing xhistogram vs flox · GitHub

Is there a reason that one method may scale better than the other on datasets that are even larger than the air temperature tutorial dataset?

TomNicholas · August 2, 2023, 5:03am

Welcome Ankur! Great question. For a detailed discussion of flox vs xhistogram see this xarray issue comment.

Is there a reason that one method may scale better than the other on datasets that are even larger than the air temperature tutorial dataset?

In that thread I think we tentatively concluded that the fact that flox uses pandas.cut whereas xhistogram uses the faster np.digitize is responsible for an absolute speed difference of 2-4x, but that this is fixable. Both xhistogram and flox should scale well to very large datasets. Flox is a more general tool, and it may be easier to express certain histogram operations using flox than using xhistogram (e.g. ).
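To make the pandas.cut vs np.digitize point concrete, here is a minimal sketch (not taken from either library's source) that bins the same values both ways and checks that the resulting counts agree. The array `values`, the bin `edges`, and the choice of `right=False` for `pandas.cut` are assumptions made for illustration, so that both calls use half-open bins `[a, b)`.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
values = rng.uniform(0.0, 10.0, size=1_000)  # illustrative data in [0, 10)
edges = np.linspace(0.0, 10.0, 11)           # 10 equal-width bins
nbins = len(edges) - 1

# np.digitize returns 1-based indices for in-range values; shift to 0-based.
idx_digitize = np.digitize(values, edges) - 1

# pandas.cut with labels=False returns 0-based bin codes;
# right=False makes the bins half-open [a, b), matching np.digitize.
idx_cut = np.asarray(pd.cut(values, edges, right=False, labels=False)).astype(np.intp)

# Turn bin indices into counts per bin.
counts_digitize = np.bincount(idx_digitize, minlength=nbins)
counts_cut = np.bincount(idx_cut, minlength=nbins)

# Reference result from numpy's own histogram routine.
counts_hist, _ = np.histogram(values, bins=edges)
```

All three count arrays should be identical for in-range data like this; the two binning calls differ in speed and in how they report out-of-range values, not in the bins they assign.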

is xhistogram still the recommended way to perform the calculation?

This problem of computing general multidimensional histograms on xarray objects at scale is sitting around waiting for someone keen to pick it up. Currently xhistogram is effective but not very well-maintained, and was written before @dcherian wrote flox. I personally would love to see a general implementation of histogramming added to xarray, similar to how it supports groupby/resample etc. Basically someone would ideally close xarray #4610 by following an API like that proposed in xarray #5400 but using flox for the implementation. I started looking into it, but I wish I could find the time to see it through.

dcherian · August 2, 2023, 7:06pm

I’m not sure about the dask results: this problem is definitely too small for dask, and those chunk sizes are really tiny. It’s possible that flox builds a more convoluted dask graph that takes longer to optimize.

For pure numpy, they are very close now that @Illviljan migrated flox to use digitize.
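For readers wondering what a digitize-based histogram along one dimension looks like in pure numpy, here is a rough sketch. It is not the implementation used by flox or xhistogram; the helper name `histogram_along_axis` and the example array shape are hypothetical and chosen only for illustration.

```python
import numpy as np

def histogram_along_axis(data, edges, axis):
    """Count values per bin along `axis`, keeping all other axes."""
    nbins = len(edges) - 1
    moved = np.moveaxis(data, axis, -1)         # put the reduced axis last
    flat = moved.reshape(-1, moved.shape[-1])   # shape (n_rows, n_samples)
    idx = np.digitize(flat, edges) - 1          # 0-based index, bins are [a, b)
    valid = (idx >= 0) & (idx < nbins)          # drop out-of-range values and NaN
    # Give every row its own block of `nbins` slots so one bincount handles all rows.
    rows = np.broadcast_to(np.arange(flat.shape[0])[:, None], flat.shape)
    combined = rows[valid] * nbins + idx[valid]
    counts = np.bincount(combined, minlength=flat.shape[0] * nbins)
    return counts.reshape(moved.shape[:-1] + (nbins,))

rng = np.random.default_rng(42)
data = rng.normal(size=(50, 4, 3))  # e.g. (time, lat, lon), purely illustrative
edges = np.linspace(-3.0, 3.0, 7)   # 6 bins
hist = histogram_along_axis(data, edges, axis=0)  # shape (4, 3, 6)
```

The binning step is a single vectorized `np.digitize` call followed by one `np.bincount`, which is why the choice of binning routine matters for the absolute speed of this kind of calculation.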
