Flox Groupby vs xhistogram

The new flox-backed groupby operations in xarray are extremely useful for a wide variety of calculations! Thank you very much for making this tool available.

Flox + groupby can be used for many different calculations. For calculating histograms along certain dimensions specifically, is xhistogram still the recommended way to perform the calculation? I compared the two in this gist: Comparing xhistogram vs flox · GitHub

Is there a reason that one method may scale better than the other on datasets that are even larger than the air temperature tutorial dataset?


Welcome Ankur! Great question. For a detailed discussion of flox vs xhistogram see this xarray issue comment.

Is there a reason that one method may scale better than the other on datasets that are even larger than the air temperature tutorial dataset?

In that thread I think we tentatively concluded that flox's use of pandas.cut, where xhistogram uses the faster np.digitize, is responsible for an absolute speed difference of 2-4x, but that this is fixable. Both xhistogram and flox should scale well to very large datasets. Flox is the more general tool, and some histogram operations may be easier to express with flox than with xhistogram (e.g. ).
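To make the "histogram as a groupby" idea concrete, here is a minimal NumPy-only sketch of the two steps involved: label each value with its bin using np.digitize, then count per label along one axis. The array shape, bin edges, and variable names are assumptions made up for illustration; this is not the actual code in either flox or xhistogram.

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(size=(4, 1000))   # hypothetical (lat, time) array
bins = np.linspace(-3.0, 3.0, 13)   # 13 edges -> 12 bins, chosen for illustration
nbins = len(bins) - 1

# Step 1: assign each value a bin label (the "group" it belongs to).
# np.digitize returns 0 for values below bins[0] and len(bins) for values
# at or above bins[-1], so after subtracting 1 the in-range labels are 0..nbins-1.
labels = np.digitize(data, bins) - 1
in_range = (labels >= 0) & (labels < nbins)

# Step 2: count per (row, bin) pair, i.e. a histogram along the last axis.
# Offsetting the labels by row number lets a single bincount handle all rows.
rows = np.broadcast_to(np.arange(data.shape[0])[:, None], data.shape)
flat = (rows * nbins + labels)[in_range]
counts = np.bincount(flat, minlength=data.shape[0] * nbins)
counts = counts.reshape(data.shape[0], nbins)

# Cross-check against np.histogram applied one row at a time.
# (np.histogram also counts values exactly equal to the last edge; that case
# does not occur with this random data.)
expected = np.stack([np.histogram(row, bins=bins)[0] for row in data])
print(np.array_equal(counts, expected))
```

The labelling step is where the choice between pandas.cut and np.digitize matters; the counting step is an ordinary grouped reduction.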

is xhistogram still the recommended way to perform the calculation?

This problem of computing general multidimensional histograms on xarray objects at scale is sitting around waiting for someone keen to pick it up. :wink: Currently xhistogram is effective but not very well-maintained, and was written before @dcherian wrote flox. I personally would love to see a general implementation of histogramming added to xarray, similar to how it supports groupby, resample, etc. Ideally someone would close xarray #4610 by following an API like the one proposed in xarray #5400, but using flox for the implementation. I started looking into it but I wish I could find the time to see it through :sweat_smile:


I’m not sure about the dask results. This problem is definitely too small for dask, and those chunk sizes are really tiny. It’s possible that flox builds a more convoluted dask graph that takes longer to optimize.
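For a rough sense of scale, here is a back-of-the-envelope sketch. It assumes the air temperature tutorial dataset has shape (time: 2920, lat: 25, lon: 53) and is held as float64; the chunking along time is a hypothetical choice, not necessarily the one used in the gist.

```python
import math

shape = (2920, 25, 53)    # assumed (time, lat, lon) of the tutorial dataset
itemsize = 8              # bytes per value, assuming float64

total_mb = math.prod(shape) * itemsize / 1e6

chunks = (100, 25, 53)    # hypothetical chunking: 100 time steps per chunk
chunk_mb = math.prod(chunks) * itemsize / 1e6
n_chunks = math.ceil(shape[0] / chunks[0])

print(f"whole array: {total_mb:.1f} MB")
print(f"per chunk:   {chunk_mb:.2f} MB across {n_chunks} chunks")
```

Under those assumptions the whole array is about 31 MB and each chunk is about 1 MB, so per-task scheduling overhead can easily outweigh the actual computation.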

For pure numpy, they are very close now that @Illviljan migrated flox to use np.digitize.
