Using xhistogram to bin measurements at particular stations

Hi all,

I have a collection of Argo profiles, which, as probably all of you know, are collections of temperature and salinity measurements over depth at particular latitude/longitude points.

I’ve been using xhistogram to calculate, for example, mean surface temperatures over a selected lat/lon grid. This works really well, and I’m glad to have this tool.

However, some of my values are discrete (e.g. integer labels). I’d like to calculate quantities like the median and the mode over these discrete values. Is it possible to group my profiles based on which lat/lon bin they belong to, with the intention of calculating median and mode over those groups?

Apologies if I’ve missed something obvious. I’ve spent some time looking at the docs and experimenting, but I haven’t found a solution yet. Thanks in advance for any help or clarification you can provide. :slightly_smiling_face:


Thanks for the question Dan and welcome back to the forum! :hugs:

I am trying to understand what you aim to do, but I’m not quite there yet. Are you saying you’d like to apply a custom reduction using xhistogram, rather than just sum / mean?

Perhaps you could write some pseudo-Python code to express what you wish would work but does not actually work? That would make the discussion more concrete.

1 Like

Thanks for your rapid reply and your warm greeting, Ryan! :blush: Essentially, yes - I’d like to calculate something other than the mean with xhistogram. I’ll give your suggestion a shot. At present, I have a DataArray that looks like this:

The profiles are unevenly distributed in latitude and longitude, and the labels are discrete values assigned by an unsupervised classification algorithm. I’d like to be able to calculate the most common label value in each 1°x1° bin on a lat-lon grid.

Perhaps it would look something like this:

import numpy as np
from xhistogram.xarray import histogram
import matplotlib.pyplot as plt

# define latitude and longitude bin edges (lon_min, lon_max, lat_min,
# lat_max stand for the extent of my domain)
binsize = 1.0  # 1°x1° bins
lon_bins = np.arange(lon_min, lon_max + binsize, binsize)
lat_bins = np.arange(lat_min, lat_max + binsize, binsize)

# either an imagined variant of histogram that also returns the indices
indices_of_profiles_in_each_lat_lon_bin, hist = histogram(
    da.lon, da.lat, bins=[lon_bins, lat_bins]
)

where indices_of_profiles_in_each_lat_lon_bin is a set of indices that tells me which profiles fall into each lat-lon grid cell. I could then select the profiles in each lat-lon grid cell and calculate the mode of the label values.
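(For concreteness, here’s a plain-NumPy version of that two-step idea, with made-up stand-in data rather than my real profiles - `np.digitize` gives the per-profile bin indices, and the mode is then taken cell by cell:)

```python
import numpy as np

rng = np.random.default_rng(0)

# made-up stand-in data: 100 profiles with lon, lat and an integer label
lon = rng.uniform(0.0, 5.0, size=100)
lat = rng.uniform(0.0, 5.0, size=100)
labels = rng.integers(0, 4, size=100)

binsize = 1.0
lon_bins = np.arange(0.0, 5.0 + binsize, binsize)
lat_bins = np.arange(0.0, 5.0 + binsize, binsize)

# step 1: index of the lat-lon cell each profile falls into
i_lon = np.digitize(lon, lon_bins) - 1
i_lat = np.digitize(lat, lat_bins) - 1

# step 2: mode of the labels in each cell (-1 marks empty cells)
nlon, nlat = len(lon_bins) - 1, len(lat_bins) - 1
mode = np.full((nlat, nlon), -1)
for j in range(nlat):
    for i in range(nlon):
        in_cell = labels[(i_lon == i) & (i_lat == j)]
        if in_cell.size:
            values, counts = np.unique(in_cell, return_counts=True)
            mode[j, i] = values[np.argmax(counts)]
```

That works, but the explicit Python loop over cells is exactly the sort of thing I was hoping xhistogram (or something like it) could do for me, lazily and in parallel.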

Even better would be a command that lets me simply do the following. (I’m totally making this code up, no idea if it makes sense…)

A = da.groupby(['lon_bins', 'lat_bins']).mode()

where mode returns the most frequently occurring label value in each lat-lon bin. The result A would be a 2D object that I could plot as follows:

plt.pcolormesh(lon_bins, lat_bins, A)

The result might be something like this, except that the values would be discrete instead of continuous:

I hope that makes sense - please do let me know if something is unclear. :thinking: Thanks again! :blush:

1 Like

Ok, that is much clearer. Thanks for taking the time to write it up.

You’re correct that it is not possible today with xhistogram. The reason is that, as currently implemented, xhistogram relies heavily on the fact that it is easy to just sum up the bin counts from each block of data to reach the total for each bin. Sum is commutative and associative, so it is trivial to parallelize (and most of the code in xhistogram is about making things play well with dask).
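To illustrate the point with plain NumPy (nothing xhistogram-specific): the bin counts from each block can simply be summed to recover the global histogram, which is what makes the reduction so easy to parallelize. A median or mode has no such block-wise combine step.

```python
import numpy as np

rng = np.random.default_rng(42)
data = rng.normal(size=1_000)
bins = np.linspace(-4, 4, 17)

# histogram of the full array in one go
full_counts, _ = np.histogram(data, bins=bins)

# histogram each "block" separately, then just sum the counts
blocks = np.array_split(data, 4)
block_counts = sum(np.histogram(b, bins=bins)[0] for b in blocks)

# the two agree exactly, for any split of the data into blocks
assert np.array_equal(full_counts, block_counts)
```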

FWIW, the algorithm itself is here and is not that complicated to read.

This is a very long-standing open issue in xarray:

I think that the flox package by @dcherian supports it. Let’s see what Deepak has to say about this.

1 Like

Right! Okay, I see now. Thanks for the detailed reply.

Yes, you can do this with flox, but it won’t be fast.

I wrote an example for median here: Custom Aggregations - flox

Replace np.median with scipy.stats.mode to get the mode. This version doesn’t support dask, but this kind of statistic is hard to compute in parallel anyway.
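If you’d rather avoid scipy’s version-dependent `mode` signature, the reduction itself is easy to write with plain NumPy. This is just a sketch of a drop-in helper (not part of flox, and a real flox aggregation would need proper axis handling):

```python
import numpy as np

def mode(values, axis=None):
    """Most frequent value in `values`, flattened; ties are broken by
    the smaller value, following np.unique's sorted output."""
    vals, counts = np.unique(np.asarray(values).ravel(), return_counts=True)
    return vals[np.argmax(counts)]
```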

The documentation on this cool feature is atrocious. So please let me know if you have any questions. PRs to improve the notebook or documentation are also very welcome!

1 Like