Using xhistogram to bin measurements at particular stations

Dan_Jones · March 30, 2022, 1:26pm

Hi all,

I have a collection of Argo profiles, which, as most of you probably know, are sets of temperature and salinity measurements over depth at particular latitude/longitude points.

I’ve been using xhistogram to calculate, for example, mean surface temperatures over a selected lat/lon grid. This works really well, and I’m glad to have this tool.

However, some of my values are discrete (e.g. integer labels). I’d like to calculate quantities like the median and the mode over these discrete values. Is it possible to group my profiles based on which lat/lon bin they belong to, with the intention of calculating median and mode over those groups?

Apologies if I’ve missed something obvious. I’ve spent some time looking at the docs and experimenting, but I haven’t found a solution yet. Thanks in advance for any help or clarification you can provide.

rabernat · March 30, 2022, 2:24pm

Thanks for the question, Dan, and welcome back to the forum!

I am trying to understand what you aim to do, but I’m not quite there yet. Are you saying you’d like to apply a custom reduction using xhistogram, rather than just sum / mean?

Perhaps you could write some pseudo python code to express what you wish would work but does not actually work? That would give some more concreteness to the discussion.

Dan_Jones · March 30, 2022, 3:08pm

Thanks for your rapid reply and your warm greeting, Ryan! Essentially, yes - I’d like to calculate something with histogram other than the mean. I’ll give your suggestion a shot. At present, I have a DataArray that looks like this:

The profiles are unevenly distributed in latitude and longitude, and the labels are discrete values assigned by an unsupervised classification algorithm. I’d like to be able to calculate the most common label value in each 1°x1° bin on a lat-lon grid.

Perhaps it would look something like this:

import numpy as np
from xhistogram.xarray import histogram
import matplotlib.pyplot as plt

# define latitude and longitude bins
binsize = 1.0 # 1°x1° bins
lon_bins = np.arange(lon_min, lon_max, binsize)
lat_bins = np.arange(lat_min, lat_max, binsize)

# either 
indices_of_profiles_in_each_lat_lon_bin, hist = histogram(da.lon, 
                                                          da.lat, 
                                                          bins=[lon_bins, lat_bins])

where indices_of_profiles_in_each_lat_lon_bin is a set of indices that tells me which profiles fall into each lat-lon grid cell. I could then select the profiles in each lat-lon grid cell and calculate the mode of the label values.
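For reference, the index-based idea can be sketched in plain NumPy, without xhistogram. This is only a minimal sketch under stated assumptions: the arrays lon, lat, and labels are made-up stand-ins for the profile data, the -1 fill value for empty cells is an arbitrary choice, and np.digitize plays the role of the hypothetical indices_of_profiles_in_each_lat_lon_bin.

```python
import numpy as np

# Made-up stand-in data: 8 profiles with positions and integer labels.
lon = np.array([0.2, 0.7, 0.5, 1.3, 1.8, 1.1, 0.4, 1.6])
lat = np.array([0.1, 0.9, 0.5, 0.3, 0.6, 0.2, 1.5, 1.7])
labels = np.array([2, 2, 5, 1, 1, 3, 4, 4])

# Bin edges for 1°x1° cells; the stop value is padded so the last edge is kept.
binsize = 1.0
lon_bins = np.arange(0.0, 2.0 + binsize, binsize)  # edges 0, 1, 2
lat_bins = np.arange(0.0, 2.0 + binsize, binsize)

# np.digitize gives 1-based bin numbers for in-range values, so subtract 1.
ilon = np.digitize(lon, lon_bins) - 1
ilat = np.digitize(lat, lat_bins) - 1

nlon, nlat = len(lon_bins) - 1, len(lat_bins) - 1
mode_map = np.full((nlat, nlon), -1)  # -1 marks empty cells (an assumption)

for j in range(nlat):
    for i in range(nlon):
        in_cell = (ilon == i) & (ilat == j)  # profiles falling in this cell
        if in_cell.any():
            values, counts = np.unique(labels[in_cell], return_counts=True)
            # On ties, argmax picks the smallest label value.
            mode_map[j, i] = values[np.argmax(counts)]

print(mode_map)
```

The result has shape (nlat, nlon), so it can be passed to plt.pcolormesh(lon_bins, lat_bins, mode_map). Swapping the np.unique step for np.median(labels[in_cell]) would give a per-cell median instead. The double loop is slow for large grids; it is meant to show the logic, not to be efficient.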

Even better would be a command that lets me simply do the following. (I’m totally making this code up, no idea if it makes sense…)

A = da.groupby(['lon_bins', 'lat_bins']).mode()

where mode returns the most frequently occurring label value in each lat-lon bin. The result A would be a 2D object that I could plot as follows:

plt.pcolormesh(lon_bins, lat_bins, A)

The result might be something like this, except that the values would be discrete instead of continuous:

I hope that makes sense - please do let me know if something is unclear. Thanks again!

rabernat · March 30, 2022, 3:16pm

OK, that is much clearer. Thanks for taking the time to write it up.

You’re correct that this is not possible today with xhistogram. The reason is that, as currently implemented, xhistogram relies heavily on the fact that it is easy to sum the bin counts from each block of data to reach the total for each bin. Summation is commutative and associative, so it is trivial to parallelize (and most of the code in xhistogram is about making things play well with dask).
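To make the point about sums concrete, here is a small NumPy sketch. It is not xhistogram's actual code, and the data and chunking are made up. It shows that per-chunk bin counts add up to the full histogram, whereas per-chunk medians do not combine into the overall median.

```python
import numpy as np

# Made-up data, split into 4 chunks the way a chunked (dask) array would be.
rng = np.random.default_rng(0)
data = rng.normal(size=1000)
chunks = np.array_split(data, 4)
bins = np.linspace(-4.0, 4.0, 17)

# Bin counts: histogram each chunk on its own, then add the counts.
combined = sum(np.histogram(c, bins=bins)[0] for c in chunks)
full = np.histogram(data, bins=bins)[0]
print(np.array_equal(combined, full))  # the chunked and full counts agree

# Median: the median of per-chunk medians is not the overall median.
a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 100.0, 200.0])
median_of_medians = np.median([np.median(a), np.median(b)])  # 51.0
true_median = np.median(np.concatenate([a, b]))  # 3.5
print(median_of_medians, true_median)
```

The same limitation applies to the mode if only the per-chunk modes are kept. A mode can still be recovered from per-chunk counts of each label, because those counts are themselves sums.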

FWIW, the algorithm itself is here and is not that complicated to read.

github.com

xgcm/xhistogram/blob/master/xhistogram/core.py

"""
Numpy API for xhistogram.
"""


import dask
import numpy as np
from functools import reduce
from collections.abc import Iterable
from numpy import (
    searchsorted,
    bincount,
    reshape,
    ravel_multi_index,
    concatenate,
    broadcast_arrays,
)

# range is a keyword so save the builtin so they can use it.
_range = range


Dan_Jones:

Even better would be a command that lets me simply do the following. (I’m totally making this code up, no idea if it makes sense…)
A = da.groupby(['lon_bins', 'lat_bins']).mode()

This is a very long-standing open issue in xarray:

github.com/pydata/xarray

Support multi-dimensional grouped operations and group_over

opened 07:42PM - 18 Feb 15 UTC

shoyer

API design topic-groupby

Multi-dimensional grouped operations should be relatively straightforward -- the main complexity will be writing an N-dimensional concat that doesn't involve repetitively copying data. The idea with `group_over` would be to support groupby operations that act on a single element from each of the given groups, rather than the unique values. For example, `ds.group_over(['lat', 'lon'])` would let you iterate over or apply to 2D slices of `ds`, no matter how many dimensions it has. Roughly speaking (it's a little more complex for the case of non-dimension variables), `ds.group_over(dims)` would get translated into `ds.groupby([d for d in ds.dims if d not in dims])`. Related: #266

I think that the Flox package by @dcherian supports it. Let’s see what Deepak has to say about this.

Dan_Jones · March 30, 2022, 3:40pm

Right! Okay, I see now. Thanks for the detailed reply.

dcherian · March 31, 2022, 4:02pm

Yes, you can do this with flox, but it won’t be fast.

I wrote an example for median here: Custom Aggregations - flox

Replace np.median with scipy.stats.mode to get the mode. This version doesn’t support dask, but this kind of statistic is hard to do in parallel anyway.

The documentation on this cool feature is atrocious. So please let me know if you have any questions. PRs to improve the notebook or documentation are also very welcome!

rlourenco · December 16, 2024, 10:11pm

It’s 2024, but I used it today with mode (for categorical aggregation), and it was pretty helpful (thanks, @dcherian).

dcherian · December 16, 2024, 10:37pm

mode is now built in; you can use func="mode".

xarray-regrid recently added mode as a regridding method, using flox to compute a histogram. You could use that approach for a faster mode.

github.com

xarray-contrib/xarray-regrid/blob/1425e1353b4405b4ea734ef37f78cf4538a72949/src/xarray_regrid/methods/flox_reduce.py#L118


      
        np.uint16,
        np.int32,
        np.uint32,
    ]
    for dtype in int_types:
        if (a.max() <= np.iinfo(dtype).max) and (a.min() >= np.iinfo(dtype).min):
            return dtype
    return np.int64


def compute_mode(
    data: xr.DataArray,
    target_ds: xr.Dataset,
    values: np.ndarray,
    time_dim: str | None,
    fill_value: None | Any = None,
    anti_mode: bool = False,
) -> xr.DataArray:
    """Upsample the input data using a "most common label" (mode) approach.

    Args:
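The general idea of a histogram-based mode can be sketched in plain NumPy. This is an illustration only, not the xarray-regrid or flox implementation. The flat cell index per profile, the label values, and the -1 fill for empty cells are all made-up assumptions. The trick is to count every (cell, label) pair with a single bincount and then take the argmax over labels.

```python
import numpy as np

# Made-up inputs: a flat grid-cell index for each profile and its integer label.
cell = np.array([0, 0, 0, 1, 1, 1, 2, 3])
labels = np.array([2, 2, 5, 1, 1, 3, 4, 4])
n_cells = 4

# Map the label values onto 0..n_labels-1 so they can index a count table.
unique_labels, label_idx = np.unique(labels, return_inverse=True)
n_labels = unique_labels.size

# One bincount over a combined (cell, label) index yields a 2D histogram.
flat = np.ravel_multi_index((cell, label_idx), (n_cells, n_labels))
counts = np.bincount(flat, minlength=n_cells * n_labels)
counts = counts.reshape(n_cells, n_labels)

# The mode of each cell is the label with the largest count.
mode = unique_labels[counts.argmax(axis=1)]
# Cells with no profiles get -1 (an arbitrary fill value).
mode = np.where(counts.sum(axis=1) > 0, mode, -1)
print(mode)
```

Because the counts are plain sums, a table like this could in principle be built chunk by chunk and added together before the argmax, which is what makes the approach attractive for large or dask-backed data.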

rlourenco · December 17, 2024, 12:32am

Great! Thanks for the update.

Dan_Jones · December 28, 2024, 9:29pm

I am glad to hear it!
