I’m trying to use xarray.dataset.groupby_bins to get a 1x1 monthly ocean dataset (dimensions of lat, lon time) averaged into potential density bins. Because this is a Southern Ocean dataset, what I want to end up with is a dataset that has potential density as a sort of proxy for latitude, but I want longitude and time dimensions to remain intact. The potential density is a variable as are some other surface properties (no depth dimension).

When I use .groupby_bins, it squishes everything down along all other dimensions. After some sleuthing it appears there is no way to groupby_bins over multiple dimensions but there has been discussion of getting that going.

The resulting ds_pdens has dimensions of 'pdens_bins'. What I want is dimensions of 'pdens_bins', 'lon', and 'time'. Should I be trying something else? Thanks!

Thanks, @rabernat ! I think I have it mostly working with one issue: As noted in the xhistogram example for averaging a variable (Xhistogram Tutorial — xhistogram 0.2.0+82.ge01291b.dirty documentation), “NaNs in weights will make the weighted sum as nan.” Since I don’t have data everywhere (this is based on bin-averaged Argo float data) I do want NaNs where I don’t have data, but I’m getting zeros, and those are getting averaged in and messing up the result. Any tips? Thank you!

Here’s what I’m doing, with temperature as an example:

Is it safe to assume that ds.temp and ds.pdens have the same original NaN mask? If so, I think the solution is to just leave the NaNs in pdens. That way those cells will be excluded from the histogram completely. Here’s a little example that illustrates what I mean:

from xhistogram.xarray import histogram
import xarray as xr
import numpy as np
data = xr.DataArray([np.nan, 1, 2, 3], dims=['x'], name='foo')
weights = xr.DataArray([100, 1, 1, 1], dims=['x'])
bins = np.array([0, 1.5, 4])
histogram(data, bins=[bins], weights=weights)

Keep in mind that histogram does a weighted sum, not mean. To get to a “bin average”, you would want to divide the weighted histogram by the unweighted histogram. The T/S example shows how to do this (and also includes a additional volume factor for each cell). I’m a bit confused how your code above is able to produce meaningful temperature values. I would have expected the values to be of the order typical_temperature * number_of_latitude_cells.

Thanks so much for your help, @rabernat. Yes, they have the same NaN mask. I understand now what I was missing (and looking at the intermediate arrays as maps instead of zonal averages helped me understand as well). It was just a fluke that the numbers were somewhat reasonable. Also, I had forgotten to cut out the northern hemisphere before doing this calculation so I had some northern hemisphere dense waters in the mix. Updated with correct method and using data only south of 20S: