xarray.Dataset.groupby_bins without squishing other dimensions

I’m trying to use xarray.Dataset.groupby_bins to average a 1x1 monthly ocean dataset (dimensions of lat, lon, time) into potential density bins. Because this is a Southern Ocean dataset, what I want to end up with is a dataset that has potential density as a sort of proxy for latitude, but with the longitude and time dimensions left intact. Potential density is a variable, as are some other surface properties (there is no depth dimension).
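
To make this concrete, here is a rough stand-in for the kind of dataset I mean (made-up sizes and random values, purely for illustration; the real data are bin-averaged Argo fields):

import numpy as np
import xarray as xr

# Hypothetical grid; the real dataset is monthly, 1x1, Southern Ocean only
time = np.arange(12)
lat = np.arange(-75.0, -20.0, 1.0)
lon = np.arange(0.0, 360.0, 1.0)
shape = (time.size, lat.size, lon.size)
ds = xr.Dataset(
    {
        'pdens': (('time', 'lat', 'lon'), 24 + 5 * np.random.rand(*shape)),
        'temp': (('time', 'lat', 'lon'), 10 * np.random.rand(*shape)),
    },
    coords={'time': time, 'lat': lat, 'lon': lon},
)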

When I use .groupby_bins, it squishes everything down along all the other dimensions. After some sleuthing, it appears there is no way to apply groupby_bins over multiple dimensions, though there has been discussion of adding that capability.

Example:
bins = np.arange(24, 29, 0.2)
ds_pdens = ds.groupby_bins('pdens', bins).mean()

The resulting ds_pdens has dimensions of 'pdens_bins'. What I want is dimensions of 'pdens_bins', 'lon', and 'time'. Should I be trying something else? Thanks!


Nancy, this is a great question. The limitations of groupby_bins for this sort of application are one of the things that motivated us to create xhistogram.

What you want to do is quite straightforward in xhistogram. Something like:

from xhistogram.xarray import histogram

bins = np.arange(24, 29, 0.2)
ds_dens = histogram(ds.pdens, bins=[bins], dim=['lat'])
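
With dim=['lat'], only the lat dimension is reduced; lon and time pass through, and a new bin dimension is added. A quick sanity check (assuming a dataset like the sketch above; pdens_bin is the name xhistogram typically gives the new dimension):

hist = histogram(ds.pdens, bins=[bins], dim=['lat'])
print(hist.dims)  # expect something like ('time', 'lon', 'pdens_bin')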

Thanks, @rabernat! I think I have it mostly working, with one issue: as noted in the xhistogram example for averaging a variable (Xhistogram Tutorial — xhistogram 0.2.0+82.ge01291b.dirty documentation), “NaNs in weights will make the weighted sum as nan.” Since I don’t have data everywhere (this is based on bin-averaged Argo float data), I do want NaNs where I don’t have data, but I’m getting zeros instead, and those zeros are getting averaged in and messing up the result. Any tips? Thank you!

Here’s what I’m doing, with temperature as an example:

from xhistogram.xarray import histogram
import matplotlib.pyplot as plt
import numpy as np

bins = np.arange(24, 29, 0.2)
weights = ds.temp
ds_pdens = histogram(ds.pdens, bins=bins, weights=weights, dim=['lat'])

ds_pdens.mean(dim='lon').T.plot()
plt.gca().invert_yaxis()
  • Note: the axis labels differ because in my real dataset pdens is called da_mlpd; I used pdens in the example just to keep it clear.

Is it safe to assume that ds.temp and ds.pdens have the same original NaN mask? If so, I think the solution is to just leave the NaNs in pdens. That way those cells will be excluded from the histogram completely. Here’s a little example that illustrates what I mean:

from xhistogram.xarray import histogram
import xarray as xr
import numpy as np

# The NaN in `data` drops that sample (and its weight of 100) from the
# histogram entirely, so it cannot contaminate the bin sums.
data = xr.DataArray([np.nan, 1, 2, 3], dims=['x'], name='foo')
weights = xr.DataArray([100, 1, 1, 1], dims=['x'])
bins = np.array([0, 1.5, 4])
histogram(data, bins=[bins], weights=weights)
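
And if the masks ever differ, you could enforce a common one before calling histogram. A minimal sketch using standard xarray masking (variable names assumed from your post):

# keep only cells where both pdens and temp are valid
valid = ds.pdens.notnull() & ds.temp.notnull()
ds['pdens'] = ds.pdens.where(valid)
ds['temp'] = ds.temp.where(valid)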

Keep in mind that histogram does a weighted sum, not a mean. To get to a “bin average”, you would want to divide the weighted histogram by the unweighted histogram. The T/S example shows how to do this (and also includes an additional volume factor for each cell). I’m a bit confused how your code above is able to produce meaningful temperature values; I would have expected them to be of the order typical_temperature * number_of_latitude_cells.
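
Concretely, continuing the toy example above, the bin average is just the ratio of two histogram calls (a sketch, not your exact variables):

weighted_sum = histogram(data, bins=[bins], weights=weights)
counts = histogram(data, bins=[bins])
bin_average = weighted_sum / counts  # mean of `weights` within each bin of `data`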

Thanks so much for your help, @rabernat. Yes, they have the same NaN mask. I understand now what I was missing (and looking at the intermediate arrays as maps instead of zonal averages helped me understand as well). It was just a fluke that the numbers were somewhat reasonable. Also, I had forgotten to cut out the northern hemisphere before doing this calculation, so I had some northern hemisphere dense waters in the mix. Updated with the correct method, using only data south of 20°S:

from xhistogram.xarray import histogram
import matplotlib.pyplot as plt
import numpy as np

bins = np.arange(24, 29, 0.2)
weights = ds.temp
# weighted sum per bin / count per bin = bin-averaged temperature
ds_pdens = (
    histogram(ds.pdens, bins=bins, weights=weights, dim=['lat'])
    / histogram(ds.pdens, bins=bins, dim=['lat'])
)

ds_pdens.mean(dim='lon').T.plot()
plt.gca().invert_yaxis()


Here’s a map of just one timestamp of that result as a gut check: [map figure]
