Best practices for inferring meaning from xarray Datasets?

Robert.Pincus · August 30, 2021, 2:57pm

Greetings all -

I am part of a group building tools to do atmospheric radiation calculations within Python using xarray Datasets to provide inputs and store outputs. We’d like to be user-friendly and convention-conforming in designing the API. We’re grateful for any advice on how best to ask users to provide information.

To do a (clear-sky) radiation problem we need to know the physical state of the atmosphere - the pressure and temperature at both layer edges and layer “centers” on a vertical grid. We also need to know the concentrations of a bunch of gases in the atmosphere. Some concentrations are necessary (water vapor, ozone, carbon dioxide…), others (CFC11, say) are optional, and we don’t necessarily know what concentrations will be available.

We have thought to identify gas concentrations following the CF conventions, which have many standard names of the form “mole_fraction_of_XX_in_air”. Our thinking is to assume that any DataArray whose standard_name attribute is of this form represents a gas concentration, and to figure out which gas the data correspond to by parsing the string.
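A minimal sketch of that parsing step, using only the standard library. The function names and the example variable names (`h2o`, `o3`, `t_lay`, `p_lay`) are hypothetical and chosen for illustration; the sketch works on a plain mapping from variable name to attribute dictionary rather than on a Dataset directly.

```python
import re

# Pattern for CF standard names such as "mole_fraction_of_ozone_in_air".
_MOLE_FRACTION = re.compile(r"^mole_fraction_of_(?P<gas>.+)_in_air$")


def gas_from_standard_name(standard_name):
    """Return the gas part of a mole-fraction standard name, or None."""
    match = _MOLE_FRACTION.match(standard_name or "")
    return match.group("gas") if match else None


def find_gases(attrs_by_variable):
    """Map gas name -> variable name for every mole-fraction variable.

    `attrs_by_variable` maps variable names to attribute dictionaries.
    """
    gases = {}
    for var_name, attrs in attrs_by_variable.items():
        gas = gas_from_standard_name(attrs.get("standard_name"))
        if gas is not None:
            gases[gas] = var_name
    return gases


# Hypothetical variable names and attributes, for illustration only.
example_attrs = {
    "h2o": {"standard_name": "mole_fraction_of_water_vapor_in_air"},
    "o3": {"standard_name": "mole_fraction_of_ozone_in_air"},
    "t_lay": {"standard_name": "air_temperature"},
    "p_lay": {},
}
print(find_gases(example_attrs))  # {'water_vapor': 'h2o', 'ozone': 'o3'}
```

With xarray, the input mapping could be built as `{name: da.attrs for name, da in ds.data_vars.items()}`. The gas names come back in the CF spelling (`water_vapor`, not `h2o`), so translating them to whatever labels the radiation code expects would be a separate lookup.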

We are less clear what to do about variables whose values we need at both N layer centers and N+1 layer edges, since the standard names don’t describe where the variables are defined. Are there coordinate conventions we could exploit to determine which coordinates are layer centers and which edges?
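Absent a convention, one fallback is a size-based heuristic: look for a pair of dimensions whose lengths are N and N+1. The sketch below is only that heuristic, not a CF or xarray feature; the function name and the dimension names are hypothetical, and the input is a plain mapping of dimension name to length (the same shape as xarray's `Dataset.sizes`).

```python
def pair_centers_and_edges(sizes):
    """Return (center_dim, edge_dim) pairs where the edge dimension
    has exactly one more element than the center dimension.

    `sizes` maps dimension names to lengths, like xarray's `Dataset.sizes`.
    """
    pairs = []
    for center, n_center in sizes.items():
        for edge, n_edge in sizes.items():
            if n_edge == n_center + 1:
                pairs.append((center, edge))
    return pairs


# Hypothetical dimension names and lengths, for illustration only.
sizes = {"column": 100, "layer": 60, "level": 61}
print(pair_centers_and_edges(sizes))  # [('layer', 'level')]
```

If the result is empty or contains more than one pair, the heuristic is ambiguous and an explicit user-supplied mapping would be needed.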

We do foresee the API allowing users to explicitly specify the mapping from the Dataset being supplied to the layout needed for the calculation, but we’d like to infer the mapping as much as is practical.

We’re grateful for any help based on your collective experience.

Thanks in advance - Robert

rabernat · August 30, 2021, 5:24pm

Robert, this is a really interesting question. I don’t have a full answer, but I’ll point out two tools that might help you.

CF Xarray: interpretation of CF conventions on Xarray datasets
Xgcm: staggered grid awareness for Xarray datasets

If you leverage these, it should be pretty straightforward to do what you need.

This has been a long-running debate within the CF conventions community. Currently CF conventions don’t explicitly describe staggered grids. I opened this issue to try to get it included, but it got swallowed into endless back-and-forth discussion.

github.com/cf-convention/discuss

"mesh variable" instead of "boundary variable" for contiguous grid cells

opened 05:16AM - 23 Nov 19 UTC

rabernat

I work every day with ocean models that use orthogonal curvilinear coordinates (MITgcm, MOM, POP, ROMS, NEMO, etc. etc.). This is an example tripolar grid from CESM:

![image](https://user-images.githubusercontent.com/1197350/209829079-a89cd43c-faaa-44ee-a9e7-5b3f3db6eb86.png)

The grid cells in such models are contiguous quads, with four points specifying the lat / lon vertex locations of each cell. CF conventions tell me ([Section 7.1: Cell Boundaries](http://cfconventions.org/cf-conventions/cf-conventions.html#cell-boundaries)) that I should use a *boundary_variable*.

> A boundary variable will have one more dimension than its associated coordinate or auxiliary coordinate variable.

> In the case where the horizontal grid is described by two-dimensional auxiliary coordinate variables in latitude `lat(n,m)` and longitude `lon(n,m)`, and the associated cells are four-sided, then the boundary variables are given in the form `latbnd(n,m,4)` and `lonbnd(n,m,4)`, where the trailing index runs over the four vertices of the cells

This convention is general enough to accommodate potentially overlapping or non-contiguous quads, essentially `n x m` totally unrelated four-sided shapes.

My main point: **It's inefficient to store structured grid geometry this way.**

> The bounds can be used to decide whether cells are contiguous via the following relationships...

I don't want to have to check this, I want the conventions to tell me. In our latest global high-resolution ocean models, I have a mesh that is of size `n=12960, m=17280`, 223 million cells. I am interested in streamlining my analysis and visualization workflow as much as possible, which means minimizing the required memory and computational steps.

Instead of specifying a boundary variable, I propose to introduce the concept of a **mesh variable**, with the following conventions:

- A mesh variable will have _the same number of dimensions_ as its associated coordinate or auxiliary coordinate variable, but with _one extra element in each dimension_.
- In the case where the horizontal grid is described by two-dimensional auxiliary coordinate variables in latitude `lat(n,m)` and longitude `lon(n,m)`, and the associated cells are four-sided _and contiguous_, then the mesh variables are given in the form `latmesh(n+1, m+1)` and `lonmesh(n+1, m+1)`.

It would not be hard to generate such data, since this is how most GCMs keep track of their own coordinate grids internally (e.g. [MITgcm](https://mitgcm.readthedocs.io/en/latest/algorithm/horiz-grid.html)). This convention also aligns well with how most visualization software plots such data, e.g. [matplotlib's pcolormesh function](https://matplotlib.org/3.1.1/api/_as_gen/matplotlib.pyplot.pcolormesh.html). So adding something like this to the CF conventions would streamline the path from model output to plotting, eliminating the potentially error-prone step of encoding, and then decoding, the "boundary variable" type coordinates. For the dataset I described above, the difference is about 3 GB of memory.

I don't feel strongly about what it's called. Maybe "mesh variable" is not the right choice. But I feel something like this is sorely needed.

cc @adcroft & @StephenGriffies, with whom this topic has come up repeatedly.
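The size of that saving can be checked with a few lines of arithmetic. The issue does not state the data type or whether both latitude and longitude are counted, so the sketch below assumes 4-byte floats and counts a single coordinate variable; under those assumptions the saving comes out at roughly 2.7 GB, the same order as the figure quoted.

```python
n, m = 12960, 17280          # grid size quoted in the issue
bytes_per_value = 4          # assumption: 32-bit floats

cells = n * m
bounds_values = cells * 4               # latbnd(n, m, 4)
mesh_values = (n + 1) * (m + 1)         # latmesh(n+1, m+1)
saved_values = bounds_values - mesh_values
saved_gb = saved_values * bytes_per_value / 1e9

print(f"{cells:,} cells")
print(f"{saved_values:,} fewer values per coordinate variable")
print(f"about {saved_gb:.1f} GB saved per coordinate variable")
```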

The SGRID conventions do exist for this, but there are not a lot of tools that actually implement them. We would like to support SGRID in Xgcm, but it doesn’t work yet.

github.com/xgcm/xgcm

Support SGRID conventions

opened 07:31AM - 21 Jul 18 UTC

closed 03:27PM - 14 Apr 23 UTC

rabernat

Since we started working on xgcm, a new convention for staggered grid netCDF metadata has emerged: http://sgrid.github.io/sgrid (We learned about this thanks to @vrx- in #108.) Reading through the document, I am pleased to see a huge amount of conceptual overlap between the sgrid and xgcm data models. This is a good thing! It means that there is one clear solution to this problem, and multiple independent groups have more-or-less converged on it. We need to update xgcm to understand sgrid conventions (like it currently does with comodo conventions) so it can automatically build a Grid object based on just the sgrid metadata. This will require a few changes under the hood (see #108), but should be doable. I wanted to ping some sgrid developers (@hrajagers, @rsignell-usgs) to get their thoughts on how we could work together. Particularly useful would be if you could point us towards some example netCDF files from real models that implement the conventions.

Robert.Pincus · August 30, 2021, 5:51pm

@rabernat Thanks for the pointers to resources for representing grids. I’m aware of xgcm’s ability to represent staggered grids. For this application it seems like overkill but we’ll keep it in mind.

kthyng · September 7, 2021, 2:48pm

In terms of cf-xarray, you can define a set of custom criteria for the variables you want to be able to recognize, then you can identify them. For example,

# Regex-based criteria to identify sea surface height. I'll be able to then refer to it with my nickname "ssh"
import cf_xarray
my_custom_criteria = {
    "ssh": {
        "standard_name": "sea_surface_height$|sea_surface_elevation|sea_surface_height_above_sea_level$",
        # A single (?i) at the start makes the whole pattern case-insensitive;
        # Python's `re` (3.11+) rejects global flags placed mid-pattern.
        "name": "(?i)sea_surface_elevation(?!.*?_qc)|sea_surface_height_above_sea_level_geoid_mllw$|zeta$|Sea Surface Height(?!.*?_qc)|Water Surface above Datum(?!.*?_qc)"
    },
}
cf_xarray.set_options(custom_criteria=my_custom_criteria)

# Read in your model output or dataset with xarray and call it `ds`

# assuming only a single variable is identified by the custom criteria, this will return it
# if there is more than one, you can use `ds.cf[['ssh']]`. 
ds.cf['ssh']
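
The same pattern could be adapted to the gas-concentration case from the original question. This is only a sketch: the nicknames (`h2o`, `o3`, `co2`) are hypothetical, and the block itself uses just the standard library to build and check the criteria dictionary, with the cf_xarray calls shown as comments.

```python
import re

# Hypothetical nicknames mapped to the gas part of the CF standard name.
gas_names = {
    "h2o": "water_vapor",
    "o3": "ozone",
    "co2": "carbon_dioxide",
}

# One regex criterion per gas, in the same layout as the "ssh" example.
gas_criteria = {
    nickname: {"standard_name": f"mole_fraction_of_{cf_gas}_in_air$"}
    for nickname, cf_gas in gas_names.items()
}

# Quick check of one pattern with the standard library alone.
pattern = gas_criteria["o3"]["standard_name"]
print(bool(re.match(pattern, "mole_fraction_of_ozone_in_air")))  # True
print(bool(re.match(pattern, "air_temperature")))                # False

# With cf_xarray installed, the dictionary could be registered as in the
# example above and a variable looked up by its nickname:
#   import cf_xarray
#   cf_xarray.set_options(custom_criteria=gas_criteria)
#   ds.cf["o3"]
```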

Robert.Pincus · September 10, 2021, 10:52pm

Thanks, @kthyng, that’s nifty.
