Best practices for inferring meaning from xarray Datasets?

Greetings all -

I am part of a group building tools to do atmospheric radiation calculations within Python using xarray Datasets to provide inputs and store outputs. We’d like to be user-friendly and convention-conforming in designing the API. We’re grateful for any advice on how best to ask users to provide information.

To do a (clear-sky) radiation problem we need to know the physical state of the atmosphere - the pressure and temperature at both layer edges and layer “centers” on a vertical grid. We also need to know the concentrations of a number of gases in the atmosphere. Some concentrations are necessary (water vapor, ozone, carbon dioxide…), others (CFC11, say) are optional, and we don’t necessarily know in advance which concentrations will be available.

We have thought to identify gas concentrations following the CF conventions, which have many standard names of the form “mole_fraction_of_XX_in_air”. Our thinking is to assume that any DataArray whose standard_name attribute is of this form represents a gas concentration, and to figure out which gas the data correspond to by parsing the string.
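A minimal sketch of the parsing we have in mind, using only the standard library. The function names here are hypothetical, and the sketch assumes the gas label is whatever sits between “mole_fraction_of_” and “_in_air”:

```python
import re

# Matches CF-style names such as "mole_fraction_of_ozone_in_air".
_GAS_PATTERN = re.compile(r"^mole_fraction_of_(?P<gas>.+)_in_air$")


def gas_from_standard_name(standard_name):
    """Return the gas label embedded in a standard name, or None if it does not match."""
    match = _GAS_PATTERN.match(standard_name or "")
    return match.group("gas") if match else None


def find_gas_variables(attrs_by_variable):
    """Map gas label -> variable name, given a mapping of variable name -> attrs dict."""
    gases = {}
    for var_name, attrs in attrs_by_variable.items():
        gas = gas_from_standard_name(attrs.get("standard_name"))
        if gas is not None:
            gases[gas] = var_name
    return gases


# Hypothetical variable names and attributes, standing in for a real Dataset
example_attrs = {
    "co2": {"standard_name": "mole_fraction_of_carbon_dioxide_in_air"},
    "o3": {"standard_name": "mole_fraction_of_ozone_in_air"},
    "t_lay": {"standard_name": "air_temperature"},
    "junk": {},  # no standard_name at all
}
print(find_gas_variables(example_attrs))
```

With an xarray Dataset `ds`, the input mapping could be built as `{name: da.attrs for name, da in ds.data_vars.items()}`.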

We are less clear what to do about variables whose values we need at both N layer centers and N+1 layer edges, since the standard names don’t describe where the variables are defined. Are there coordinate conventions we could exploit to determine which coordinates are layer centers and which edges?
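Absent a convention, one fallback we have considered is purely size-based: look for pairs of dimensions whose lengths differ by exactly one. The sketch below is only that heuristic, not a CF feature; the function name and the dimension names are hypothetical, and the input is a plain mapping of dimension name to length (what `dict(ds.sizes)` would give for a Dataset `ds`).

```python
def find_center_edge_pairs(sizes):
    """Return (center_dim, edge_dim) pairs where the edge dimension is one longer."""
    pairs = []
    for center, n_center in sizes.items():
        for edge, n_edge in sizes.items():
            if n_edge == n_center + 1:
                pairs.append((center, edge))
    return pairs


# Hypothetical dimension names and lengths
sizes = {"layer": 60, "level": 61, "column": 128}
print(find_center_edge_pairs(sizes))
```

If this returns no pairs, or more than one, the inference is ambiguous and the user would need to supply the mapping explicitly.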

We do foresee the API allowing users to explicitly specify the mapping from the Dataset being supplied to the layout needed for the calculation, but we’d like to infer the mapping as much as is practical.

We’re grateful for any help based on your collective experience.

Thanks in advance - Robert

Robert, this is a really interesting question. I don’t have a full answer, but I’ll point out two tools that might help you.

  • CF Xarray: interpretation of CF conventions on Xarray datasets
  • Xgcm: staggered grid awareness for Xarray datasets

If you leverage these, it should be pretty straightforward to do what you need.

This has been a long-running debate within the CF conventions community. Currently, the CF conventions don’t explicitly describe staggered grids. I opened this issue to try to get it included, but it got swallowed into endless back-and-forth discussion.

The SGRID conventions do exist for this, but not many tools actually implement them. We would like to support SGRID in Xgcm, but it doesn’t work yet.

@rabernat Thanks for the pointers to resources for representing grids. I’m aware of xgcm’s ability to represent staggered grids. For this application it seems like overkill but we’ll keep it in mind.

In terms of cf-xarray, you can define a set of custom criteria for the variables you want to be able to recognize, then you can identify them. For example,

# Regex-based criteria to identify sea surface height. I'll be able to then refer to it with my nickname "ssh"
import cf_xarray
my_custom_criteria = {
    "ssh": {
        "standard_name": "sea_surface_height$|sea_surface_elevation|sea_surface_height_above_sea_level$",
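        # One leading (?i) makes the whole pattern case-insensitive;
        # Python 3.11+ rejects inline global flags anywhere but the start.
        "name": "(?i)sea_surface_elevation(?!.*?_qc)|sea_surface_height_above_sea_level_geoid_mllw$|zeta$|Sea Surface Height(?!.*?_qc)|Water Surface above Datum(?!.*?_qc)"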
    },
}
cf_xarray.set_options(custom_criteria=my_custom_criteria)

# Read in your model output or dataset with xarray and call it `ds`

# assuming only a single variable is identified by the custom criteria, this will return it
# if there is more than one, you can use `ds.cf[['ssh']]`. 
ds.cf['ssh']

Thanks, @kthyng , that’s nifty.