Xarray Accessor for Georeferenced Data Comparisons

Pangeo Community,

We are exploring the possibility of extending xarray with a new accessor and would like to share our idea with you as recommended on the extending xarray page.

Our proposal involves creating an xarray accessor that compares georeferenced DataArrays and Datasets through rioxarray. The comparisons will be based on scoring philosophies for three statistical data types: categorical, continuous, and probabilistic. The accessor functions will need to process and align the xarray objects before performing comparisons. Once the georeferenced objects are homogenized, an agreement array can be computed, with its structure varying based on the statistical data type used. The comparison process will also generate agreement metrics, which likewise depend on the data types involved. Furthermore, we aim to carry attributes from the compared datasets into the resulting agreement outputs. These attributes might be sourced directly from the data files or integrated through a cataloging approach.
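To make the idea concrete, here is a minimal sketch of what the categorical case could look like. The accessor name `agreement_sketch`, the method name `categorical`, and the `fraction_agreement` metric are all hypothetical and only for illustration; the sketch also assumes the two arrays already share a grid, so the alignment step is left out.

```python
import numpy as np
import xarray as xr


@xr.register_dataarray_accessor("agreement_sketch")  # hypothetical accessor name
class AgreementAccessor:
    def __init__(self, xarray_obj):
        self._obj = xarray_obj

    def categorical(self, benchmark):
        """Compare two categorical rasters that already share a grid."""
        # join="exact" raises an error if the coordinates differ, so
        # misaligned inputs fail loudly instead of being silently reindexed.
        candidate, benchmark = xr.align(self._obj, benchmark, join="exact")

        # Agreement array: True where the two rasters hold the same class.
        agreement = candidate == benchmark
        agreement.attrs["comparison_type"] = "categorical"

        # One illustrative metric: the fraction of cells that agree.
        metrics = {"fraction_agreement": float(agreement.mean())}
        return agreement, metrics


coords = {"y": [1.0, 0.0], "x": [0.0, 1.0]}
candidate = xr.DataArray(np.array([[1, 2], [3, 4]]), coords=coords, dims=("y", "x"))
benchmark = xr.DataArray(np.array([[1, 2], [0, 4]]), coords=coords, dims=("y", "x"))

agreement_map, metrics = candidate.agreement_sketch.categorical(benchmark)
```

Continuous and probabilistic comparisons would presumably return differently shaped outputs (for example a difference raster rather than a boolean one), which is why the structure of the agreement array depends on the data type.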

We acknowledge the existence of established projects like climpred and xskillscore, but we feel that at least climpred might be too specific to the climate domain. As our goal is to work with the 2D/3D raster data models found within GDAL, we believe that high-level functionality for comparing these models isn’t represented in the current stack.

We appreciate your time in reviewing our proposal and welcome any feedback or suggestions you may have.

Thank you for your consideration.


Thanks for starting this conversation! One quick comment: it sounds like there are at least two components to the type of operation that you want to do:

  1. (Geospatially) align two or more xarray objects
  2. Compute the comparison metrics on the aligned objects

The first one sounds fairly generic, and is perhaps already implemented in libraries like rioxarray (see the "Reproject Match (For Raster Calculations/Stacking)" example in the rioxarray 0.14.0 documentation).

Some other folks here like @raybellwaves or @aaronspring might have more thoughts on the potential overlap with xskillscore and climpred.


Sounds like a useful package. I’d encourage some discussion with existing projects and maintainers to coordinate efforts.

For example, I’d definitely talk to @kirill.kzb about odc-geo. We should be able to share the core geo accessor across more specialized packages.

IMO the lack of a common in-memory representation of CRS information across our ecosystem is a big problem. (Compare the python ecosystem to R and you’ll see what I mean.) Geo-xarray was supposed to address this but it never went anywhere. It’s worth some time trying to align efforts here.


@TomAugspurger Thanks for getting the conversation started on this.

For the spatial alignment of xarray objects, we do plan on using reproject_match as you suggested. We plan to write a light wrapper around it that adds some comparison context, exposes it through an accessor, checks for spatial alignment before calling reproject_match (we understand it doesn’t do this already?), and extends the concept of alignment to other file/catalog attributes (e.g. temporal alignment).
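As a rough sketch of the "check before reprojecting" idea: the helper names `same_grid` and `match_if_needed` below are hypothetical, and the check is an assumption about what "spatially aligned" should mean here (same CRS, same shape, same coordinate values). Only `rio.crs` and `rio.reproject_match` are taken from rioxarray.

```python
import numpy as np
import xarray as xr


def same_grid(a, b, dims=("y", "x"), rtol=1e-9):
    """Return True when both arrays share size and coordinate values on dims."""
    for dim in dims:
        if dim not in a.dims or dim not in b.dims:
            return False
        if a.sizes[dim] != b.sizes[dim]:
            return False
        if not np.allclose(a[dim].values, b[dim].values, rtol=rtol, atol=0.0):
            return False
    return True


def match_if_needed(candidate, benchmark):
    """Reproject candidate onto benchmark's grid only when the grids differ."""
    import rioxarray  # noqa: F401  (importing registers the .rio accessor)

    if candidate.rio.crs == benchmark.rio.crs and same_grid(candidate, benchmark):
        return candidate  # already aligned, skip the resampling step
    return candidate.rio.reproject_match(benchmark)


grid = {"y": [1.0, 0.0], "x": [0.0, 1.0, 2.0]}
a = xr.DataArray(np.zeros((2, 3)), coords=grid, dims=("y", "x"))
b = xr.DataArray(np.ones((2, 3)), coords=grid, dims=("y", "x"))
shifted = b.assign_coords(x=[0.5, 1.5, 2.5])

aligned = same_grid(a, b)
misaligned = same_grid(a, shifted)
```

Skipping the reprojection when the grids already match avoids an unnecessary resampling pass; temporal or catalog-based alignment would need its own checks on top of this.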

Thanks for connecting us with the other folks as well!

@rabernat
Thanks for exposing us to the inconsistent georeferencing problem. It appears that rioxarray, odc-geo, and metpy all have disparate georeferencing objects. I am sure there are others?

Is odc-geo considered the core geo accessor? Outside of calls to rioxarray functionality, should we depend on its CRS information for consistency across the ecosystem? @kirill.kzb

Some of us are trying to make some progress here, in "Experimentally support CRSIndex" (Issue #588, corteva/rioxarray on GitHub), with the idea that sticking CRS information in an Index object will smooth over a number of pain points related to CRS propagation and raise errors when CRS info is mismatched.

cc @scottyhq @JessicaS11


odc-geo and rioxarray produce equivalent in-memory representations of CRS and hence can extract spatial information from xarray objects produced by each other. The representation is loosely based on the netCDF data model. A special dimensionless coordinate is added to the xarray DataArray, and the attributes of that coordinate contain the CRS information. Storing CRS information in a coordinate variable, rather than directly in the attributes of the data variable, allows that information to propagate through most operations one can perform on an array.
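A small illustration of that propagation using plain xarray only. The coordinate name `spatial_ref`, the attribute name `crs_wkt`, and the placeholder value are assumptions made for this sketch, not a statement of exactly what either library writes.

```python
import numpy as np
import xarray as xr

da = xr.DataArray(
    np.arange(6.0).reshape(2, 3),
    dims=("y", "x"),
    coords={"y": [1.0, 0.0], "x": [0.0, 1.0, 2.0]},
)

# Dimensionless (scalar) coordinate whose attributes carry the CRS.
# "PLACEHOLDER_WKT" stands in for a real WKT string.
da = da.assign_coords(
    spatial_ref=xr.Variable((), 0, attrs={"crs_wkt": "PLACEHOLDER_WKT"})
)

# The scalar coordinate rides along through arithmetic and indexing.
doubled = da * 2
subset = da.isel(x=slice(0, 2))

crs_after_math = doubled.coords["spatial_ref"].attrs["crs_wkt"]
crs_after_isel = subset.coords["spatial_ref"].attrs["crs_wkt"]
```

Attributes set directly on the data variable are less dependable here, since whether they survive an operation depends on the xarray version and the `keep_attrs` setting.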

rioxarray provides a .rio. accessor that exposes a number of methods to query spatial information. Similarly, odc-geo provides an .odc. accessor. odc-geo also implements a GeoBox class (see "GeoBox Model" in the odc-geo 0.3.3 documentation) that provides a number of useful operations and handy visualizations.
