Do you anticipate using the new hierarchical DataTree structure in xarray, i.e., with xarray.Dataset objects organized in sub-groups? If so, we have some design questions about the data model on which we would love feedback:
opened 06:31PM - 07 Jun 24 UTC
design question
topic-DataTree
### What is your issue?
Should coordinate variables be inherited between different levels of an Xarray DataTree?
The DataTree object is intended to represent [hierarchical groups of data](https://xarray-datatree.readthedocs.io/en/latest/) in Xarray, similar to the role of sub-directories in a filesystem or HDF5/netCDF4 groups. A key design question is if/how to enforce coordinate consistency between different levels of a DataTree hierarchy.
As a concrete example of how enforcing coordinate consistency could be useful, consider the following hypothetical DataTree, representing a mix of weather data and satellite images:
<img width="678" alt="image" src="https://github.com/pydata/xarray/assets/1217238/dfd3cddd-7ec8-4f3f-9fe0-b26f9822c595">
Here there are four different coordinate variables, which apply to variables in the DataTree in different ways:
- `time` is a shared coordinate used by both weather and satellite variables
- `station` is used only for weather variables
- `x` and `y` are used only for satellite images
In this data model, coordinate variables are **inherited** by descendant nodes, which means that variables at different levels of a hierarchical DataTree are always aligned. Placing the `time` variable at the root node automatically indicates that it applies to all descendant nodes. Similarly, `station` is in the base `weather_data` node, because it applies to all weather variables, both directly in `weather_data` and in the `temperature` sub-tree. Accessing any of the lower level trees as an `xarray.Dataset` would automatically include coordinates from higher levels (e.g., `time`).
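To make the inheritance rule concrete, here is a minimal pure-Python sketch. It is a toy model, not xarray's API: the `Node` class, its `inherited_coords` method, the `satellite` node name, and all coordinate values are hypothetical and exist only for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Node:
    """Toy stand-in for a tree node (hypothetical, not xarray's DataTree)."""

    name: str
    coords: dict[str, list] = field(default_factory=dict)
    parent: Node | None = None

    def inherited_coords(self) -> dict[str, list]:
        # Collect coordinates from the root downwards, then add this node's own.
        # This sketch does not check for conflicts between levels.
        merged = {} if self.parent is None else self.parent.inherited_coords()
        merged.update(self.coords)
        return merged


# Made-up values; only the node layout follows the example in the text.
root = Node("/", coords={"time": [0, 1, 2]})
weather = Node("weather_data", coords={"station": ["a", "b"]}, parent=root)
temperature = Node("temperature", parent=weather)
satellite = Node("satellite", coords={"x": [0, 1], "y": [0, 1]}, parent=root)

print(sorted(temperature.inherited_coords()))  # ['station', 'time']
print(sorted(satellite.inherited_coords()))    # ['time', 'x', 'y']
```

The `temperature` node defines no coordinates of its own, yet it sees both `time` (from the root) and `station` (from `weather_data`), which is the behavior described above.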
In an alternative data model, coordinate variables at every level of a DataTree are **independent**. This is the model currently implemented in the [experimental DataTree project](https://github.com/xarray-contrib/datatree). To represent the same data, coordinate variables would need to be duplicated alongside data variables at every level of the hierarchy:
<img width="693" alt="image" src="https://github.com/pydata/xarray/assets/1217238/9c3ada2e-4c7e-444d-980a-2618bf20e79f">
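The duplication in the independent model can be sketched with plain dictionaries. Again this is a toy illustration, not xarray code: the group paths, the helper names `copies_of` and `all_equal`, and the coordinate values are assumptions made for the example.

```python
# Independent model: every group carries its own copy of each coordinate it uses.
independent_tree = {
    "weather_data": {"time": [0, 1, 2], "station": ["a", "b"]},
    "weather_data/temperature": {"time": [0, 1, 2], "station": ["a", "b"]},
    "satellite": {"time": [0, 1, 2], "x": [0, 1], "y": [0, 1]},
}


def copies_of(tree, coord):
    """Return the paths of all groups that hold their own copy of `coord`."""
    return [path for path, coords in tree.items() if coord in coords]


def all_equal(tree, coord):
    """Check by hand whether every copy of `coord` has the same values."""
    values = [coords[coord] for coords in tree.values() if coord in coords]
    return all(v == values[0] for v in values)


print(len(copies_of(independent_tree, "time")))  # 3
print(all_equal(independent_tree, "time"))       # True

# Nothing in this model stops one copy from drifting away from the others.
independent_tree["satellite"]["time"] = [0, 1, 5]
print(all_equal(independent_tree, "time"))       # False
```

In this sketch the three `time` copies agree only by convention, so a reader has to compare them explicitly to know whether the groups are aligned.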
Which data model to prefer depends on which of two considerations we value more:
1. **Consistency**: Automatically inherited coordinates will allow for DataTree objects with fewer redundant variables, which are easier to understand at a glance, similar to the role of the shared coordinate system on xarray.Dataset. You don’t need to separately check the `time` coordinates on the weather and satellite data to know that they are the same. Alignment, including matching coordinates and dimension sizes, is enforced by the data model.
2. **Flexibility**: Enforcing consistency limits how you can organize data, because conflicting coordinates at different levels of a DataTree can no longer be represented in Xarray’s data model. In particular, some valid multi-group netCDF4 files or Zarr stores could not be loaded into a single DataTree object.
As a concrete example of what we lose in flexibility, consider the following two representations of a [multiscale image pyramid](https://forum.image.sc/t/multiscale-arrays-v0-1/37930), where each level of zoom has different x and y coordinates:
<img width="775" alt="image" src="https://github.com/pydata/xarray/assets/1217238/6640dde3-c2d7-4aa8-a901-1a309f06ace2">
The version that places the base image at the root of the hierarchy would not be allowed in the inherited coordinates data model, because there would be conflicting x and y coordinates (or dimension sizes) between the root and child nodes. Instead, different levels of zoom would need to be placed under different groups (`zoom_1x`, `zoom_2x`, etc).
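The two pyramid layouts can be compared with a small conflict check. This is a sketch under stated assumptions, not how xarray implements alignment: the nested-dict layout, the `find_conflicts` helper, and the image sizes are hypothetical, and only dimension sizes are compared.

```python
def find_conflicts(node, inherited=None, path="/"):
    """List dimensions whose size differs from the one inherited from an ancestor."""
    inherited = dict(inherited or {})  # copy, so sibling groups stay independent
    conflicts = []
    for dim, size in node.get("dims", {}).items():
        if dim in inherited and inherited[dim] != size:
            conflicts.append(f"{path}: {dim}={size} vs inherited {dim}={inherited[dim]}")
        inherited[dim] = size
    for name, child in node.get("children", {}).items():
        child_path = path.rstrip("/") + "/" + name
        conflicts += find_conflicts(child, inherited, child_path)
    return conflicts


# Layout 1: base image at the root, coarser zoom levels as child groups.
base_at_root = {
    "dims": {"x": 1024, "y": 1024},
    "children": {
        "zoom_2x": {"dims": {"x": 512, "y": 512}},
        "zoom_4x": {"dims": {"x": 256, "y": 256}},
    },
}

# Layout 2: every zoom level in its own sibling group, nothing at the root.
grouped = {
    "children": {
        "zoom_1x": {"dims": {"x": 1024, "y": 1024}},
        "zoom_2x": {"dims": {"x": 512, "y": 512}},
        "zoom_4x": {"dims": {"x": 256, "y": 256}},
    },
}

print(len(find_conflicts(base_at_root)))  # 4
print(find_conflicts(grouped))            # []
```

The first layout reports a conflict for `x` and `y` in each child group, while the sibling layout passes because groups inherit only from their ancestors, never from each other.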
As we consider making this change to the (as yet unreleased) DataTree object in Xarray, I have two questions for prospective DataTree users:
1. Do you agree that giving up the flexibility of independent coordinates in favor of a data model that bakes in more consistency guarantees is a good idea?
2. Do you have existing uses for DataTree objects or multi-group netCDF/Zarr files that would be positively or negatively impacted by this change?
xref: https://github.com/pydata/xarray/pull/9063, https://github.com/pydata/xarray/issues/9056
CC @TomNicholas, @keewis, @owenlittlejohns, @flamingbear, @eni-awowale
(from Stephan’s tweet)
@shoyer