Vector data cubes

Hey,

before I get deeper into this, I wanted to check whether there is any existing implementation of vector data cubes on top of xarray.

Vector data cubes are implemented in the R package stars and, in their words, “vector data cubes are n-D arrays that have (at least) a single spatial dimension that maps to a set of (2-D) vector geometries”. Here’s a nice overview of the concept - Vector Data Cubes.

In Python, this means that one or more coordinate axes of an xarray.DataArray are arrays of shapely geometries (ideally, rather than WKT). I have started looking into that in this notebook (vector-data-cube-exploration.ipynb | notebooksharing.space), figuring out what works out of the box and what needs some work.

I think that this can eventually be a nice bridge between the raster world of xarray and the vector world of GeoPandas. I imagine it being built similarly to rioxarray. The goal is to have a user-friendly API for seamless conversion between xarray and GeoPandas objects, with plotting and spatial operations available without having to jump between the two manually. Shapely 2.0 is very promising in this direction.
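As a rough illustration of the kind of round-trip I have in mind, here is what more or less already works today (the file name and column names below are made up for the example):

import geopandas as gpd

# vector -> cube: the geometry column becomes a coordinate of shapely objects
gdf = gpd.read_file("counties.geojson")  # hypothetical file with population/area columns
cube = gdf.set_index("geometry")[["population", "area"]].to_xarray()

# cube -> vector: back to a GeoDataFrame for plotting and spatial operations
back = gpd.GeoDataFrame(cube.to_dataframe().reset_index(), geometry="geometry")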

Before I try to do more - are you aware of any work in a similar direction? Would you be interested in this? Anyone who’d like to collaborate?

11 Likes

When looking into this I wondered what xagg does with the geometries when it returns an array: xagg — xagg 2.4 documentation

1 Like

Cool notebook! See Experimentally support CRSIndex · Issue #588 · corteva/rioxarray · GitHub for a proposal for a CRSIndex to handle CRS objects at the DataArray level.

I bet you could also create a GeometryIndex to allow more convenient indexing along the geometry coordinate, but I have no experience in this area.

2 Likes

Thanks @dcherian! It would probably make sense to create a custom class and expose the STRtree spatial index for indexing. We already have things like nearest in shapely, so you should be able to get the closest match to any given geometry object.
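For reference, the shapely 2.0 building blocks such an index could wrap look roughly like this (made-up data, just to show the calls):

import numpy as np
import shapely
from shapely import STRtree

# a coordinate of POINT geometries (made up for the example)
geoms = shapely.points(np.arange(5), np.arange(5))
tree = STRtree(geoms)

# nearest match: shapely 2.0 returns the position of the closest tree geometry
idx = tree.nearest(shapely.Point(2.2, 1.9))

# predicate-based query, e.g. everything intersecting a buffer around a point
hits = tree.query(shapely.Point(2, 2).buffer(1.5), predicate="intersects")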

@edzer as far as I understand, xagg first creates a vector grid from the raster, does an overlay to determine the portions of pixels falling into each geometry (to get weights), and then uses those weights to interpolate data from the grid to the geometries. You should then be able to assign that array to the original GeoDataFrame to link it back to the input geometries.
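Roughly this idea, sketched with geopandas (not xagg’s actual code; pixels, zones and the column names are hypothetical):

import geopandas as gpd

# pixels: GeoDataFrame of pixel polygons with a "value" column
# zones: GeoDataFrame of target geometries with a "zone_id" column
pixels = pixels.assign(pixel_area=pixels.area)
pieces = gpd.overlay(pixels, zones, how="intersection")
pieces["weight"] = pieces.area / pieces["pixel_area"]  # portion of each pixel inside the zone
agg = pieces.groupby("zone_id").apply(
    lambda g: (g["value"] * g["weight"]).sum() / g["weight"].sum()
)
zones["value"] = zones["zone_id"].map(agg)  # link the result back to the input geometries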

2 Likes

Thanks for sharing @martinfleis. I see that “long table form” is mentioned at Vector Data Cubes, which is how I would normally have thought to model this data. It’ll be interesting to see which form is most convenient for which operations (and hopefully the various DataFrame.to_xarray and Dataset.to_pandas methods can efficiently move between these forms, but I’m not quite sure which conversions can be zero-copy).
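For example (made-up data, with integer IDs standing in for geometries), going back and forth between the long table form and the cube form could look like this:

import numpy as np
import pandas as pd

# long table form: one row per (geometry_id, time) pair
df = pd.DataFrame({
    "geometry_id": np.repeat([0, 1, 2], 4),
    "time": np.tile(pd.date_range("2022-01-01", periods=4), 3),
    "value": np.random.rand(12),
}).set_index(["geometry_id", "time"])

# cube form: a 2-D array indexed by (geometry_id, time)
ds = df.to_xarray()

# and back to the long form
df_roundtrip = ds.to_dataframe()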

And just to be completely clear, since it confused me at first, this is different from Tables, (x)arrays, and rasters, where rasters might be stored as columns in a tabular structure. IIUC, the unique thing about vector data cubes is having a coordinate dimension whose labels are vector geometries? Again, I think all these representations might be valuable depending on the use case.

1 Like

When I first heard “Vector Data Cube”, I thought of something different.

In Conservative Region Aggregation with Xarray, Geopandas and Sparse I had a slightly different use case. I wanted a 2D array where each value was a vector polygon (a GCM grid cell), and I wanted to intersect these with another set of geometries. Because Xarray does not implement vectorized overlay (like geopandas does), I had to convert the DataArray to a multi-indexed GeoDataFrame.

I’m wondering if there are things we could do to support both workflows. But I don’t want to hijack the thread…

Yes, this is something different from your experiment. You got it right (I think): the main goal is to be able to use an array of shapely geometries as coordinates and make use of their unique spatial nature in indexing and similar operations.

@rabernat I haven’t properly thought about that, though it should be theoretically possible with shapely 2.0 as well. I am just a bit afraid that things like spatial indexing on an array of geometries that is not 1D would break. Shapely does not expect this, even though the underlying numpy ufuncs will probably deal with most of the cases themselves. I would need to experiment with that; it is a slightly different use case. Btw, the approach you used is practically the same as the one we have in tobler for areal interpolation (which just seems to be another term for conservative regridding).

One solution could be to internally ravel (unravel) the n-d coordinate and/or indexer labels before (after) querying the index. Xoak uses this approach, which works pretty well with Xarray advanced indexing.
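A minimal sketch of that trick, assuming a 2-d coordinate of POINT geometries (made-up data):

import numpy as np
import shapely
from shapely import STRtree

# a 2-d coordinate of POINT geometries (made up for the example)
coord = shapely.points(np.random.rand(4, 3), np.random.rand(4, 3))
tree = STRtree(coord.ravel())  # index built on the flattened coordinate

# query with a polygon; shapely returns positions in the flat array
flat_hits = tree.query(shapely.box(0.2, 0.2, 0.6, 0.6), predicate="intersects")

# map the flat positions back to the original (4, 3) coordinate shape
nd_hits = np.unravel_index(flat_hits, coord.shape)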

1 Like

Vector data cubes also have great potential for handling dynamic geospatial entities via multi-dimensional coordinates of shapely 2.0 (or other) geometry objects, e.g., ocean drifting buoys, evolving coastlines, historical country boundaries, etc. In spatial modeling there could be even more dimensions (sensitivity analyses, etc.).

@benbovy I think that there are two different types of vector data cubes to distinguish. One is the case @edzer implemented in stars, where a geometry is used as an index (or coordinates in xarray terminology). The other is when you want to use geometries as values within the cube. Both are super interesting and I hope we’ll eventually manage to get them both working, but each covers a slightly different use case.
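Purely to illustrate the distinction (made-up data):

import numpy as np
import shapely
import xarray as xr

geoms = shapely.points(np.arange(3), np.arange(3))

# 1. geometries as coordinates / index (the stars-like case)
cube_a = xr.DataArray(
    np.random.rand(3, 4),
    coords={"geometry": geoms, "time": range(4)},
    dims=("geometry", "time"),
)

# 2. geometries as values inside the cube (@rabernat’s case)
cube_b = xr.DataArray(geoms.reshape(1, 3), dims=("y", "x"))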

@edzer have you used sf geometries as values within stars array before?

No, although technically you could. I see data cubes as starting with two-dimensional arrays; if it’s one-dimensional it’s a vector. For a trajectory you could have a usual table with a geometry column and a time column, and some further attribute columns. I don’t see the use case for a matrix or higher-dimensional array where geometries are the array values: what would be the coordinates / dimension labels and the meaning of the geometries, @benbovy?

Here are a few examples that I have in mind where vector data cubes could be handy. I lack domain expertise and these examples may be the result of my naive point of view, though. I also hope that I understand the concept of vector data cubes well…

For simplicity I omit array values and data fields (i.e., Xarray data variables) in the data cubes represented below, but there could be many.

Ocean/ESM model outputs vs. in-situ observations (drifting buoys, ship track, etc.)

Model outputs could look like this:

>>> ds_model
Dimensions:  (x: 500, y: 300, time: 365)
Coordinates:
  * time         (time) datetime64[ns] ...
  * grid_center  (x, y) POINT ...

Where grid_center values are cell centers of a 2D curvilinear (static) grid in the form of lat-lon POINT geometries.

Observational data could be represented like this:

>>> ds_obs
Dimensions:  (station: 50, time: 180)
Coordinates:
  * time         (time) datetime64[ns] ...
  * station_id   (station) int64 ...
  * location     (station, time) POINT ...

Where location contains the time-dependent positions of the drifting measurement stations, also as lat-lon POINT geometries.

Let’s say we want to extract all model outputs at the locations of the stations (nearest values). With Xarray advanced indexing we could simply write this (assuming that ds_model.grid_center is backed by a custom, geographic Xarray index):

>>> ds_extract = ds_model.sel(
...    grid_center=ds_obs.location,
...    time=ds_obs.time,
...    method="nearest"
... )
>>> ds_extract
Dimensions:  (station: 50, time: 180)
Coordinates:
  * time         (time) datetime64[ns] ...
  * location     (station, time) POINT ...

Where ds_extract also has all data variables present in ds_model after they’ve been aligned with ds_obs coordinate labels.

I guess we could achieve the same extraction using tables (GeoPandas dataframes) rather than data cubes (Xarray datasets). However, converting the whole ds_model data cube to a dataframe might not be ideal if the model outputs are very large compared to the observations (is that often the case?). Performing the extraction on data cubes and then converting ds_extract to a dataframe for further analysis seems more convenient to me, although I’m probably biased.

This example already shows what we can do with Xoak, although there we handle latitude/longitude values explicitly as separate coordinates. It is probably overkill to use coordinates with POINT geometries here, but maybe we can generalize this example to other geometries (e.g., grid cell boundaries in @rabernat’s use case)?

Analysis of hurricane track forecasts

I have zero knowledge in this domain, but looking at some data available in NHC Data in GIS Formats I could imagine doing a retrospective analysis of previous hurricane track forecasts using vector data cubes in this form:

>>> ds_hurricane_forecasts
Dimensions:  (storm_id: 20, forecast_rel_time: 10)
Coordinates:
  * storm_id       (storm_id) <U4 ...
    storm_year     (storm_id) int64 ...
    storm_name     (storm_id) <U15 ...
    forecast_time  (forecast_rel_time) datetime64[ns] ...
  * track          (storm_id, forecast_rel_time) LINE ...
  * cone           (storm_id, forecast_rel_time) POLYGON ...

Again, using a table might be perfectly fine, although probably a bit messier compared to the representation above. It would be nice if (eventually with Xarray custom indexes) we could select all storms during a given year that were predicted to make landfall in a particular area (i.e., whose cone intersects a coastline) and get the result as another vector data cube preserving the storm_id and forecast_rel_time dimensions.

Maybe there can be additional dimensions if we include alternative forecasts (models) in the dataset.

Landscape evolution vs. life evolution

(Note: things may go wild here.) In geomorphology we are interested in studying the tectonic vs. erosion processes that together drive the evolution of the landscape. Among other experiments, we are trying to investigate whether we can use biological observations as markers of the past evolution of the landscape. In doing so, we are coupling landscape evolution models (LEMs) with life evolution models run over long time scales.

The main output of a LEM is a dynamic digital elevation model from which we can extract features such as river networks. The output of a life evolution (agent-based) model can be the spatio-temporal distribution of some population.

If we want to analyse those outputs (e.g., extract individuals near river channels) for a whole batch of simulations, perhaps we could leverage a vector data cube in the following form:

>>> ds_landscape_vs_life
Dimensions:  (run_id: 10, time: 1000, individual_id: 2000, river_segment_id: 500)
Coordinates:
  * time             (time) int64 ...
  * river_geom       (run_id, time, river_segment_id) LINE ...
  * individual_geom  (run_id, time, individual_id) POINT ...

river_geom is time-dependent as erosion processes may affect the river channel paths.

(To be honest I haven’t encountered this precise use case yet, but it doesn’t seem far-fetched to me)

I’m not familiar with stars, but in Xarray there’s actually not much distinction between coordinates and data variables: both are n-d arrays, and any data variable can easily be promoted / demoted using Dataset.set_coords() / Dataset.reset_coords(). Taking the example of hurricane track forecasts in my previous comment, I can imagine how useful it would be to wrap shapely.intersects() with xarray.apply_ufunc() so that it could be applied to both the cone and track arrays with a single API call, e.g., to select all tracks that cross one or more coastlines (see the sketch below).
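Something along these lines, as a rough sketch (ds and coastline are hypothetical, reusing the dimensions of the hurricane example):

import shapely
import xarray as xr

# ds: Dataset with a (storm_id, forecast_rel_time) "cone" variable of POLYGON objects
# coastline: a single shapely geometry (e.g., a MultiLineString)
mask = xr.apply_ufunc(shapely.intersects, ds.cone, coastline)  # boolean DataArray

# storms whose cone was predicted to cross the coastline at some forecast time
crossing = mask.any(dim="forecast_rel_time")
landfalling = ds.where(crossing, drop=True)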

The only difference is that a coordinate may be backed by an index. But here again, as Xarray now supports custom (non-pandas) indexes, 1-d vs. n-d coordinates don’t matter much anymore; it is rather an implementation detail. For example, one could provide an Xarray index that internally builds a shapely.STRtree from the flattened coordinate labels and handles the conversion between the flattened indices and those in the original coordinate shape.

1 Like

@martinfleis I took a stab at implementing an Xarray-compatible ShapelySTRTreeIndex (cf. @dcherian’s suggestion).

The result is at the bottom of a new version of your notebook that I uploaded here: vector-data-cube-exploration.ipynb | notebooksharing.space.

It enables two modes of geometry-based selection:

  • “nearest” mode (default): selects the nearest geometry for each of the given input geometries
  • “query” mode: selects one or more geometries given a single input geometry and a predicate

It already works pretty well, although there’s certainly more to explore, e.g., if / how we can leverage shapely.STRtree for the alignment of Xarray objects (spatial join, predicate, tolerance?). Also, for simplicity it is currently limited to 1-dimensional coordinates.
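From the user side, the two modes look roughly like this (a hypothetical illustration, not necessarily the notebook’s exact API; ds is assumed to have a 1-d “geometry” coordinate backed by the STRtree-based index):

import shapely

# "nearest" mode: one selected geometry per input geometry
nearest = ds.sel(geometry=[shapely.Point(0, 0), shapely.Point(5, 5)], method="nearest")

# "query" mode: all geometries matching a predicate against a single input geometry
inside = ds.sel(geometry=shapely.box(0, 0, 10, 10), method="intersects")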

3 Likes

@benbovy This is great! I wanted to do the same ahead of the Pangeo showcase call yesterday but didn’t find the time in the end.

I think it would make sense to move the discussion to an actual repository now. Do you think there’s a home for this in some existing package extending xarray? I am not that aware of what is around. If not, I can create something like xvec.

1 Like

I think that a new repository would make perfect sense for a vector data cube Xarray extension! Once it is usable, we can eventually move it into the xarray-contrib · GitHub organization.

2 Likes

Another option would be to put the index into geoxarray.

1 Like

Yes, although it looks like its development has stalled? Also, the scope of geoxarray seems much wider than the “vector data cubes” scope, which is well defined also in terms of dependencies (geopandas, shapely, vector data IO libraries).

Maybe geoxarray could eventually serve as glue between more specialized packages (e.g., rioxarray, xvec, pyresample, etc.)? There are general issues (e.g., how to attach a CRS to Xarray objects?) that need to be addressed across the Xarray geospatial ecosystem, but that’s probably beyond the scope of this thread.

cc @djhoese

2 Likes

I can’t say I followed everything in this thread, but @benbovy I’d say your understanding matches my expectations for geoxarray. The original purpose of geoxarray was to hold the shared logic for CRS handling between packages and to provide a consistent interface for accessing geo-related functionality. So if you wanted to resample to a new projection/grid using any one of the many options out there (gdal/rasterio/rioxarray, pyresample, verde, etc.), it would be the same accessor.

Consistent indexes and coordinates are a major part of that, but other projects had either not settled on best practices for some of these or were waiting for better custom index handling in xarray. I’m not sure what your feelings are on xoak’s involvement, but I’m OK with geoxarray becoming whatever it needs to be to serve the community. If that means storing TreeIndex classes to manage shapely objects in a DataArray, then that’s fine by me, but I’d really like it if shapely weren’t a hard dependency. I was hoping only pyproj would be, but maybe shapely is so useful for the optimizations a library like geoxarray could have that it should be a hard requirement. :thinking:

2 Likes

I propose starting development in a dedicated repository, and once it becomes more fleshed out we can revisit the discussion on integration. It may be moved to geoxarray or under the xarray-contrib umbrella then. I also don’t want to put any burden on the geoxarray devs while working on this.

3 Likes