Here are a few examples that I have in mind where vector data cubes could be handy. I lack domain expertise, though, so these examples may be the result of a naive point of view. I also hope that I understand the concept of vector data cubes correctly…
For simplicity I omit array values and data fields (i.e., Xarray data variables) in the data cubes represented below, but there could be many.
Ocean/ESM model outputs vs. in-situ observations (drifting buoys, ship track, etc.)
Model outputs could look like this:
>>> ds_model
Dimensions: (x: 500, y: 300, time: 365)
Coordinates:
* time (time) datetime64[ns] ...
* grid_center (x, y) POINT ...
Where grid_center values are cell centers of a 2D curvilinear (static) grid in the form of lat-lon POINT geometries.
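To make this concrete: such a geometry coordinate can already be built today with Shapely 2.0's vectorized creation functions and plain Xarray (as an object-dtype array, without any custom index yet). The grid extent below is made up for illustration:

import numpy as np
import shapely
import xarray as xr

# lat-lon cell centers of a (made-up) 500 x 300 curvilinear grid
lon2d, lat2d = np.meshgrid(
    np.linspace(-10, 10, 500), np.linspace(30, 50, 300), indexing="ij"
)

ds_model = xr.Dataset(
    coords={
        "time": ("time", np.arange("2021-01-01", "2022-01-01", dtype="datetime64[D]")),
        # shapely.points is vectorized: it returns a (500, 300) array of POINTs
        "grid_center": (("x", "y"), shapely.points(lon2d, lat2d)),
    },
)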
Observational data could be represented like this:
>>> ds_obs
Dimensions: (station: 50, time: 180)
Coordinates:
* time (time) datetime64[ns] ...
* station_id (station) int64 ...
* location (station, time) POINT ...
Where location contains the time-dependent positions of the drifting measurement stations, also as lat-lon POINT geometries.
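Continuing the sketch above, the observational counterpart could be put together the same way (the random drift trajectories are placeholders):

import numpy as np
import shapely
import xarray as xr

rng = np.random.default_rng(42)

ds_obs = xr.Dataset(
    coords={
        "time": ("time", np.arange("2021-06-01", "2021-11-28", dtype="datetime64[D]")),
        "station_id": ("station", np.arange(50)),
        # one (made-up) drifting position per station and time step
        "location": (
            ("station", "time"),
            shapely.points(
                rng.uniform(-10, 10, (50, 180)), rng.uniform(30, 50, (50, 180))
            ),
        ),
    },
)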
Let’s say we want to extract all model outputs at the locations of the stations (nearest values). With Xarray advanced indexing we could simply write this (assuming that ds_model.grid_center is backed by a custom, geographic Xarray index):
>>> ds_extract = ds_model.sel(
... grid_center=ds_obs.location,
... time=ds_obs.time,
... method="nearest"
... )
>>> ds_extract
Dimensions: (station: 50, time: 180)
Coordinates:
* time (time) datetime64[ns] ...
* location (station, time) POINT ...
Where ds_extract also has all data variables present in ds_model after they’ve been aligned with ds_obs coordinate labels.
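Since the result carries the observation labels, model-vs-observation comparisons would then reduce to plain label-aligned arithmetic (the variable names here are hypothetical, as the reprs above omit data variables):

import numpy as np

# hypothetical variables; Xarray aligns the operands on their shared labels
bias = ds_extract["sst"] - ds_obs["sst_observed"]
rmse = float(np.sqrt((bias**2).mean()))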
I guess we could achieve the same extraction using tables (GeoPandas dataframes) rather than data cubes (Xarray datasets). However, converting the whole ds_model data cube to a dataframe might not be ideal if the model outputs are very large compared to the observations (is that often the case?). Performing the extraction on data cubes and then converting ds_extract to a dataframe for further analysis seems more convenient to me, although I’m probably biased.
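The conversion step for the small extraction result could then be as simple as this (assuming the POINT coordinate ends up as an object column in the dataframe):

import geopandas as gpd

# convert only the extraction result, not the whole model data cube
df = ds_extract.to_dataframe().reset_index()
gdf = gpd.GeoDataFrame(df, geometry="location")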
This example already shows what we can do with Xoak, although there we handle latitude/longitude values explicitly as separate coordinates. Using coordinates with POINT geometries is probably overkill here, but maybe we can generalize this example to other geometries (e.g., grid cell boundaries in @rabernat’s use case)?
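For comparison, here is roughly what the extraction above looks like with Xoak today, assuming ds_model and ds_obs expose explicit lat / lon coordinates instead of POINT geometries:

import xoak  # registers the .xoak Dataset/DataArray accessor

# build a geographic (ball tree) index from the model grid coordinates
ds_model.xoak.set_index(["lat", "lon"], "sklearn_geo_balltree")

# point-wise, nearest-neighbor selection at the observed positions
ds_extract = ds_model.xoak.sel(lat=ds_obs.lat, lon=ds_obs.lon)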
Analysis of hurricane track forecasts
I have zero knowledge in this domain, but looking at some data available in NHC Data in GIS Formats I could imagine doing a retrospective analysis of previous hurricane track forecasts using vector data cubes in this form:
>>> ds_hurricane_forecasts
Dimensions: (storm_id: 20, forecast_rel_time: 10)
Coordinates:
* storm_id (storm_id) <U4 ...
storm_year (storm_id) int64 ...
storm_name (storm_id) <U15 ...
forecast_time (forecast_rel_time) datetime64[ns] ...
* track (storm_id, forecast_rel_time) LINE ...
* cone (storm_id, forecast_rel_time) POLYGON ...
Again, using a table might be perfectly fine, although probably a bit messier compared to the representation above. It would be nice if (eventually with Xarray custom indexes) we could select all storms during a given year that were predicted to make landfall in a particular area (i.e., intersection of cone with a coastline) and get the result as another vector data cube preserving the storm_id and forecast_rel_time dimensions.
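Lacking such indexes, a rough version of that query already seems possible with Shapely 2.0's vectorized predicates (the year and the box standing in for a coastline are made up, and I'm hand-waving over how .where treats the geometry coordinates):

import shapely
import xarray as xr

ds = ds_hurricane_forecasts

# keep storms of a given year, via the non-index storm_year coordinate
ds_year = ds.isel(storm_id=(ds.storm_year == 2005).values)

# placeholder box standing in for a real coastline geometry
coast = shapely.box(-98.0, 18.0, -80.0, 31.0)

# vectorized intersection test over all (storm_id, forecast_rel_time) cones
hits = xr.DataArray(
    shapely.intersects(ds_year.cone.values, coast), dims=ds_year.cone.dims
)
ds_landfall = ds_year.where(hits, drop=True)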
There could also be additional dimensions if we include alternative forecasts (models) in the dataset.
Landscape evolution vs. life evolution
(Note: things may go wild here). In geomorphology we are interested in studying the tectonic vs. erosion processes that both drive the evolution of the landscape. Among other experiments, we are trying to investigate whether we can use biological observations as markers of the past evolution of the landscape. In doing so, we are coupling landscape evolution models (LEMs) with life evolution models run over long time scales.
The main output of a LEM is a dynamic digital elevation model from which we can extract features such as river networks. The output of a life evolution (agent-based) model can be the spatio-temporal distribution of some population.
If we want to analyse those outputs (e.g., extract individuals near river channels) for a whole batch of simulations, perhaps we could leverage a vector data cube in the following form:
>>> ds_landscape_vs_life
Dimensions: (run_id: 10, time: 1000, individual_id: 2000, river_segment_id: 500)
Coordinates:
* time (time) int64 ...
* river_geom (run_id, time, river_segment_id) LINE ...
* individual_geom (run_id, time, individual_id) POINT ...
river_geom is time-dependent, as erosion processes may affect the river channel paths.
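For a single run and time step, the "individuals near river channels" extraction could look like this with Shapely 2.0 (the distance threshold is arbitrary and in the model's length units):

import shapely

# one run, one time step (positional indexing, as run_id has no coordinate)
snap = ds_landscape_vs_life.isel(run_id=0, time=500)

# merge the river segments and flag individuals close to the network
rivers = shapely.union_all(snap.river_geom.values)
near_river = shapely.dwithin(snap.individual_geom.values, rivers, distance=100.0)
individuals = snap.isel(individual_id=near_river)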
(To be honest I haven’t encountered this precise use case yet, but it doesn’t seem far-fetched to me)