For a while I’ve been pondering how to represent collections of forecasts in Xarray, and I know others have been too.
While THREDDS has a nice way to assemble multiple model runs as Forecast Model Run Collections, the staggered nature of the coordinates makes them messy to fit into a single xr.Dataset.
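To make the "staggered coordinates" problem concrete, here is a minimal sketch using plain xarray. The `toy_run` helper, the run times, and the 6-hour step are made up for illustration. Aligning two overlapping runs onto one shared time axis forces NaN padding wherever a run has no forecast for a given time.

```python
import numpy as np
import xarray as xr


def toy_run(reference_time, n_steps=5, step_hours=6):
    """Hypothetical helper: a tiny 1-D model run starting one step after its reference time."""
    ref = np.datetime64(reference_time, "ns")
    leads = np.arange(1, n_steps + 1) * np.timedelta64(step_hours, "h")
    return xr.DataArray(
        np.ones(n_steps, dtype="float32"),
        dims="time",
        coords={"time": ref + leads},
        name="wind_speed",
    )


# Two runs generated 12 hours apart, so their time axes only partly overlap.
run0 = toy_run("2022-12-01T00:00")
run1 = toy_run("2022-12-01T12:00")

# Putting both runs on one shared time axis requires an outer join,
# which pads each run with NaN where it has no forecast.
aligned0, aligned1 = xr.align(run0, run1, join="outer")

n_times = aligned0.sizes["time"]
n_missing = aligned0.isnull().sum().item() + aligned1.isnull().sum().item()
print(f"{n_times} shared time steps, {n_missing} padded cells")
```

With more runs and longer forecasts, the padded fraction grows, which is part of what makes a single dense dataset awkward.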
After some conversations while scheming around Xpublish and ZarrDAP (I’ll get back to working on evolving Xpublish soon!), it’s been on my mind even more, to the point that I couldn’t sleep one night until I wrote a mock-up of an API (and sent it to @rsignell and others to try to make it someone else’s problem).
What kept me up until I had a mock-up was figuring out how to take advantage of @TomNicholas’s work on Datatrees.
dt = xarray_fmrc.from_model_runs([ds0, ds1])
dt
DataTree('None', parent=None)
│ Dimensions: (forecast_reference_time: 2,
│ constant_forecast: 242, constant_offset: 121)
│ Coordinates:
│ * forecast_reference_time (forecast_reference_time) datetime64[ns] 2022-12...
│ * constant_forecast (constant_forecast) datetime64[ns] 2022-12-02 .....
│ * constant_offset (constant_offset) timedelta64[ns] 06:00:00 ... 5...
│ Data variables:
│ model_run_path (forecast_reference_time) <U29 'model_run/2022-1...
└── DataTree('model_run')
├── DataTree('2022-12-01T18:00:00')
│ Dimensions: (forecast_reference_time: 1, time: 121,
│ latitude: 220, longitude: 215)
│ Coordinates:
│ * longitude (longitude) float64 -79.95 -79.86 ... -60.13 -60.04
│ * latitude (latitude) float64 27.03 27.12 ... 47.32 47.41
│ * time (time) datetime64[ns] 2022-12-02 ... 2022-12-07
│ * forecast_reference_time (forecast_reference_time) datetime64[ns] 2022-12...
│ Data variables:
│ wind_speed (forecast_reference_time, time, latitude, longitude) float32 ...
│ wind_from_direction (forecast_reference_time, time, latitude, longitude) float32 ...
│ Attributes: (12/178)
│ ...
└── DataTree('2022-12-12T18:00:00')
Dimensions: (forecast_reference_time: 1, time: 121,
latitude: 220, longitude: 215)
Coordinates:
* longitude (longitude) float64 -79.95 -79.86 ... -60.13 -60.04
* latitude (latitude) float64 27.03 27.12 ... 47.32 47.41
* time (time) datetime64[ns] 2022-12-13 ... 2022-12-18
* forecast_reference_time (forecast_reference_time) datetime64[ns] 2022-12...
Data variables:
wind_speed (forecast_reference_time, time, latitude, longitude) float32 ...
wind_from_direction (forecast_reference_time, time, latitude, longitude) float32 ...
Attributes: (12/178)
...
I think that by placing each model run in the datatree, accessor methods could then provide the various views that users want from a collection of forecasts: methods along the lines of dt.fmrc.constant_offset("12H") to get the values that are 12 hours from the time each forecast was generated, or dt.fmrc.best() to give a dataset with the best forecast data (the least amount of time between generation and the forecasted time).
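As a rough illustration of what those two views mean, here is a sketch in plain xarray. It is not how xarray_fmrc implements them. The helpers `make_run`, `reference_time`, `constant_offset_view`, and `best_view` are hypothetical names, the runs are toy data, and each `wind_speed` value is set to its own lead time in hours so the selection is easy to check by eye.

```python
import numpy as np
import xarray as xr


def make_run(reference_time, n_steps=5, step_hours=6):
    """Toy model run whose wind_speed values equal the lead time in hours."""
    ref = np.datetime64(reference_time, "ns")
    lead_hours = np.arange(1, n_steps + 1) * step_hours
    times = ref + lead_hours * np.timedelta64(1, "h")
    return xr.Dataset(
        {"wind_speed": ("time", lead_hours.astype("float32"))},
        coords={"time": times, "forecast_reference_time": ref},
    )


def reference_time(ds):
    """Return a run's reference time as a NumPy datetime64 scalar."""
    return ds["forecast_reference_time"].values[()]


def constant_offset_view(runs, offset):
    """From each run, pick the value valid exactly `offset` after its reference time."""
    refs, valid_times, values = [], [], []
    for ds in runs:
        target = reference_time(ds) + offset
        if target in ds["time"].values:  # skip runs without that lead time
            refs.append(reference_time(ds))
            valid_times.append(target)
            values.append(ds["wind_speed"].sel(time=target).item())
    return xr.Dataset(
        {"wind_speed": ("forecast_reference_time", values)},
        coords={
            "forecast_reference_time": np.array(refs),
            "time": ("forecast_reference_time", np.array(valid_times)),
        },
    )


def best_view(runs):
    """For each valid time, keep the value from the most recently generated run."""
    newest_first = sorted(runs, key=reference_time, reverse=True)
    result = None
    for ds in newest_first:
        # Drop the scalar coordinate so runs with different reference times combine cleanly.
        da = ds["wind_speed"].drop_vars("forecast_reference_time")
        # combine_first keeps values already present and fills gaps from the older run.
        result = da if result is None else result.combine_first(da)
    return result.sortby("time")


runs = [make_run("2022-12-01T00:00"), make_run("2022-12-01T12:00")]
offset_view = constant_offset_view(runs, np.timedelta64(12, "h"))
best = best_view(runs)
print(offset_view)
print(best)
```

Because the toy values are lead times, the best view should hold the smallest available lead for every valid time, and the constant-offset view should contain only the 12-hour values.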
In trying to go from a quickly slapped-together mock-up to a post with a slightly easier-to-understand example, I ended up tinkering and writing enough code that the mock-up has started to become a reality: xarray_fmrc.
I know there has also been some exploration of using kerchunk.subchunk() and other ways to represent forecast collections, so I’d love to hear other thoughts.