Xarray and collections of forecasts

For a while I’ve been pondering how to represent collections of forecasts in Xarray, and I know others have been too.

While THREDDS has a nice way to assemble multiple model runs as Forecast Model Run Collections, the staggered nature of the coordinates makes them messy to fit into a single xr.Dataset.

After some conversation while scheming around Xpublish and ZarrDAP (I’ll get back to working on evolving Xpublish soon!), it’s been even more on my mind, to the point that I couldn’t sleep one night until I wrote a mock-up of an API (and sent it to @rsignell and others to try to make it someone else’s problem).

What kept me up until I got a mock-up made was figuring out how to take advantage of @TomNicholas’s work on Datatree.

dt = xarray_fmrc.from_model_runs([ds0, ds1])
dt
DataTree('None', parent=None)
│   Dimensions:                  (forecast_reference_time: 2,
│                                 constant_forecast: 242, constant_offset: 121)
│   Coordinates:
│     * forecast_reference_time  (forecast_reference_time) datetime64[ns] 2022-12...
│     * constant_forecast        (constant_forecast) datetime64[ns] 2022-12-02 .....
│     * constant_offset          (constant_offset) timedelta64[ns] 06:00:00 ... 5...
│   Data variables:
│       model_run_path           (forecast_reference_time) <U29 'model_run/2022-1...
└── DataTree('model_run')
    ├── DataTree('2022-12-01T18:00:00')
    │       Dimensions:                  (forecast_reference_time: 1, time: 121,
    │                                     latitude: 220, longitude: 215)
    │       Coordinates:
    │         * longitude                (longitude) float64 -79.95 -79.86 ... -60.13 -60.04
    │         * latitude                 (latitude) float64 27.03 27.12 ... 47.32 47.41
    │         * time                     (time) datetime64[ns] 2022-12-02 ... 2022-12-07
    │         * forecast_reference_time  (forecast_reference_time) datetime64[ns] 2022-12...
    │       Data variables:
    │           wind_speed               (forecast_reference_time, time, latitude, longitude) float32 ...
    │           wind_from_direction      (forecast_reference_time, time, latitude, longitude) float32 ...
    │       Attributes: (12/178)
    │           ...
    └── DataTree('2022-12-12T18:00:00')
            Dimensions:                  (forecast_reference_time: 1, time: 121,
                                          latitude: 220, longitude: 215)
            Coordinates:
              * longitude                (longitude) float64 -79.95 -79.86 ... -60.13 -60.04
              * latitude                 (latitude) float64 27.03 27.12 ... 47.32 47.41
              * time                     (time) datetime64[ns] 2022-12-13 ... 2022-12-18
              * forecast_reference_time  (forecast_reference_time) datetime64[ns] 2022-12...
            Data variables:
                wind_speed               (forecast_reference_time, time, latitude, longitude) float32 ...
                wind_from_direction      (forecast_reference_time, time, latitude, longitude) float32 ...
            Attributes: (12/178)
                ...

I think that by placing each model run in the datatree, accessor methods can then provide the various views that users want from a collection of forecasts: methods along the lines of dt.fmrc.constant_offset("12H") to get values that are 12 hours from the time each forecast was generated, or dt.fmrc.best() to get a dataset with the best forecast data (the least time between generation and a forecasted time).
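To make the intended usage concrete, here’s a sketch of how those accessors could be called (the fmrc accessor name comes from the mock-up above; the rest is illustrative):

    import xarray_fmrc

    # build the tree from individual model-run datasets
    dt = xarray_fmrc.from_model_runs([ds0, ds1])

    # every value 12 hours after its forecast was generated
    ds_offset = dt.fmrc.constant_offset("12H")

    # for each forecast time, the value from the most recently generated run
    ds_best = dt.fmrc.best()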

In trying to go from a quickly slapped-together mock-up to a post with a slightly easier-to-understand example, I ended up tinkering with enough code that the mock-up has started to become a reality: xarray_fmrc.

I know there has also been some exploration of using kerchunk.subchunk() and other ways to represent forecast collections, so I’d love to hear other thoughts.

6 Likes

This is very cool @abkfenris! Our community has wrestled with the question of how best to organize forecast data for a long time…but now we have Datatree! :pray: :chart_with_upwards_trend: :evergreen_tree: I think you’re heading in the right direction.

If you haven’t seen this Pangeo Forge thread, it might have some useful insights from @chiaral

This is great to see @abkfenris! I would love to help out however I can.

If you have any thoughts on what kind of API improvements on datatree would help you, then I’m all ears. There is some discussion of API methods here. I actually added a .filter method just now, which might be useful.

1 Like

I’ve gotten the accessor methods working, at least for my few test cases, so dt.fmrc.best(), dt.fmrc.constant_offset("6h"), and dt.fmrc.constant_forecast("2023-01-10") all work now.

@TomNicholas I’d definitely love some help. I’m somewhat just using datatree as a pre-existing structure to jam model runs into and somewhat naively looping over them.

While filter might take care of this, it would also be nice to be able to filter nodes with broadcasted dataset methods.

Maybe an extra argument along the lines of dt.sel(time=dt, _filter=True) to catch KeyErrors and instead only return a subset of the tree with the selection.
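To spell out the behaviour I’m after, here’s a rough standalone version of that filtering selection (the datatree attribute names here are my best guesses, and it glosses over empty nodes):

    import datatree

    def sel_filter(dt: datatree.DataTree, **indexers) -> datatree.DataTree:
        """Apply .sel to every node, keeping only the nodes where the selection succeeds."""
        selected = {}
        for node in dt.subtree:
            try:
                selected[node.path] = node.ds.sel(**indexers)
            except (KeyError, ValueError):
                # drop nodes that don't have the requested labels / dimensions
                pass
        return datatree.DataTree.from_dict(selected)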

@rabernat sometimes it seems like the only commonality amongst forecasts is that you can at least expect that all the data generated at the same time will be in the same place and can be used together.

That’s why I’m treating each one as its own group when storing it in the tree. Only when you call a specific accessor method does it actually go in and try to get them to play together. At that point you should have enough information to filter out the unwanted duplicate data that would otherwise make it ugly to jam into a single dataset.

2 Likes

Amazing!

I can see how that would be useful. If I’m understanding correctly, this is quite similar to the suggestion in this datatree issue, except you’re talking about ignoring errors from missing variables instead of ignoring errors from missing dimensions. Another idea would be to add something like dt.sel(time=dt, errors="ignore"), inspired by the option xarray’s .drop_vars already has.

@abkfenris in general any snippets of “I would like it if this code acted like this” is really helpful to me.

1 Like

I thought I had seen an issue along those lines, but I think I searched ‘coordinates’ and ‘exception’ (as technically I’m selecting a non-dimension coordinate in one case) so I didn’t find that one. dt.sel(time=dt, errors="ignore") would work great. I’ve got some more ideas that I’ll continue in the issue.

A method like .concat_child_datasets would also be helpful. Between those two, it could simplify two of my methods, taking them from:

    from datetime import timedelta
    from typing import Union

    import pandas as pd
    import xarray as xr


    def constant_offset(self, offset: Union[str, int, float, timedelta]) -> xr.Dataset:
        # normalize the offset ("6h", a timedelta, ...) to a pandas Timedelta
        offset_td = pd.to_timedelta(offset)

        filtered_ds = []

        # gather the selection from every model run that contains this offset
        for child in self.datatree_obj["model_run"].children.values():
            try:
                selected = child.ds.sel(forecast_offset=offset_td)
                filtered_ds.append(selected)
            except KeyError:
                pass

        combined = xr.concat(filtered_ds, "time")
        combined = combined.sortby("time")
        return combined

to:

    def constant_offset(self, offset: Union[str, int, float, timedelta]) -> xr.Dataset:
        return (
            self.datatree_obj["model_run"]
            .sel(forecast_offset=offset)
            .concat_child_datasets("time")
            .sortby("time")
        )

I think it could allow other datatrees with closely related but non-alignable datasets to be quickly collapsed into aligned datasets with less boilerplate.

2 Likes

.concat_child_datasets is analogous to the .merge_child_datasets idea we discussed at the LEAP workshop last week. It seems like collapsing datatrees into datasets could be a broadly useful feature.

1 Like

concat_child_datasets could be something like

import xarray as xr


def concat_child_datasets(self, **concat_options):
    # collapse every dataset in the subtree into a single dataset
    collapsed_ds = xr.concat([node.ds for node in self.subtree], **concat_options)

    self.ds = collapsed_ds

    # then delete / orphan all the children of this node
    # (e.g. by resetting self.children)

    return self

if you’re interested in submitting a PR? Otherwise I’ll try to get to it soon.

I was thinking that concat_child_datasets would return a dataset rather than a datatree.

How about concat_children() and merge_children() that return single nodes, since then the datasets are also directly accessible from .ds?
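Roughly sketching the proposed signatures (these aren’t existing datatree API, and the constructor usage is an assumption):

    import xarray as xr
    from datatree import DataTree


    def concat_children(self, dim, **concat_options) -> DataTree:
        """Concatenate the children's datasets along ``dim`` into a single, childless node."""
        combined = xr.concat([child.ds for child in self.children.values()], dim, **concat_options)
        return DataTree(data=combined)


    def merge_children(self, **merge_options) -> DataTree:
        """Merge the children's datasets into a single, childless node."""
        combined = xr.merge([child.ds for child in self.children.values()], **merge_options)
        return DataTree(data=combined)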

1 Like

You mentioned forecast data, and Python has a community package for that called climpred (climpred: verification of weather and climate forecasts — climpred documentation). It could be possible to collaborate with them to see how forecast data is currently processed, and they might have some ideas.

2 Likes

That’s what I was thinking too; I just didn’t express it very clearly. I’ve raised an issue to continue this discussion.

1 Like

Hi @kdl0013 thanks for the reminder about climpred! I had forgotten about that project and it was interesting to explore what’s happened since I last checked it out.

Climpred has made things work by reshaping forecasts into init (forecast_reference_time, the time the forecast was ‘made’) and lead (an integer offset with units encoded as an attribute) dimensions so that they can be aggregated into a single dataset.

So far, I’ve been largely approaching this problem with my NERACOOS data manager hat on. Most of our users (fishermen, Coast Guard, local communities) really just want the best available data and aren’t worried about individual model runs, but I’ve also got a handful of scientists who are interested in those. We are currently running ERDDAP and THREDDS servers, but don’t have full THREDDS FMRC serving set up (instead I’ve got something manually hacked together with separate datasets for each model run).

For us (and I’m guessing a lot of other orgs), it probably doesn’t make much sense to change our model storage to be init x lead shaped, given our normal usage patterns.

Since there is enough information to get init x lead (even with some changes I’m pondering below), I do think it would be reasonable to provide a method that could return a tree as a climpred-compatible dataset.
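As a rough illustration of what that method could do for each model run (a minimal sketch, assuming a single forecast_reference_time per dataset as in the tree above; names are illustrative):

    import numpy as np
    import xarray as xr


    def to_climpred(ds: xr.Dataset) -> xr.Dataset:
        """Reshape one model run into climpred-style init x lead dimensions."""
        init = ds["forecast_reference_time"].values[0]
        lead_hours = ((ds["time"].values - init) / np.timedelta64(1, "h")).astype(int)

        out = ds.rename({"forecast_reference_time": "init", "time": "lead"})
        out = out.assign_coords(lead=("lead", lead_hours))
        out["lead"].attrs["units"] = "hours"  # climpred reads lead units from an attribute
        return out


    # the runs could then be stacked along init, e.g.
    # xr.concat([to_climpred(node.ds) for node in dt["model_run"].children.values()], "init")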

Right now I’m capturing forecast_reference_time (or init) as a dimension and forecast_period (or lead) as a non-dimension coordinate (forecast_offset) by adding them to each dataset within the tree. forecast_reference_time is also captured in the path within the tree (model_run/{forecast_reference_time}). I’m also making a dataset at the root with that info.
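For concreteness, that per-model-run annotation looks roughly like this (a sketch; the exact internals differ, and assigning a dataset to a path like this is an assumption about datatree’s API):

    import pandas as pd

    ref_time = pd.Timestamp("2022-12-01T18:00:00")

    # a length-1 forecast_reference_time dimension on the model run...
    ds = ds.expand_dims(forecast_reference_time=[ref_time])
    # ...plus forecast_offset as a non-dimension coordinate along time
    ds = ds.assign_coords(forecast_offset=ds["time"] - ref_time)

    # the reference time is also encoded in the node's path
    dt[f"model_run/{ref_time.isoformat()}"] = ds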

Looking at how Kerchunk wants to do aggregations, it probably makes the most sense to avoid reshaping datasets or adding extra data to them. xarray-fmrc could depend on just the forecast_reference_time in the path and derive all the related info as needed for the reshaping it’s doing (well, as long as there is a reasonable time dimension, but I think even that is manageable). That way Kerchunk can refer to existing model data as-is and we can still access it in the various FMRC-ish and climpred ways.
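Deriving the reference time back from a node’s path could be as simple as something like (an illustrative helper, not part of the package):

    import pandas as pd


    def reference_time_from_path(path: str) -> pd.Timestamp:
        # node paths look like "model_run/2022-12-01T18:00:00" in the layout above
        return pd.Timestamp(path.rsplit("/", 1)[-1])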

3 Likes

Slightly off topic, but just in case it’s useful here in some way, here is my GEFS kerchunking notebook.

2 Likes

Love the approach you’re taking with datatree @abkfenris. When we developed climpred, datatree wasn’t around. In hindsight, we effectively created something like it with our nested dictionary structure to hold forecasts, observations, perfect model runs, etc. We also hacked together our classes to apply the same function across all objects (in climpred/classes.py on GitHub, I think under __getattr__). It would be great to make it so climpred accepts datatree objects. Although I left for an industry job without much time (or legal clearance) for open-source work, as did @aaronspring I believe. It would be great for others to pick up the torch if the package still provides value for the community.

4 Likes