Xarray and collections of forecasts

For a while I’ve been pondering how to represent collections of forecasts in Xarray, and I know others have been too.

While THREDDS has a nice way to assemble multiple model runs as Forecast Model Run Collections, the staggered nature of the coordinates makes them messy to fit into a single xr.Dataset.

After some conversation while scheming around Xpublish and ZarrDAP (I’ll get back to working on evolving Xpublish soon!), it’s been even more on my mind, to the point that I couldn’t sleep one night until I wrote a mock-up of an API (and sent it to @rsignell and others to try to make it someone else’s problem).

What kept me up until I got a mock-up made was figuring out how to take advantage of @TomNicholas’s work on Datatree.

dt = xarray_fmrc.from_model_runs([ds0, ds1])
dt
DataTree('None', parent=None)
│   Dimensions:                  (forecast_reference_time: 2,
│                                 constant_forecast: 242, constant_offset: 121)
│   Coordinates:
│     * forecast_reference_time  (forecast_reference_time) datetime64[ns] 2022-12...
│     * constant_forecast        (constant_forecast) datetime64[ns] 2022-12-02 .....
│     * constant_offset          (constant_offset) timedelta64[ns] 06:00:00 ... 5...
│   Data variables:
│       model_run_path           (forecast_reference_time) <U29 'model_run/2022-1...
└── DataTree('model_run')
    ├── DataTree('2022-12-01T18:00:00')
    │       Dimensions:                  (forecast_reference_time: 1, time: 121,
    │                                     latitude: 220, longitude: 215)
    │       Coordinates:
    │         * longitude                (longitude) float64 -79.95 -79.86 ... -60.13 -60.04
    │         * latitude                 (latitude) float64 27.03 27.12 ... 47.32 47.41
    │         * time                     (time) datetime64[ns] 2022-12-02 ... 2022-12-07
    │         * forecast_reference_time  (forecast_reference_time) datetime64[ns] 2022-12...
    │       Data variables:
    │           wind_speed               (forecast_reference_time, time, latitude, longitude) float32 ...
    │           wind_from_direction      (forecast_reference_time, time, latitude, longitude) float32 ...
    │       Attributes: (12/178)
    │           ...
    └── DataTree('2022-12-12T18:00:00')
            Dimensions:                  (forecast_reference_time: 1, time: 121,
                                          latitude: 220, longitude: 215)
            Coordinates:
              * longitude                (longitude) float64 -79.95 -79.86 ... -60.13 -60.04
              * latitude                 (latitude) float64 27.03 27.12 ... 47.32 47.41
              * time                     (time) datetime64[ns] 2022-12-13 ... 2022-12-18
              * forecast_reference_time  (forecast_reference_time) datetime64[ns] 2022-12...
            Data variables:
                wind_speed               (forecast_reference_time, time, latitude, longitude) float32 ...
                wind_from_direction      (forecast_reference_time, time, latitude, longitude) float32 ...
            Attributes: (12/178)
                ...

I think that by placing each model run in the datatree, accessor methods can then provide the various views that users want from a collection of forecasts: methods along the lines of dt.fmrc.constant_offset("12H") to get values that are 12 hours from the time each forecast was generated, or dt.fmrc.best() to get a dataset with the best forecast data (the least time between generation and a forecasted time).
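To make the intended usage concrete, here’s a sketch of how those accessors could be called (the fmrc accessor name comes from the mock-up above; the rest is illustrative):

    import xarray_fmrc

    # build the tree from individual model-run datasets
    dt = xarray_fmrc.from_model_runs([ds0, ds1])

    # every value 12 hours after its forecast was generated
    ds_offset = dt.fmrc.constant_offset("12H")

    # for each forecast time, the value from the most recently generated run
    ds_best = dt.fmrc.best()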

In trying to go from a quickly slapped-together mock-up to a post with a slightly easier-to-understand example, I ended up tinkering with enough code that the mock-up has started to become a reality: xarray_fmrc.

I know there has also been some exploration of using kerchunk.subchunk() and other ways to represent forecast collections, so I’d love to hear other thoughts.

6 Likes

This is very cool @abkfenris! Our community has wrestled with the question of how best to organize forecast data for a long time…but now we have Datatree! :pray: :chart_with_upwards_trend: :evergreen_tree: I think you’re heading in the right direction.

If you haven’t seen this Pangeo Forge thread, it might have some useful insights from @chiaral

This is great to see @abkfenris! I would love to help out however I can.

If you have any thoughts on what kind of API improvements on datatree would help you, then I’m all ears. There is some discussion of API methods here. I actually added a .filter method just now, which might be useful.

1 Like

I’ve gotten the accessor methods working, at least for my few test cases, so dt.fmrc.best(), dt.fmrc.constant_offset("6h"), and dt.fmrc.constant_forecast("2023-01-10") all work now.

@TomNicholas I’d definitely love some help. I’m somewhat just using datatree as a pre-existing structure to jam model runs into and somewhat naively looping over them.

While filter might take care of this, it would also be nice to be able to filter nodes with broadcasted dataset methods.

Maybe an extra argument along the lines of dt.sel(time=dt, _filter=True) to catch KeyErrors and instead only return a subset of the tree with the selection.
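To spell out the behaviour I’m after, here’s a rough standalone version of that filtering selection (the datatree attribute names here are my best guesses, and it glosses over empty nodes):

    import datatree

    def sel_filter(dt: datatree.DataTree, **indexers) -> datatree.DataTree:
        """Apply .sel to every node, keeping only the nodes where the selection succeeds."""
        selected = {}
        for node in dt.subtree:
            try:
                selected[node.path] = node.ds.sel(**indexers)
            except (KeyError, ValueError):
                # drop nodes that don't have the requested labels / dimensions
                pass
        return datatree.DataTree.from_dict(selected)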

@rabernat sometimes it seems like the only commonality amongst forecasts is that you can at least expect that all the data generated at the same time will be in the same place and can be used together.

That’s why I’m treating each one as its own group when storing it in the tree. Only when you call a specific accessor method does it actually go in and try to get them to play together. At that point you should have enough information to filter out the unwanted duplicate data that would otherwise make it ugly to jam into a single dataset.

2 Likes

Amazing!

I can see how that would be useful. If I’m understanding correctly, this is quite similar to the suggestion in this datatree issue, except you’re talking about ignoring errors from missing variables instead of ignoring errors from missing dimensions. Another idea would be to add something like dt.sel(time=dt, errors="ignore"), inspired by the option xarray’s .drop_vars already has.

@abkfenris in general any snippets of “I would like it if this code acted like this” is really helpful to me.

1 Like

I thought I had seen an issue along those lines, but I think I searched ‘coordinates’ and ‘exception’ (as technically I’m selecting a non-dimension coordinate in one case) so I didn’t find that one. dt.sel(time=dt, errors="ignore") would work great. I’ve got some more ideas that I’ll continue in the issue.

A method like .concat_child_datasets would also be helpful. Between those two, it could simplify two of my methods, taking them from:

    from datetime import timedelta
    from typing import Union

    import pandas as pd
    import xarray as xr


    def constant_offset(self, offset: Union[str, int, float, timedelta]) -> xr.Dataset:
        # normalize the offset ("6h", a timedelta, ...) to a pandas Timedelta
        offset_td = pd.to_timedelta(offset)

        filtered_ds = []

        # gather the selection from every model run that contains this offset
        for child in self.datatree_obj["model_run"].children.values():
            try:
                selected = child.ds.sel(forecast_offset=offset_td)
                filtered_ds.append(selected)
            except KeyError:
                pass

        combined = xr.concat(filtered_ds, "time")
        combined = combined.sortby("time")
        return combined

to:

    def constant_offset(self, offset: Union[str, int, float, timedelta]) -> xr.Dataset:
        return (
            self.datatree_obj["model_run"]
            .sel(forecast_offset=offset)
            .concat_child_datasets("time")
            .sortby("time")
        )

I think it could allow other datatrees with closely related but non-alignable datasets to be quickly collapsed into aligned datasets with less boilerplate.

2 Likes

.concat_child_datasets is analogous to the .merge_child_datasets idea we discussed at the LEAP workshop last week. It seems like collapsing datatrees into datasets could be a broadly useful feature.

1 Like

concat_child_datasets could be something like

import xarray as xr


def concat_child_datasets(self, **concat_options):
    # collapse every dataset in the subtree into a single dataset
    collapsed_ds = xr.concat([node.ds for node in self.subtree], **concat_options)

    self.ds = collapsed_ds

    # then delete / orphan all the children of this node
    # (e.g. by resetting self.children)

    return self

if you’re interested in submitting a PR? Otherwise I’ll try to get to it soon.

I was thinking that concat_child_datasets would return a dataset rather than a datatree.

How about concat_children() and merge_children() that return single nodes, since then the datasets are also directly accessible from .ds?
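Roughly sketching the proposed signatures (these aren’t existing datatree API, and the constructor usage is an assumption):

    import xarray as xr
    from datatree import DataTree


    def concat_children(self, dim, **concat_options) -> DataTree:
        """Concatenate the children's datasets along ``dim`` into a single, childless node."""
        combined = xr.concat([child.ds for child in self.children.values()], dim, **concat_options)
        return DataTree(data=combined)


    def merge_children(self, **merge_options) -> DataTree:
        """Merge the children's datasets into a single, childless node."""
        combined = xr.merge([child.ds for child in self.children.values()], **merge_options)
        return DataTree(data=combined)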

1 Like

You mentioned forecast data, and Python has a community package for that called climpred (climpred: verification of weather and climate forecasts — climpred documentation). It could be possible to collaborate with them to see how forecast data is currently processed, and they might have some ideas.

2 Likes

That’s what I was thinking too; I just didn’t express it very clearly. I’ve raised an issue to continue this discussion.

1 Like

Hi @kdl0013 thanks for the reminder about climpred! I had forgotten about that project and it was interesting to explore what’s happened since I last checked it out.

Climpred has made things work by reshaping forecasts into init (forecast_reference_time, the time the forecast was ‘made’) and lead (an integer offset with units encoded as an attribute) dimensions so that they can be aggregated into a single dataset.

So far, I’ve been largely approaching this problem with my NERACOOS data manager hat on. Most of our users (fishermen, Coast Guard, local communities) really just want the best available data and aren’t worried about individual model runs, but I’ve also got a handful of scientists who are interested in those. We are currently running ERDDAP and THREDDS servers, but don’t have full THREDDS FMRC serving set up (instead I’ve got something manually hacked together with separate datasets for each model run).

For us (and I’m guessing a lot of other orgs), it probably doesn’t make much sense to change our model storage to be init x lead shaped, given our normal usage patterns.

Since there is enough information to get init x lead (even with some changes I’m pondering below), I do think it would be reasonable to provide a method that could return a tree as a climpred-compatible dataset.
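As a rough illustration of what that method could do for each model run (a minimal sketch, assuming a single forecast_reference_time per dataset as in the tree above; names are illustrative):

    import numpy as np
    import xarray as xr


    def to_climpred(ds: xr.Dataset) -> xr.Dataset:
        """Reshape one model run into climpred-style init x lead dimensions."""
        init = ds["forecast_reference_time"].values[0]
        lead_hours = ((ds["time"].values - init) / np.timedelta64(1, "h")).astype(int)

        out = ds.rename({"forecast_reference_time": "init", "time": "lead"})
        out = out.assign_coords(lead=("lead", lead_hours))
        out["lead"].attrs["units"] = "hours"  # climpred reads lead units from an attribute
        return out


    # the runs could then be stacked along init, e.g.
    # xr.concat([to_climpred(node.ds) for node in dt["model_run"].children.values()], "init")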

Right now I’m capturing forecast_reference_time (or init) as a dimension and forecast_period (or lead) as a non-dimension coordinate (forecast_offset) by adding them to each dataset within the tree. forecast_reference_time is also captured in the path within the tree (model_run/{forecast_reference_time}). I’m also making a dataset at the root with that info.
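For concreteness, that per-model-run annotation looks roughly like this (a sketch; the exact internals differ, and assigning a dataset to a path like this is an assumption about datatree’s API):

    import pandas as pd

    ref_time = pd.Timestamp("2022-12-01T18:00:00")

    # a length-1 forecast_reference_time dimension on the model run...
    ds = ds.expand_dims(forecast_reference_time=[ref_time])
    # ...plus forecast_offset as a non-dimension coordinate along time
    ds = ds.assign_coords(forecast_offset=ds["time"] - ref_time)

    # the reference time is also encoded in the node's path
    dt[f"model_run/{ref_time.isoformat()}"] = ds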

Looking at how Kerchunk wants to do aggregations, it probably makes the most sense to avoid reshaping datasets or adding extra data to them. xarray-fmrc could depend on just the forecast_reference_time in the path and derive all the related info as needed for the reshaping it’s doing (well, as long as there is a reasonable time dimension, but I think even that is manageable). That way Kerchunk can refer to existing model data as-is and we can still access it in the various FMRC-ish and climpred ways.
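Deriving the reference time back from a node’s path could be as simple as something like (an illustrative helper, not part of the package):

    import pandas as pd


    def reference_time_from_path(path: str) -> pd.Timestamp:
        # node paths look like "model_run/2022-12-01T18:00:00" in the layout above
        return pd.Timestamp(path.rsplit("/", 1)[-1])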

3 Likes

Slightly off topic, but just in case it’s useful here in some way, here is my GEFS kerchunking notebook.

2 Likes

Love the approach you’re taking with datatree @abkfenris. When we developed climpred, datatree wasn’t around. In hindsight, we effectively created something like it with our nested dictionary structure to hold forecasts, observations, perfect model runs, etc. We also hacked together our classes to apply the same function across all objects (in climpred/classes.py on GitHub, I think under __getattr__). It would be great to make it so climpred accepts datatree objects. Although I left for an industry job without much time (or legal clearance) for open-source work, as did @aaronspring I believe. It would be great for others to pick up the torch if the package still provides value for the community.

4 Likes