Thanks a lot Tom! Yes, our former approach was actually very similar to the CMIP6 leap feedstock: it stored each dataset id, e.g. `cordex.output.EUR-11.SMHI.MPI-M-MPI-ESM-LR.rcp85.SMHI-RCA4.r1i1p1.day.tas.v20180817`, as a single zarr store. As far as I understood, also from reading through the discussions (e.g., Welcome, I need some support for the design of a forecast archive with Zarr), this is not the ideal approach performance-wise if I often want to open and merge several datasets. Usually I did that via an intake catalog search, opening all datasets and merging them (roughly as in the sketch below). In the ERA5 ARCO dataset, by contrast, I get all surface variables in one dataset/zarr store, which is very convenient, and I would aim for something similar instead of storing each variable in a separate zarr store.
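For reference, a minimal sketch of that search-open-merge workflow, assuming an intake-esm catalog; the catalog URL and search facet names here are placeholders for illustration:

```python
import intake
import xarray as xr

# hypothetical catalog URL and facets -- adjust to your CORDEX catalog
cat = intake.open_esm_datastore("https://example.org/cordex-catalog.json")
subset = cat.search(variable_id="tas", frequency="day", experiment_id="rcp85")

# one xarray dataset per matching zarr store
dsets = subset.to_dataset_dict()

# stack all matches along a new ensemble dimension
ens = xr.concat(list(dsets.values()), dim="source_id")
```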
To get an ensemble view, I could create a virtual dataset (e.g., all `tas` from all models) that simply references the existing data. Would this be a good approach, one that also allows me to update the virtual ensemble dataset when new models arrive? I think this option would be:
D. per-source + frequency Zarr stores + virtual ensemble

- Store all variables for one `source_id` and one frequency (e.g., daily, monthly) in a single Zarr store.
- Then build a virtual ensemble dataset (VirtualiZarr / kerchunk / Icechunk) across `source_id`s (see the sketch below).
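As a rough illustration of the virtual ensemble step, a sketch using VirtualiZarr; the store paths are made up, and how `open_virtual_dataset` reads existing Zarr stores varies across VirtualiZarr versions, so treat this as the shape of the workflow rather than a working recipe:

```python
import xarray as xr
from virtualizarr import open_virtual_dataset

# hypothetical per-source + frequency Zarr stores (the option D layout)
stores = [
    "s3://bucket/cordex/EUR-11/MPI-M-MPI-ESM-LR/day.zarr",
    "s3://bucket/cordex/EUR-11/NCC-NorESM1-M/day.zarr",
]

# open each store virtually: only chunk references are read, no data is copied
vdsets = [open_virtual_dataset(store) for store in stores]

# concatenate the members along a new ensemble dimension
ens = xr.concat(vdsets, dim="source_id", coords="minimal", compat="override")

# persist the reference-only ensemble; when a new model arrives, extend the
# store list and regenerate (or append via Icechunk instead)
ens.virtualize.to_kerchunk("cordex_ensemble_refs.json", format="json")
```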
So I'm probably trying to leverage the "Put as much as you can into a single Zarr group / Xarray dataset" recommendation and the "be flexible with extending the ensemble" idea!