Hi all,
I’m preparing to upload new EURO-CORDEX CMIP6 datasets to our bucket on AWS S3 and would love some advice on the best storage layout. It has been a while since I last dealt with this (mostly through Pangeo Forge and our current recipe), and that work was largely based on the incredible efforts to bring CMIP6 datasets to the cloud, so thanks a lot for all of those! I have read a lot around this forum and the relevant GitHub issues, but I know there have been many advances on this topic, so I thought it would be best to get some feedback before I start refactoring my pipelines. Many thanks in advance! So let me explain my…
Use Case
- Multiple models (ensemble members) on the same grid, same variables (e.g., `tas`, `pr`).
- Having an ensemble view with an extra `source_id` dimension for analysis (e.g., most users want to do something like `ds.tas.mean('source_id')`) would be very convenient.
- Not every source has every variable at first → missing data should be NaN, filled later.
- Need versioning: new sources or variables should create new versions of the ensemble dataset, without breaking old ones.
So I was thinking about the following…
Options
A. One big Zarr dataset
- One array per variable, add a `source_id` dimension.
- Missing = NaNs, later overwritten.
Pros: simple, single dataset.
Cons: less modular, lots of placeholders.
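To make Option A concrete, here is a minimal sketch with xarray — the source IDs and bucket paths are placeholders, and it assumes the per-source datasets already share the grid and time axis:

```python
import xarray as xr

# Hypothetical per-source stores, already on the common EUR-11 grid.
sources = ["GERICS-REMO2015", "SMHI-RCA4", "KNMI-RACMO22E"]  # placeholder IDs
datasets = [xr.open_zarr(f"s3://my-bucket/per-source/{s}.zarr") for s in sources]

# Build the hypercube: a new source_id dimension; join="outer" fills
# variables that a source does not (yet) provide with NaN.
ensemble = xr.concat(
    datasets,
    dim=xr.DataArray(sources, dims="source_id", name="source_id"),
    join="outer",
    combine_attrs="drop_conflicts",
)

# One big store; the NaN placeholders can later be overwritten in place
# with region writes (ensemble.to_zarr(..., region=...)) once data arrives.
ensemble.to_zarr("s3://my-bucket/ensemble.zarr", mode="w")
```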
B. Per-source ARCO Zarr + kerchunk
- Each source in its own ARCO Zarr (all variables).
- Use `MultiZarrToZarr` to build a combined reference JSON with `source_id`.
Pros: modular, no duplication.
Cons: static JSON, must regenerate on updates.
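A minimal sketch of the combine step, assuming one kerchunk reference JSON per source already exists (e.g. produced when the per-source stores were built); the paths and source IDs are placeholders:

```python
import json
from kerchunk.combine import MultiZarrToZarr

# Hypothetical inputs: one kerchunk reference JSON per per-source store.
source_ids = ["GERICS-REMO2015", "SMHI-RCA4", "KNMI-RACMO22E"]
ref_jsons = [f"s3://my-bucket/refs/{s}.json" for s in source_ids]

mzz = MultiZarrToZarr(
    ref_jsons,
    remote_protocol="s3",
    remote_options={"anon": False},
    concat_dims=["source_id"],                # the new ensemble dimension
    coo_map={"source_id": source_ids},        # one label per input reference set
    identical_dims=["time", "rlat", "rlon"],  # shared coords, not concatenated
)
combined = mzz.translate()

with open("ensemble_refs.json", "w") as f:
    json.dump(combined, f)

# Users would then open it via fsspec's reference filesystem, e.g.
# xr.open_dataset("reference://", engine="zarr", backend_kwargs={
#     "consolidated": False,
#     "storage_options": {"fo": "ensemble_refs.json", "remote_protocol": "s3"}})
```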
C. Per-source ARCO Zarr + Icechunk
- Each source in its own ARCO Zarr (all variables).
- Icechunk builds a combined `source_id` view with versioning.
Pros: zero-copy, versioned, scalable.
Cons: needs Icechunk infra.
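A minimal sketch of the versioning workflow this would give us, assuming a recent Icechunk Python release (`Repository` / `writable_session` API); bucket, prefix and group names are placeholders, and the plain write below just stands in for however the combined view is actually populated (natively or via virtual references):

```python
import icechunk
import xarray as xr
from icechunk.xarray import to_icechunk

# Icechunk repos are just objects under an S3 prefix, no server involved.
storage = icechunk.s3_storage(
    bucket="my-bucket", prefix="euro-cordex/ensemble", from_env=True
)
repo = icechunk.Repository.open_or_create(storage)

# Write or update the ensemble view on a branch and commit it.
session = repo.writable_session("main")
ensemble = xr.open_zarr("s3://my-bucket/per-source/SMHI-RCA4.zarr")  # placeholder
to_icechunk(ensemble, session, group="day", mode="w")
snapshot_id = session.commit("add SMHI-RCA4 daily variables")

# Releases stay reachable: tag a snapshot and read it back later.
repo.create_tag("v1.0.0", snapshot_id=snapshot_id)
old = xr.open_zarr(
    repo.readonly_session(tag="v1.0.0").store, group="day", consolidated=False
)
```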
I’m somewhat biased towards Option C because it allows us to have several versions/releases of the ensemble. But the most important question for me now is how to approach the storage layout…
Thanks a lot again!
> Need versioning
Icechunk is the only option here that supports versioning.
> Needs Icechunk infra.
There is no “Icechunk infra”! That’s what’s so great about it: it’s just objects in S3. Earthmover’s platform (Arraylake) has infra (that we run for you), but Icechunk is entirely open-source, standalone and serverless (in the sense that Icechunk does not require you to run a server).
> But the most important question for me now is how to approach the storage layout…
I agree this is the part to think about.
> Having an ensemble view with an extra `source_id` dimension for analysis (e.g., most users want to do something like `ds.tas.mean('source_id')`) would be very convenient.
You could definitely do this, essentially creating giant hypercubes for every variable with `source_id` as a dimension. Or you could put each ensemble member in a different group in the store and search over the groups. The former is perhaps slightly neater for users; the latter will likely be more resilient to adding more data or changing requirements later.
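A rough sketch of the group-per-member variant, assuming the members live as Zarr groups under one store and that zarr-python/s3fs can resolve the URL (the path and group names are placeholders):

```python
import xarray as xr
import zarr

store_url = "s3://my-bucket/euro-cordex.zarr"  # placeholder

# Layout: one Zarr group per ensemble member, e.g.
#   euro-cordex.zarr/GERICS-REMO2015, euro-cordex.zarr/SMHI-RCA4, ...
members = list(zarr.open_group(store_url, mode="r").group_keys())

# "Search over the groups" and build the ensemble view lazily at read time:
ensemble = xr.concat(
    [xr.open_zarr(store_url, group=m) for m in members],
    dim=xr.DataArray(members, dims="source_id", name="source_id"),
    join="outer",
)
ensemble.tas.mean("source_id")  # the analysis from the use case above
```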
Note this all sounds very similar to what @jbusecke has been doing with CMIP6.
Thanks a lot Tom! Yes, our former approach was actually very similar to the CMIP6 LEAP feedstock: it basically stored each dataset id, e.g., `cordex.output.EUR-11.SMHI.MPI-M-MPI-ESM-LR.rcp85.SMHI-RCA4.r1i1p1.day.tas.v20180817`, as a single Zarr store. As far as I understood, also from reading through the discussions (e.g., Welcome, I need some support for the design of a forecast archive with Zarr), that is not ideal performance-wise if I often want to open and merge several datasets. Usually I did that using an intake catalog search, opening all datasets and merging them. But, for example, in the ERA5 ARCO dataset I get all surface variables in one dataset/Zarr store, which is very convenient, and I would aim for something similar instead of storing each variable in a separate Zarr store. To get an ensemble view, I could create a virtual dataset (e.g., all `tas` from all models) that simply references the existing data. Would this be a good approach that also allows me to update that virtual ensemble dataset when new models arrive? I think this option would be:
D. Per-source + frequency Zarr stores + virtual ensemble
- Store all variables for one `source_id` and one frequency (e.g., daily, monthly) in a single Zarr store.
- Then build a virtual ensemble dataset (VirtualiZarr / kerchunk / Icechunk) across `source_id`.
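A sketch of what the Option D layout could look like on S3 — the path pattern, the `write_source` helper and the dimension names are all hypothetical, just to make the idea concrete:

```python
import xarray as xr

# Hypothetical Option D layout: one store per source_id and frequency,
# holding all variables of that source at that frequency.
LAYOUT = "s3://my-bucket/EUR-11/{source_id}/{frequency}.zarr"

def write_source(ds: xr.Dataset, source_id: str, frequency: str) -> None:
    """Write one source/frequency dataset into its own Zarr store."""
    ds.to_zarr(LAYOUT.format(source_id=source_id, frequency=frequency), mode="w")

# When a new model arrives, only its own store is written; the virtual
# ensemble (kerchunk JSON, VirtualiZarr dataset or Icechunk commit) is then
# rebuilt across all stores along source_id without touching existing data.
```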
So I’m probably trying to leverage the “put as much as you can into a single Zarr group / Xarray dataset” recommendation together with the “be flexible with extending the ensemble” idea!
Hey @larsbuntemeyer, great to hear from you again! This is a very timely discussion IMO. I have been working a lot on virtual Zarr stores (no native data, just manifest arrays that point to legacy nc files) for the past months. I am also thinking (with some others at Carbonplan and LDEO) about how to learn from our past CMIP6 efforts and bring the improved user perspective to CMIP7 data with all the new tech available right now. Besides the storage layout, I would be curious about the chunking requirements.
Do you want to rechunk the data at all? Or are you planning to leave the chunks as they are in the source files?
You discussed performance above, but this:
> Usually I did that using an intake catalog search, opening all datasets and merging them.
mostly concerns the structure of the dataset metadata. If that is your main bottleneck you might actually be served well by a different layout!
E. Raw NetCDF files on S3 + virtual Icechunk stores per atomic dataset (single variable) + higher-level aggregated stores (all variables per frequency)
- Copy all relevant NetCDF files to S3.
- Create virtual Icechunk stores (these are very small since they contain no data chunks).
- Since these stores are cheap to store, you could experiment with all kinds of ‘aggregation’ levels (I am always going back and forth on what is best for users, but I have concluded that the only way to really find out is to give different choices to users and gather feedback hehe).
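A minimal sketch of what such a virtual store could look like, assuming the VirtualiZarr 1.x API (`open_virtual_dataset` plus the `.virtualize.to_icechunk` accessor — newer releases changed the parser interface, so check the docs for your version) and placeholder file names; note that the Icechunk repo also needs a virtual chunk container configured for the bucket so the references can be resolved on read, which is omitted here for brevity:

```python
import xarray as xr
import icechunk
from virtualizarr import open_virtual_dataset

# 1. The NetCDF files stay where they are on S3; only metadata is read.
urls = [
    "s3://my-bucket/raw/tas_EUR-11_day_19710101-19801231.nc",  # placeholders
    "s3://my-bucket/raw/tas_EUR-11_day_19810101-19901231.nc",
]
vds = [open_virtual_dataset(u, indexes={}) for u in urls]

# 2. Combine the manifest arrays (no data chunks are copied) ...
combined = xr.concat(
    vds, dim="time", coords="minimal", compat="override", combine_attrs="override"
)

# 3. ... and commit the references into a tiny virtual Icechunk store.
storage = icechunk.s3_storage(bucket="my-bucket", prefix="virtual/tas_day", from_env=True)
repo = icechunk.Repository.open_or_create(storage)
session = repo.writable_session("main")
combined.virtualize.to_icechunk(session.store)
session.commit("add tas daily references")
```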
Happy to chat further about this. In my current contract with DevelopmentSeed and NASA Veda we found that this approach can actually be quite performant. If you want some examples, you can check out this repo (a batch conversion), or for an AWS CDK deployment that uses branches to update virtual references, see this repo.
Since you mentioned an intake catalog, I wanted to say that I am very interested in what any of these patterns will look like when cataloged in STAC. I have a highly experimental idea of what a big virtual Icechunk store could look like as a collection-level asset (notebook, json) and how that would relate to the item-level archival files (NetCDF, for instance) (notebook). In the next few weeks I’ll be writing these ideas up more formally in the cloud native geo guide.
Thanks to both of you for all the resources, that’s very valuable!
> Do you want to rechunk the data at all? Or are you planning to leave the chunks as they are in the source files?
Yes, in the past I usually rechunked the data along the time axis into chunks of about 100 MB. That seemed optimal for many users who want to do some seasonal analysis or comparison with observations. There are fewer users who want to extract small spatial subsets over a long time axis (which, as I understand it, would require more spatial chunking so that less data needs to be decompressed).
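For reference, the ~100 MB target can be turned into a chunk length directly from the dataset — a minimal sketch assuming a `tas` variable on an `rlat`/`rlon` grid and a placeholder store path:

```python
import xarray as xr

ds = xr.open_zarr("s3://my-bucket/per-source/SMHI-RCA4/day.zarr")  # placeholder

# ~100 MB chunks along time, full spatial extent per chunk (good for
# time-series / seasonal analysis; spatial-subset users would instead
# benefit from smaller spatial chunks).
target_bytes = 100e6
bytes_per_step = ds.tas.dtype.itemsize * ds.sizes["rlat"] * ds.sizes["rlon"]
time_chunk = max(1, int(target_bytes // bytes_per_step))

rechunked = ds.chunk({"time": time_chunk, "rlat": -1, "rlon": -1})
# rechunked.to_zarr(...)  # or use the rechunker package for large stores
```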
> E. Raw NetCDF files on S3 + virtual Icechunk stores per atomic dataset (single variable) + higher-level aggregated stores (all variables per frequency)
That certainly is also an option, since I suspect many users will still sometimes want to download NetCDF files like they used to do with, e.g., ESGF.
> I am always going back and forth on what is best for users
Also a good thought. I have to get started now, ingest all your resources into my head, and do some testing, including Option E…
@jbusecke Always happy to chat, of course! 