I had a longer post written up in response to @mgrover1’s tweet, but I’ve trimmed it down and am posting here as a reply to this thread since there may be useful context earlier in this thread. I think the path forward is pretty clear though (see below)
First, I outline a few issues with the current setup of cataloging the Zarr CMIP6 data with a CSV:
Paper cuts
A small handful of folks have already made CMIP6 data way more useful with this “simple” setup of a big CSV plus some higher-level tooling for querying / filtering. That’s a huge deal, but there are some pain points, and we have other systems that can solve them.
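For context, the current workflow is roughly: download the CSV, load it with pandas, and filter. A minimal sketch of that pattern, with made-up rows and paths (the real catalog has columns along the lines of `source_id`, `experiment_id`, `variable_id`, and `zstore`):

```python
import pandas as pd

# A hypothetical slice of the big CSV catalog. All rows and paths here
# are invented for illustration.
catalog = pd.DataFrame(
    {
        "source_id": ["NESM3", "NESM3", "GFDL-CM4"],
        "experiment_id": ["ssp585", "historical", "ssp585"],
        "variable_id": ["tasmax", "tasmax", "tasmax"],
        "zstore": [
            "gs://bucket/NESM3/ssp585/tasmax.zarr",
            "gs://bucket/NESM3/historical/tasmax.zarr",
            "gs://bucket/GFDL-CM4/ssp585/tasmax.zarr",
        ],
    }
)

# "Querying" the catalog is just pandas filtering on the in-memory frame,
# which is why the whole file has to be downloaded first.
subset = catalog[
    (catalog["source_id"] == "NESM3") & (catalog["experiment_id"] == "ssp585")
]
print(subset["zstore"].tolist())  # ['gs://bucket/NESM3/ssp585/tasmax.zarr']
```

This works fine, but every user pays the full download cost and needs a pandas-shaped tool to do the filtering, which leads into the issues below.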
1. Not a standardized metadata format
This CSV file with links plus a handful of fields isn’t a standard. I don’t have a great sense for how big a deal this is in practice (it’s clearly already useful), but all else equal I’d prefer to catalog data in a standardized format.
2. Python-centric
While not Python exclusive, this approach is certainly Python-centric. Again, I don’t know how much this matters, but I think there’s some value in having an ecosystem-neutral metadata standard with multiple implementations. I think having buy-in from multiple communities has even more value than an equivalent number of users from a single community. (That said, there might be other groups using these metadata files that I’m not aware of).
3. CMIP-specific
This format is essentially unique to CMIP6 data. If a user wanted to combine CMIP6 data with some other dataset, they’d be using one system to query CMIP6 and another system to query the other data. I’m biased, but I don’t think any dataset is unique or valuable enough to merit its own bespoke system for querying and analysis.
4. Inefficient
Users need to download the entire metadata catalog to search for their data of interest. At 80 MB this isn’t the end of the world, but it’s not ideal. And at some scale, this metadata format would break down. By comparison, the size of the metadata for the Planetary Computer data catalog is in the 100s of GB or TB range.
My recommendation
I think STAC is the right metadata standard to use for this type of data. STAC solves all the pain points I listed above (and more).
- It’s a widely adopted standard with stakeholders from many communities and implementations in many programming languages and tools. It has a small, stable core but is easily flexible enough to catalog this type of data.
- It’s used in a variety of domains with data from a variety of institutions.
- With the STAC API, the inefficiency of shipping around an 80 MB CSV can be solved. You’d ingest the STAC metadata into a database and run a STAC API endpoint. Users can write their queries on any facet you catalog and then get back just the matching items. Unfortunately, running services isn’t really something Pangeo is equipped to do, but perhaps we can work with the right groups (ESGF 2.0?) to make this happen.
We already have examples of cataloging downscaled CMIP6 products on the Planetary Computer; for example, the Planetary Computer hosts the downscaled projections from the Climate Impact Lab. See the full example, but the gist is:
```python
import pystac_client
import xarray as xr

catalog = pystac_client.Client.open(
    "https://planetarycomputer.microsoft.com/api/stac/v1/",
)
search = catalog.search(
    collections=["cil-gdpcir-cc-by"],
    query={"cmip6:source_id": {"eq": "NESM3"}, "cmip6:experiment_id": {"eq": "ssp585"}},
)
asset = search.item_collection()[0].assets["tasmax"]
ds = xr.open_dataset(asset.href, **asset.extra_fields["xarray:open_kwargs"])
ds
```
It’s a bit verbose, but I suspect we’ll see higher-level APIs built on top of this foundation. The raw metadata can be viewed at, e.g., https://planetarycomputer.microsoft.com/api/stac/v1/collections/cil-gdpcir-cc0/items?limit=1
Work items
If there’s sufficient interest in this, here’s a list of work items someone could take on:
- Decide on a data model for the assets in CMIP6
- As a rough heuristic, one STAC item per group of variables with the same dimensions / coordinates makes sense. Then you’d have one asset per variable within that, with each asset linking to the Zarr store for that variable.
- Translate the CMIP6 controlled vocabulary to JSON Schema
- This gets us validation that the metadata in the STAC items (like the `activity_id`) is actually valid.
- I started on this at https://github.com/TomAugspurger/cmip6, but really it should be owned by someone affiliated with the CMIP project (and by someone who knows JSON Schema)
- Generate the STAC metadata for each item
- Set up a STAC database and API
- We use pgstac and stac-fastapi
- This is a decent amount of effort and requires expertise and paid work hours that (IMO) the Pangeo community can’t just supply on its own. There are vendors like Element84, Development Seed, and others who could get something stood up pretty quickly, though.
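To make the data-model bullet above concrete, here’s a sketch of what one STAC item could look like, written out as a plain Python dict so the shape is visible (every id, href, and property value below is invented for illustration; a real implementation would generate these with pystac):

```python
# A hypothetical STAC item covering one group of variables that share
# dimensions/coordinates, with one Zarr asset per variable.
# All ids, hrefs, and property values are made up.
item = {
    "type": "Feature",
    "stac_version": "1.0.0",
    "id": "CMIP6.ScenarioMIP.NESM3.ssp585.Amon.gn",
    "geometry": None,
    "properties": {
        "start_datetime": "2015-01-01T00:00:00Z",
        "end_datetime": "2100-12-31T00:00:00Z",
        # Facets users would query on, drawn from the controlled vocabulary.
        "cmip6:source_id": "NESM3",
        "cmip6:experiment_id": "ssp585",
        "cmip6:table_id": "Amon",
    },
    "assets": {
        # One asset per variable, each linking to that variable's Zarr store.
        "tasmax": {
            "href": "gs://bucket/CMIP6/NESM3/ssp585/tasmax.zarr",
            "type": "application/vnd+zarr",
            "roles": ["data"],
        },
        "pr": {
            "href": "gs://bucket/CMIP6/NESM3/ssp585/pr.zarr",
            "type": "application/vnd+zarr",
            "roles": ["data"],
        },
    },
    "links": [],
}
```

The key design choice is that the item is the queryable unit (one search hit per model/experiment/table group), while the per-variable assets hang off it, mirroring how the Climate Impact Lab collections are modeled.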
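And for the controlled-vocabulary bullet: the idea is to encode each vocabulary as a JSON Schema enum and validate item properties against it. A minimal hand-rolled sketch (the activity list is abbreviated, and in practice you’d run the `jsonschema` package against the full schema rather than this toy check):

```python
# Sketch: one piece of the CMIP6 controlled vocabulary expressed as a
# JSON Schema enum. The activity list here is deliberately abbreviated.
schema = {
    "type": "object",
    "properties": {
        "cmip6:activity_id": {
            "type": "string",
            "enum": ["CMIP", "ScenarioMIP", "HighResMIP"],  # abbreviated
        },
    },
    "required": ["cmip6:activity_id"],
}

def is_valid(properties: dict) -> bool:
    """Toy enum check standing in for a real JSON Schema validator."""
    value = properties.get("cmip6:activity_id")
    spec = schema["properties"]["cmip6:activity_id"]
    return isinstance(value, str) and value in spec["enum"]

print(is_valid({"cmip6:activity_id": "ScenarioMIP"}))    # True
print(is_valid({"cmip6:activity_id": "NotAnActivity"}))  # False
```

Running this kind of validation at STAC-item generation time is what catches a misspelled `activity_id` before it ever lands in the catalog.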