STAC and Earth Systems datasets

Another short update here: we have a new collection in our Planetary Computer staging deployment with a bunch of NetCDF files. We’re using STAC items to model a single year’s worth of data, so we have a STAC item per (model, scenario, year). Each item has nine assets, one per variable (each of which corresponds to a single NetCDF file). You can see an example item here.
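To make the shape concrete, here’s a minimal sketch that loads one item and lists its assets. The item URL is hypothetical; substitute a real one from the staging catalog.

import pystac

# One item per (model, scenario, year); one asset per variable,
# each pointing at a single NetCDF file. URL is a placeholder.
item = pystac.Item.from_file("https://example.com/stac/ACCESS-CM2.historical.1950.json")
for variable, asset in item.assets.items():
    print(variable, "->", asset.href)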

I think this lines up pretty well with how CEDA is modeling things.

One important thing to note: with the flexibility of STAC and the STAC API, we’re able to support queries on custom dimensions. For example, this search gets all the items from the ACCESS-CM2 model over a specific time range:

import pystac_client

# Production endpoint shown here; substitute the staging URL as needed.
catalog = pystac_client.Client.open(
    "https://planetarycomputer.microsoft.com/api/stac/v1/",
)
search = catalog.search(
    collections=["nasa-nex-gddp-cmip6"],
    datetime="1950/2000",
    query={"cmip6:model": {"eq": "ACCESS-CM2"}},
)
items = search.get_all_items()

I’m hopeful that this can cover most of the use cases that currently require a CSV + pandas (see Accessing data in the cloud — Pangeo / ESGF Cloud Data Working Group documentation) to query for subsets of the files. For groups that already have a database and a STAC API, serving those queries through the API feels a bit nicer.
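For comparison, the pattern being replaced looks roughly like this. The catalog URL and column names are assumptions based on the Pangeo Google Cloud CMIP6 catalog; adjust if it has moved.

import pandas as pd

# Download the entire catalog up front, then filter locally with pandas.
df = pd.read_csv("https://storage.googleapis.com/cmip6/pangeo-cmip6.csv")
subset = df[(df.source_id == "ACCESS-CM2") & (df.experiment_id == "historical")]
print(subset.zstore.head())  # links to the matching Zarr stores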


Tom, this seems like awesome progress!

Would you be willing to give a super-informal presentation about this at this Friday’s Pangeo / ESGF CMIP6 call? Working Group Meetings and Membership — Pangeo / ESGF Cloud Data Working Group documentation

That’d be great, if I could present sometime in the first half of the meeting. I have another conflict at 10:30 Eastern.


I had a longer post written up in response to @mgrover1’s tweet, but I’ve trimmed it down and am posting it here as a reply since there may be useful context earlier in this thread. I think the path forward is pretty clear, though (see below).

First, I outline a few issues with the current setup of cataloging the Zarr CMIP6 data with a CSV:

Paper cuts

A small handful of folks have already made CMIP6 data way more useful with this “simple” setup of a big CSV plus some higher-level tooling for querying / filtering. That’s a huge deal, but there are some pain points, and we have other systems that can solve them.

1. Not a standardized metadata format

This CSV file with links plus a handful of fields isn’t a standard. I don’t have a great sense for how big a deal this is in practice (it’s clearly already useful), but all else equal I’d prefer to catalog data in a standardized format.

2. Python-centric

While not Python exclusive, this approach is certainly Python-centric. Again, I don’t know how much this matters, but I think there’s some value in having an ecosystem-neutral metadata standard with multiple implementations. I think having buy-in from multiple communities has even more value than an equivalent number of users from a single community. (That said, there might be other groups using these metadata files that I’m not aware of).

3. CMIP-specific

This format is essentially unique to CMIP6 data. If a user wanted to combine CMIP6 data with some other dataset, they’d be using one system to query CMIP6 and another system to query the other data. I’m biased, but I don’t think any dataset is unique or valuable enough to merit its own bespoke system for querying and analysis.

4. Inefficient

Users need to download the entire metadata catalog to search for their data of interest. At 80 MB this isn’t the end of the world, but it’s not ideal, and at some scale this metadata format would break down entirely. By comparison, the metadata for the Planetary Computer data catalog is in the hundreds of GB to TB range; shipping that around as a CSV isn’t an option.

My recommendation

I think STAC is the right metadata standard to use for this type of data. STAC solves all the pain points I listed above (and more).

  • It’s a widely adopted standard with stakeholders from many communities and implementations in many programming languages and tools. It has a small, stable core but is easily flexible enough to catalog this type of data.
  • It’s used in a variety of domains with data from a variety of institutions.
  • With the STAC API, the inefficiency of shipping around an 80 MB CSV can be solved. You’d ingest the STAC metadata into a database and run a STAC API endpoint. Users can write their queries on any facet you catalog and get back just the matching items. Unfortunately, running services isn’t really something Pangeo is equipped to do, but perhaps we can work with the right groups (ESGF 2.0?) to make this happen.

We already have examples cataloging CMIP6 downscaled products on the Planetary Computer. For example, the cil-gdpcir collections from the Climate Impact Lab. See the full example, but the gist is:

import pystac_client
import xarray as xr

catalog = pystac_client.Client.open(
    "https://planetarycomputer.microsoft.com/api/stac/v1/",
)
search = catalog.search(
    collections=["cil-gdpcir-cc-by"],
    query={"cmip6:source_id": {"eq": "NESM3"}, "cmip6:experiment_id": {"eq": "ssp585"}},
)

# Each asset records the kwargs xarray needs to open it.
asset = search.item_collection()[0].assets["tasmax"]
ds = xr.open_dataset(asset.href, **asset.extra_fields["xarray:open_kwargs"])
ds

It’s a bit verbose, but I suspect we’ll see higher-level APIs built on top of this foundation. The raw metadata can be viewed at, e.g., https://planetarycomputer.microsoft.com/api/stac/v1/collections/cil-gdpcir-cc0/items?limit=1
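And since it’s all plain JSON over HTTP, none of this actually requires pystac or even Python. A quick sketch with requests against the URL above:

import requests

# Fetch the raw STAC metadata for a single item from the API above.
r = requests.get(
    "https://planetarycomputer.microsoft.com/api/stac/v1/collections/cil-gdpcir-cc0/items",
    params={"limit": 1},
)
item = r.json()["features"][0]
print(sorted(item["properties"]))  # includes the cmip6:* fields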

Work items

If there’s sufficient interest in this, here’s a list of work items someone could take on:

  1. Decide on a data model for the assets in CMIP6
    • As a rough heuristic, one STAC item per group of variables with the same dimensions / coordinates makes sense. Then you’d have one asset per variable within that, with each asset linking to the Zarr store for that variable (see the sketch after this list).
  2. Translate the CMIP6 controlled vocabulary to JSON Schema
    • This gets us validation that the metadata in the STAC items (like the activity_id) is actually valid.
    • I started on this at https://github.com/TomAugspurger/cmip6, but really it should be owned by someone affiliated with the CMIP project (and by someone who knows JSON Schema).
  3. Generate the STAC metadata for each item
  4. Set up a STAC database and API
    • We use pgstac and stac-fastapi
    • This is a decent amount of effort and requires some expertise and paid work hours that (IMO) the Pangeo community can’t just do on its own. There are vendors like Element84, Development Seed, and others who could get something started pretty quickly, though.
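To make work items 1 and 2 concrete, here’s a minimal sketch with pystac and jsonschema. The item ID, hrefs, variable list, and the toy schema are all hypothetical; the real controlled vocabulary is much larger.

import datetime

import jsonschema
import pystac

# Work item 1: one item per group of variables sharing dimensions /
# coordinates, with one asset per variable pointing at its Zarr store.
item = pystac.Item(
    id="CMIP6.ScenarioMIP.ACCESS-CM2.ssp585.day",  # made-up ID
    geometry={
        "type": "Polygon",
        "coordinates": [
            [[-180, -90], [180, -90], [180, 90], [-180, 90], [-180, -90]]
        ],
    },
    bbox=[-180, -90, 180, 90],
    datetime=datetime.datetime(2015, 1, 1),
    properties={
        "cmip6:source_id": "ACCESS-CM2",
        "cmip6:experiment_id": "ssp585",
        "cmip6:activity_id": "ScenarioMIP",
    },
)
for variable in ["tasmax", "tasmin", "pr"]:
    item.add_asset(
        variable,
        pystac.Asset(
            href=f"https://example.blob.core.windows.net/cmip6/ssp585/{variable}.zarr",
            media_type="application/vnd+zarr",
            roles=["data"],
        ),
    )

# Work item 2: a toy slice of the controlled vocabulary as JSON Schema.
schema = {
    "type": "object",
    "properties": {
        "cmip6:activity_id": {"enum": ["CMIP", "ScenarioMIP"]},
    },
}
jsonschema.validate(item.properties, schema)  # raises if out of vocabulary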