I’ve been thinking about STAC and Earth Systems datasets (primarily stored in
Zarr) a bit and wanted to share thoughts. Sorry for the length, but my recommendations are that:
- Pangeo should adopt (and extend) STAC to describe our datasets as STAC
Collections (i.e. pangeo-forge should produce STAC collections of the data it
produces).
- We should explore dynamically generating STAC Items for our datasets, if we
can identify compelling use cases for item-level operations.
For those new to STAC, it’s a specification for how geospatial asset metadata is
structured and queried. Roughly speaking, there are “Collections”, which
describe an entire dataset like Landsat 8 Collection 2 Level-2, and “Items”,
which describe a single “snapshot” of that dataset at a specific spatial region
and time (e.g. a single Landsat scene).
A single STAC Item will include assets, which are links to the actual data files
themselves. These are often COGs, but STAC is agnostic to the actual file format
being cataloged (which is good news for us).
The STAC Website does a good job laying out what STAC is and why it’s being
developed.
Earth Systems datasets typically include several multi-dimensional, labeled
arrays. The variables will often share dimensions. For example, you might have
variables like `tmax`, all of which have dimensions `(time, y, x)`. You might
also have something like uncertainty measurements on each of those, which would
be indexed by `(time, y, x, nv)`, where `nv` is an additional dimension for
those measurements.
(pangeo folks, correct me if I got any of that wrong)
I think there’s a natural parallel between these Earth Systems datasets and STAC
Collections (or Catalogs?). They cover some region of space and time. And STAC is
flexible enough to model the additional fields that show up (things like
additional dimensions, coordinates, multiple variables, etc.).
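To make that parallel concrete, here is a minimal sketch of what such a Collection could look like as plain JSON, built in Python. It assumes the `cube:dimensions` field from the datacube extension and the `cube:variables` field proposed for it (see the next steps later in this post). The collection id, the extents, the `tmax_uncertainty` name, and the `"bounds"` dimension type are all made up for illustration. A real Collection would also list the extension's schema URL under `stac_extensions`, which I've left out here.

```python
import json

# Hypothetical Collection for a gridded daily climate dataset stored in Zarr.
# All ids, extents and descriptions are illustrative, not from a real catalog.
collection = {
    "type": "Collection",
    "stac_version": "1.0.0",
    "id": "example-daily-climate",
    "description": "Example gridded daily climate dataset stored in Zarr.",
    "license": "proprietary",
    "extent": {
        "spatial": {"bbox": [[-180.0, -90.0, 180.0, 90.0]]},
        "temporal": {"interval": [["1980-01-01T00:00:00Z", None]]},
    },
    "links": [],
    # One entry per dimension shared by the variables below.
    "cube:dimensions": {
        "time": {"type": "temporal", "extent": ["1980-01-01T00:00:00Z", None]},
        "y": {"type": "spatial", "axis": "y", "extent": [-90.0, 90.0]},
        "x": {"type": "spatial", "axis": "x", "extent": [-180.0, 180.0]},
        # "bounds" is an assumed custom type for the extra nv dimension.
        "nv": {"type": "bounds", "values": [0, 1]},
    },
    # One entry per variable, listing the dimensions it is indexed by.
    "cube:variables": {
        "tmax": {"type": "data", "dimensions": ["time", "y", "x"]},
        "tmax_uncertainty": {
            "type": "data",
            "dimensions": ["time", "y", "x", "nv"],
        },
    },
}


def undeclared_dimensions(coll):
    """Return {variable: [dims]} for dimensions used but never declared."""
    declared = set(coll["cube:dimensions"])
    missing = {}
    for name, var in coll["cube:variables"].items():
        extra = [d for d in var["dimensions"] if d not in declared]
        if extra:
            missing[name] = extra
    return missing


print(undeclared_dimensions(collection))
print(json.dumps(collection["cube:variables"], indent=2))
```

The small consistency check is only there to show that the metadata is enough to reason about the dataset's shape without touching the data itself.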
I think this would cover the use cases covered by the current Pangeo Catalog,
which is based on intake. Namely, it allows for data providers to expose their
available datasets, and users can browse a data provider’s catalog. By hitching
to STAC, we (pangeo) get to benefit from work done by the larger STAC community
(we wouldn’t have to maintain our own static site generator, for example). And
because intake is so flexible, users could continue to use intake as a Python
API to the data exposed through a STAC catalog.
Whether (and how) to expose STAC items is less clear. What do people do with
STAC items?
- Find all items matching some query
    >>> stac = pystac_client.Client.open(...)
    >>> items = stac.search(bbox=bbox, datetime=datetime)
    >>> print(items.matched())
    500
    >>> items = list(items.items())  # ItemCollection
This is used to great effect in libraries like stackstac to build an xarray
DataArray based just on STAC metadata, which avoids opening a bunch of files
just to read some metadata.
At least for Zarr (with consolidated metadata), this use case seems less
compelling to me. Users can already open the entire Zarr store quickly. That
said, it might be worth exploring, to enable workflows that build on multiple
datasets. For example, a user might be loading data from Landsat (stored in
COGs) and Daymet (stored in Zarr), and might want to have a single API for
loading the data from those two datasets at some region.
- Browse specific “scenes”: https://planet.stac.cloud/item/5k3UqPNLpDJMxoAfw1YUV9y9QsbZpgkBacBWwUJ9/3MxsQZbdxjScFVpNqiHrDSMjKgPQo9Uq1JYtn2CAwxwSj9F/sMSJpYrw6qjYkCm1EJhRCK1hhCMRyJhV8spzrYVRwuZmjssZuCJ9hGo9QriS4uMo?si=2&t=preview#11/29.567842/-95.911077
- Others, surely?
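For anyone who hasn't used item search, the "find all items matching some query" case above boils down to filtering item metadata by bounding box and time. Here's a rough, standard-library-only sketch of that filter. It is not how any particular STAC API implements search, the three items are invented, and it ignores things like bounding boxes that cross the antimeridian.

```python
from datetime import datetime, timezone


def bbox_intersects(a, b):
    """True if two [west, south, east, north] boxes overlap."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def parse_timestamp(ts):
    """Parse an RFC 3339 timestamp such as 2021-03-01T00:00:00Z."""
    return datetime.fromisoformat(ts.replace("Z", "+00:00"))


def search(items, bbox, start, end):
    """Return items whose bbox overlaps `bbox` and whose datetime is in [start, end]."""
    return [
        item
        for item in items
        if bbox_intersects(item["bbox"], bbox)
        and start <= parse_timestamp(item["properties"]["datetime"]) <= end
    ]


# Invented item metadata, trimmed to the fields the filter needs.
items = [
    {"id": "a", "bbox": [-96.5, 29.0, -95.5, 30.0],
     "properties": {"datetime": "2021-03-01T00:00:00Z"}},
    {"id": "b", "bbox": [10.0, 40.0, 11.0, 41.0],
     "properties": {"datetime": "2021-03-01T00:00:00Z"}},
    {"id": "c", "bbox": [-96.0, 29.5, -95.0, 30.5],
     "properties": {"datetime": "2019-01-01T00:00:00Z"}},
]

matched = search(
    items,
    bbox=[-96.0, 29.0, -95.0, 30.0],
    start=datetime(2021, 1, 1, tzinfo=timezone.utc),
    end=datetime(2021, 12, 31, tzinfo=timezone.utc),
)
print([item["id"] for item in matched])  # ['a']
```

The point is that everything the filter needs lives in the item metadata, so no data files are opened.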
This post is growing too long, so I’m going to skip this section. But I’ll note
that I think in theory we can dynamically generate STAC items in response to
user queries. The query would return a single STAC `Item` whose `assets` field
includes a URL that, when requested, returns the dataset filtered down to just
the data originally queried (something like xpublish). This is complicated, but
doable technically (I think).
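To give a flavor of what I mean, here is a minimal sketch of building such an Item on the fly. Everything specific in it is an assumption: the endpoint URL is a placeholder, the query parameter names are invented, and the server that would actually return the filtered data (the xpublish-like part) is not shown.

```python
from urllib.parse import parse_qs, urlencode, urlparse

# Hypothetical endpoint that would serve a filtered subset of the dataset.
# This is a placeholder, not a real service.
SUBSET_ENDPOINT = "https://example.com/datasets/example-daily-climate/subset"


def make_item(item_id, bbox, start, end):
    """Build a STAC Item dict describing one user query.

    bbox is [west, south, east, north]; start and end are RFC 3339 strings.
    """
    west, south, east, north = bbox
    # Parameter names (bbox, start, end) are assumptions for this sketch.
    query = urlencode(
        {"bbox": ",".join(str(v) for v in bbox), "start": start, "end": end}
    )
    return {
        "type": "Feature",
        "stac_version": "1.0.0",
        "id": item_id,
        "bbox": list(bbox),
        "geometry": {
            "type": "Polygon",
            "coordinates": [[
                [west, south], [east, south], [east, north],
                [west, north], [west, south],
            ]],
        },
        # A time range is expressed with a null datetime plus start/end.
        "properties": {
            "datetime": None,
            "start_datetime": start,
            "end_datetime": end,
        },
        "links": [],
        "assets": {
            "data": {
                "href": f"{SUBSET_ENDPOINT}?{query}",
                "title": "Subset matching the query",
                "roles": ["data"],
            }
        },
    }


item = make_item(
    "example-query",
    [-96.0, 29.0, -95.0, 30.0],
    "2021-01-01T00:00:00Z",
    "2021-12-31T00:00:00Z",
)
href = item["assets"]["data"]["href"]
print(href)
print(parse_qs(urlparse(href).query))
```

The asset `href` carries the whole query, so the Item stays a small piece of metadata and the expensive filtering only happens if someone requests that URL.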
That said, we don’t have to do this. You can have STAC collections without any items,
which I think would cover all the use cases handled by the current pangeo catalog.
- Collect use cases for STAC & earth systems datasets (I have one, the Planetary
Computer catalog, but don’t want to focus too narrowly on that)
- Update the datacube extension to handle variables (https://github.com/stac-extensions/datacube/issues/1; see also "Add cube:variables definition" by TomAugspurger, Pull Request #6 in stac-extensions/datacube)
- Work with the STAC community to understand how to represent CF-Conventions in STAC metadata
- Write a tool to automatically generate STAC Collections from xarray / zarr / netCDF
- STAC / ESGF meeting: https://docs.google.com/document/d/1RJCBosTT7QcV3iWA3vDO-p53Z1jXzbOWbFolppf3Kvs
- General discussion on STAC, datacubes, and variables: GitHub - radiantearth/stac-spec: SpatioTemporal Asset Catalog specification - making geospatial assets openly searchable and crawlable
- Proposed STAC / esm-collection-spec: Data Cube Extension: Variables and more · Issue #713 · radiantearth/stac-spec · GitHub
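
As a rough starting point for the tool mentioned in the next steps (generating STAC Collections from xarray / zarr / netCDF), here is a sketch that derives a `cube:variables`-style mapping from Zarr v2 consolidated metadata. It assumes the store was written by xarray, which records dimension names in an `_ARRAY_DIMENSIONS` attribute, and that the `.zmetadata` file has already been parsed into a dict. The sample metadata is invented and trimmed, and the mapping of `units` and `long_name` onto `unit` and `description` is my own guess at a sensible convention.

```python
def variables_from_zmetadata(zmetadata):
    """Build a cube:variables-style mapping from parsed consolidated metadata.

    Dimension coordinates (arrays whose only dimension is their own name)
    are skipped, since they would be described under cube:dimensions.
    """
    meta = zmetadata["metadata"]
    variables = {}
    for key, attrs in meta.items():
        if not key.endswith("/.zattrs"):
            continue
        name = key[: -len("/.zattrs")]
        if f"{name}/.zarray" not in meta:
            continue  # attributes of a group, not an array
        dims = attrs.get("_ARRAY_DIMENSIONS")
        if dims is None or list(dims) == [name]:
            continue
        entry = {"type": "data", "dimensions": list(dims)}
        if "units" in attrs:
            entry["unit"] = attrs["units"]
        if "long_name" in attrs:
            entry["description"] = attrs["long_name"]
        variables[name] = entry
    return variables


# Invented, heavily trimmed stand-in for a parsed .zmetadata file.
# Real .zarray entries also carry chunks, compressor, fill_value, and so on.
zmetadata = {
    "zarr_consolidated_format": 1,
    "metadata": {
        ".zgroup": {"zarr_format": 2},
        ".zattrs": {"title": "example"},
        "time/.zarray": {"shape": [365], "dtype": "<i8"},
        "time/.zattrs": {"_ARRAY_DIMENSIONS": ["time"]},
        "tmax/.zarray": {"shape": [365, 100, 200], "dtype": "<f4"},
        "tmax/.zattrs": {
            "_ARRAY_DIMENSIONS": ["time", "y", "x"],
            "units": "degrees C",
            "long_name": "daily maximum temperature",
        },
    },
}

print(variables_from_zmetadata(zmetadata))
```

A real tool would also need to fill in `cube:dimensions` and the spatial and temporal extents, which requires reading the coordinate values and not just the attributes.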