Hi all,
I’ve been thinking about STAC and Earth Systems datasets (primarily stored in
Zarr) a bit and wanted to share thoughts. Sorry for the length, but my recommendations are that:
- Pangeo should adopt (and extend) STAC to describe our datasets as STAC
Collections (i.e. pangeo-forge should produce STAC Collections of the data it
processes).
- We should explore dynamically generating STAC Items for our datasets, if we
can identify compelling use cases for item-level operations.
Just Enough STAC Background for Earth Systems folks
For those new to STAC, it’s a specification for how geospatial asset metadata is
structured and queried. Roughly speaking, there are “Collections”, which
describe an entire dataset like Landsat 8 Collection 2 Level-2, and “Items”,
which describe a single “snapshot” of that dataset at a specific spatial region
and time (e.g. this landsat item).
A single STAC Item will include assets, which are links to the actual data files
themselves. These are often COGs, but STAC is agnostic to the actual file format
being cataloged (which is good news for us).
Another important point: STAC is extensible. Different domains define
extensions for representing their data (e.g., electro-optical, SAR, pointcloud,
datacube).
The STAC Website does a good job laying out what STAC is and why it’s being
worked on.
Just enough “Earth Systems” for STAC folks
Earth Systems datasets typically include several multi-dimensional, labeled
arrays. The variables will often share dimensions. For example, you might have
the variables `prcp`, `tmin`, and `tmax`, all of which have dimensions
`(time, y, x)`. You might also have something like uncertainty measurements on
each of those, which would be indexed by `(time, y, x, nv)`, where `nv` is a
dimension for something like `(lower_bound, upper_bound)`.
The data model is probably best described by the NetCDF format. The
datasets often include metadata following the CF Conventions.
(pangeo folks, correct me if I got any of that wrong)
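To make that data model concrete, here's a minimal sketch in xarray (the
variable names, shapes, and random values are invented for illustration):

```python
import numpy as np
import xarray as xr

time = np.arange("2020-01-01", "2020-01-04", dtype="datetime64[D]")
y = np.arange(3)
x = np.arange(4)
shape = (len(time), len(y), len(x))

ds = xr.Dataset(
    data_vars={
        # three variables sharing the same (time, y, x) dimensions
        "prcp": (("time", "y", "x"), np.random.rand(*shape)),
        "tmin": (("time", "y", "x"), np.random.rand(*shape)),
        "tmax": (("time", "y", "x"), np.random.rand(*shape)),
        # uncertainty bounds share those dims plus an extra "nv" dimension
        "tmin_uncertainty": (("time", "y", "x", "nv"), np.random.rand(*shape, 2)),
    },
    coords={"time": time, "y": y, "x": x},
)
print(ds.sizes)
```

Serialized to Zarr or NetCDF, this is the shape of dataset we'd want a STAC
Collection to describe.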
Earth Systems Data and STAC Collections
I think there’s a natural parallel between these Earth Systems datasets and STAC
Collections (or Catalogs?). Both cover some region of space and time. And STAC is
flexible enough to model the additional fields that show up (things like
additional dimensions, coordinates, multiple variables, etc.).
I think this would cover the use cases covered by the current Pangeo Catalog,
which is based on intake. Namely, it allows data providers to expose their
available datasets, and users can browse a data provider’s catalog. By hitching
to STAC, we (pangeo) get to benefit from work done by the larger STAC community
(we wouldn’t have to maintain our own static site generator, for example). And
because intake is so flexible, users could continue to use intake as a Python
API to the data exposed through a STAC catalog.
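As a rough sketch of what this could look like, here's a minimal STAC
Collection document for a hypothetical Zarr-backed dataset, using the datacube
extension's `cube:dimensions` field to record the shared dimensions. The ID,
URLs, extension version, and Zarr media type here are illustrative assumptions,
not a settled convention:

```python
import json

# A minimal, hypothetical STAC Collection for a Zarr-backed dataset.
# STAC documents are plain JSON, so a dict is all we need to sketch one.
collection = {
    "type": "Collection",
    "stac_version": "1.0.0",
    "stac_extensions": [
        # the datacube extension adds dimension (and, proposed, variable) metadata
        "https://stac-extensions.github.io/datacube/v2.2.0/schema.json"
    ],
    "id": "daymet-example",
    "description": "An example Earth Systems dataset cataloged as a STAC Collection.",
    "license": "proprietary",
    "extent": {
        "spatial": {"bbox": [[-180.0, -90.0, 180.0, 90.0]]},
        "temporal": {"interval": [["1980-01-01T00:00:00Z", None]]},
    },
    "cube:dimensions": {
        "time": {"type": "temporal", "extent": ["1980-01-01T00:00:00Z", None]},
        "x": {"type": "spatial", "axis": "x"},
        "y": {"type": "spatial", "axis": "y"},
    },
    # collection-level asset pointing at the Zarr store itself
    "assets": {
        "zarr": {
            "href": "https://example.com/daymet.zarr",
            "type": "application/vnd+zarr",
            "roles": ["data"],
        }
    },
    "links": [],
}
print(json.dumps(collection)[:60])
```

Note there are no Items here at all: the Collection alone, with an asset
pointing at the Zarr store, already covers the "browse and open a dataset"
workflow.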
Earth Systems Data and STAC Items
Whether to (and how) to expose STAC items is less clear. What do people do with
STAC items?
- Find all items matching some query
>>> stac = pystac_client.Client.open(...)
>>> items = stac.search(bbox=bbox, datetime=datetime)
>>> print(items.matched())
500
>>> items = list(items.items()) # ItemCollection
This is used to great effect in libraries like stackstac to build an xarray
DataArray based just on STAC metadata, which avoids opening a bunch of files
just to read some metadata.
At least for Zarr (with consolidated metadata), this use case seems less
compelling to me. Users can already open the entire Zarr store quickly. That
said, it might be worth exploring, to enable workflows that build on multiple
datasets. For example, a user might be loading data from Landsat (stored in
COGs) and Daymet (stored in Zarr), and might want to have a single API for
loading the data from those two datasets at some region.
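To make that cross-dataset idea concrete, here's a sketch of the "single API"
routing step: given STAC Items from different datasets (plain dicts here), each
asset is dispatched to the right reader based on its media type. The reader
names, hrefs, and media-type strings are placeholder assumptions:

```python
# Hypothetical media types; the COG string follows common STAC usage,
# the Zarr one is an informal convention.
ZARR_TYPE = "application/vnd+zarr"
COG_TYPE = "image/tiff; application=geotiff; profile=cloud-optimized"

def pick_reader(asset: dict) -> str:
    """Choose a reader for a STAC asset based on its media type."""
    if asset.get("type") == ZARR_TYPE:
        return "open_zarr"       # e.g. hand off to xr.open_zarr(asset["href"])
    if asset.get("type") == COG_TYPE:
        return "open_rasterio"   # e.g. hand off to stackstac / rioxarray
    return "unknown"

items = [
    {"id": "daymet", "assets": {
        "zarr": {"href": "https://example.com/daymet.zarr", "type": ZARR_TYPE}}},
    {"id": "landsat-scene", "assets": {
        "B02": {"href": "https://example.com/B02.tif", "type": COG_TYPE}}},
]
for item in items:
    for name, asset in item["assets"].items():
        print(item["id"], name, pick_reader(asset))
```

The point is that a user-facing loader could accept a mixed ItemCollection and
hide the COG-vs-Zarr distinction behind one call.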
- Browse specific “scenes”: https://planet.stac.cloud/item/5k3UqPNLpDJMxoAfw1YUV9y9QsbZpgkBacBWwUJ9/3MxsQZbdxjScFVpNqiHrDSMjKgPQo9Uq1JYtn2CAwxwSj9F/sMSJpYrw6qjYkCm1EJhRCK1hhCMRyJhV8spzrYVRwuZmjssZuCJ9hGo9QriS4uMo?si=2&t=preview#11/29.567842/-95.911077
- Others, surely?
Dynamically generate STAC Items
This post is growing too long, so I’m going to skip this section. But I’ll note
that I think in theory we can dynamically generate STAC items in response to
user queries. The query would return a single STAC `Item` whose `assets` field
includes a URL that, when requested, returns the dataset filtered down to just
the data originally queried (something like xpublish). This is complicated, but
doable technically (I think).
That said, we don’t have to do this. You can have STAC Collections without any Items,
which I think would cover all the use cases handled by the current pangeo catalog.
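For anyone curious what a dynamically generated Item might look like, here's a
hypothetical sketch: the endpoint, parameter names, and media type are all
invented, and the asset href points at an imagined service (something like
xpublish) that would return only the queried subset:

```python
from urllib.parse import urlencode

def make_dynamic_item(dataset_id: str, bbox: list, datetime_range: str) -> dict:
    """Build a STAC Item whose asset href encodes the user's query."""
    start, end = datetime_range.split("/")
    query = urlencode({"bbox": ",".join(map(str, bbox)), "datetime": datetime_range})
    return {
        "type": "Feature",
        "stac_version": "1.0.0",
        "id": f"{dataset_id}-subset",
        "bbox": bbox,
        "geometry": None,  # a real implementation would build a Polygon from bbox
        "properties": {
            "datetime": None,
            "start_datetime": start,
            "end_datetime": end,
        },
        "assets": {
            "data": {
                # requesting this URL would return just the subset queried
                "href": f"https://example.com/subset?{query}",
                "type": "application/vnd+zarr",
                "roles": ["data"],
            }
        },
        "links": [],
    }

item = make_dynamic_item("daymet", [-95.0, 29.0, -94.0, 30.0], "2020-01-01/2020-12-31")
print(item["assets"]["data"]["href"])
```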
Proposed Work Items
- Collect use cases for STAC & earth systems datasets (I have one, the Planetary
Computer catalog, but don’t want to focus too narrowly on that)
- Update the datacube extension to handle variables
(https://github.com/stac-extensions/datacube/issues/1; see Add cube:variables
definition by TomAugspurger · Pull Request #6 · stac-extensions/datacube)
- Work with the STAC community to understand how to represent CF Conventions in
STAC metadata
- Write a tool to automatically generate STAC Collections from xarray / zarr / netCDF
Miscellaneous links
- STAC / ESGF meeting: https://docs.google.com/document/d/1RJCBosTT7QcV3iWA3vDO-p53Z1jXzbOWbFolppf3Kvs
- General discussion on STAC, datacubes, and variables: GitHub - radiantearth/stac-spec: SpatioTemporal Asset Catalog specification - making geospatial assets openly searchable and crawlable
- Proposed STAC / esm-collection-spec: Data Cube Extension: Variables and more · Issue #713 · radiantearth/stac-spec · GitHub
- https://api.weather.gc.ca/