STAC and Earth Systems datasets

Hi all,

I’ve been thinking about STAC and Earth Systems datasets (primarily stored in
Zarr) a bit and wanted to share thoughts. Sorry for the length, but my recommendations are that:

  1. Pangeo should adopt (and extend) STAC to describe our datasets as STAC
    Collections (i.e. pangeo-forge should produce STAC collections of the data it
    processes).
  2. We should explore dynamically generating STAC Items for our datasets, if we
    can identify compelling use cases for item-level operations.

Just Enough STAC Background for Earth Systems folks

For those new to STAC, it’s a specification for how geospatial asset metadata is
structured and queried. Roughly speaking, there are “Collections”, which
describe an entire dataset like Landsat 8 Collection 2 Level-2, and “Items”,
which describe a single “snapshot” of that dataset at a specific spatial region
and time (e.g. this landsat item).

A single STAC Item will include assets, which are links to the actual data files
themselves. These are often COGs, but STAC is agnostic to the actual file format
being cataloged (which is good news for us).
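For those who haven’t seen one, here’s a heavily abridged sketch of an Item’s shape as a Python dict (every value below is made up, and real Items carry more required fields):

# A heavily abridged sketch of a STAC Item (all values hypothetical).
item = {
    "type": "Feature",
    "stac_version": "1.0.0",
    "id": "LC08_L2SP_046027_20200908_02_T1",
    "bbox": [-124.3, 46.4, -121.9, 48.6],
    "properties": {"datetime": "2020-09-08T18:45:56Z"},
    "assets": {
        "SR_B2": {
            # Link to the actual data file, often a COG.
            "href": "https://example.com/landsat/SR_B2.TIF",
            "type": "image/tiff; application=geotiff; profile=cloud-optimized",
            "roles": ["data"],
        }
    },
}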

Another important point: STAC is extensible. Different domains define
extensions for representing their data (e.g., electro-optical, SAR, pointcloud,
datacube).

The STAC Website does a good job laying out what STAC is and why it’s being
worked on.

Just enough “Earth Systems” for STAC folks

Earth Systems datasets typically include several multi-dimensional, labeled
arrays. The variables will often share dimensions. For example, you might have the variables
prcp, tmin, tmax, all of which have dimensions (time, y, x). You might also have something
like uncertainty measurements on each of those, which would be indexed by (time, y, x, nv), where nv is a dimension for something like (lower_bound, upper_bound).

The data model is probably best described by the NetCDF format. The
datasets often include metadata following the CF Conventions.

(pangeo folks, correct me if I got any of that wrong)
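To make that concrete, here’s a minimal sketch of such a dataset in xarray (the variable names and sizes are just illustrative):

import numpy as np
import pandas as pd
import xarray as xr

# Toy dataset: three variables sharing (time, y, x), plus an
# uncertainty variable carrying an extra bounds dimension "nv".
time = pd.date_range("2000-01-01", periods=365)
shape = (365, 100, 200)

ds = xr.Dataset(
    {
        "prcp": (("time", "y", "x"), np.random.rand(*shape)),
        "tmin": (("time", "y", "x"), np.random.rand(*shape)),
        "tmax": (("time", "y", "x"), np.random.rand(*shape)),
        "uncertainty": (("time", "y", "x", "nv"), np.random.rand(*shape, 2)),
    },
    coords={"time": time, "nv": ["lower_bound", "upper_bound"]},
)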

Earth Systems Data and STAC Collections

I think there’s a natural parallel between these Earth Systems datasets and STAC
Collections (or Catalogs?). They cover some region of space and time. And STAC is
flexible enough to model the additional fields that show up (things like
additional dimensions, coordinates, multiple variables, etc.).

I think this would handle the use cases covered by the current Pangeo Catalog,
which is based on intake. Namely, it allows for data providers to expose their
available datasets, and users can browse a data provider’s catalog. By hitching
to STAC, we (pangeo) get to benefit from work done by the larger STAC community
(we wouldn’t have to maintain our own static site generator, for example). And
because intake is so flexible, users could continue to use intake as a Python
API to the data exposed through a STAC catalog.

Earth Systems Data and STAC Items

Whether (and how) to expose STAC items is less clear. What do people do with
STAC items?

  1. Find all items matching some query
    >>> import pystac_client
    >>> stac = pystac_client.Client.open(...)
    >>> items = stac.search(bbox=bbox, datetime=datetime)
    >>> print(items.matched())
    500
    >>> items = list(items.items())  # ItemCollection

This is used to great effect in libraries like stackstac to build an xarray
DataArray based just on STAC metadata, which avoids opening a bunch of files
just to read some metadata.

At least for Zarr (with consolidated metadata), this use case seems less
compelling to me. Users can already open the entire Zarr store quickly. That
said, it might be worth exploring, to enable workflows that build on multiple
datasets. For example, a user might be loading data from Landsat (stored in
COGs) and Daymet (stored in Zarr), and might want to have a single API for
loading the data from those two datasets at some region.
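For reference, “opening the entire store quickly” looks something like this, assuming fsspec/zarr can resolve the (hypothetical) URL:

import xarray as xr

# With consolidated metadata, one read of .zmetadata is enough for
# xarray to construct the full (lazy) Dataset. URL is hypothetical.
ds = xr.open_zarr("https://example.com/daymet/daily/hi.zarr", consolidated=True)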

  2. Browse specific “scenes”: https://planet.stac.cloud/item/5k3UqPNLpDJMxoAfw1YUV9y9QsbZpgkBacBWwUJ9/3MxsQZbdxjScFVpNqiHrDSMjKgPQo9Uq1JYtn2CAwxwSj9F/sMSJpYrw6qjYkCm1EJhRCK1hhCMRyJhV8spzrYVRwuZmjssZuCJ9hGo9QriS4uMo?si=2&t=preview#11/29.567842/-95.911077
  3. Others, surely?

Dynamically generate STAC Items

This post is growing too long, so I’m going to skip this section. But I’ll note
that I think in theory we can dynamically generate STAC items in response to
user queries. The query would return a single STAC `Item` whose `assets` field
includes a URL that, when requested, returns the dataset filtered down to just
the data originally queried (something like xpublish). This is complicated, but
doable technically (I think).
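Just to gesture at the idea, here’s a hypothetical sketch of a server-side function that synthesizes such an Item; every name and URL in it is invented:

def make_dynamic_item(dataset_id, bbox, start, end):
    # Synthesize a STAC Item on the fly; the asset href points back at a
    # service (something xpublish-like) that subsets the Zarr store when hit.
    return {
        "type": "Feature",
        "stac_version": "1.0.0",
        "id": f"{dataset_id}-{start}-{end}",
        "bbox": bbox,
        "properties": {"start_datetime": start, "end_datetime": end},
        "assets": {
            "data": {
                "href": (
                    f"https://example.com/datasets/{dataset_id}/subset"
                    f"?bbox={','.join(map(str, bbox))}&datetime={start}/{end}"
                ),
                "roles": ["data"],
            }
        },
    }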

That said, we don’t have to do this. You can have STAC collections without any items,
which I think would cover all the use cases handled by the current Pangeo catalog.

Proposed Work Items

miscellaneous links

8 Likes

Thanks Tom for taking the time to write this up. Your overview is correct IMO. And I endorse your proposed work items.

FWIW, @charlesbluca put together some prototypical STAC collections for our legacy catalog here:

One thing that was always unclear to me was how best to link to Zarr data from STAC. We settled on the concept of “collection level assets”, but this always felt weird because the Zarr store is not a file but a directory. In the case of consolidated metadata, do you link to the consolidated metadata file itself, or to the top level directory of the group? These issues need to be fleshed out.

2 Likes

Thanks for the great writeup @TomAugspurger. Regarding the intake-STAC-Zarr connection, see also Adding support for Zarr datasets · Issue #70 · intake/intake-stac · GitHub. Intake-STAC could use some additional hands on deck to continue being useful. I think of it as a very general STAC → Xarray connector, but it likely needs some refactoring to make the most of new xarray plugins and clever libraries like stackstac.

2 Likes

Great point Scott. I would also note that intake-stac is a key layer if we care about backwards compatibility with our existing legacy intake catalog: we should be able to just swap the old catalog for the new, STAC-based one with minimal changes to user code.

1 Like

One thing that was always unclear to me was how best to link to Zarr data from STAC. We settled on the concept of “collection level assets”, but this always felt weird because the Zarr store is not a file but a directory. In the case of consolidated metadata, do you link to the consolidated metadata file itself, or to the top level directory of the group?

That came up briefly at Add cube:variables definition by TomAugspurger · Pull Request #6 · stac-extensions/datacube (github.com). I think it’s just up to us to define URL paths and roles that work for our needs. This will require some iteration and coordination between the extensions, data providers, and client libraries (like intake-stac). For now, I recommend a system that supports converting a URL to the xarray Dataset that the STAC collection is describing.
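For illustration, a collection-level asset pointing at a Zarr store might look something like this (the href, media type, and roles here are one possible convention, not a settled spec):

asset = {
    "href": "https://example.com/daymet/daily/hi.zarr",  # root of the Zarr group
    "type": "application/vnd+zarr",
    "roles": ["data", "zarr", "https"],
    "title": "Daily Hawaii Daymet HTTPS Zarr root",
}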

Intake-STAC could use some additional hands on deck to continue being useful.

Yep, I’ll be sure to update intake-stac once Add cube:variables definition by TomAugspurger · Pull Request #6 · stac-extensions/datacube (github.com) is in.

2 Likes

Thanks for getting the discussion rolling on this Tom. As someone coming from the EO/STAC background, I have been trying to better understand high-level metadata usage and data access patterns for the “Earth Systems” user community. Currently, I see two main metadata access patterns that could be well served by using STAC to describe Zarr archives.

  1. Allowing users a consistent search experience to discover and load archives based on Dimensions and Data Variables. The intake-esm motivation statement provides a good summary of this requirement.

  2. Generating static catalog pages with the equivalent of the xarray.Dataset HTML display to facilitate data discovery (a sketch of this follows below).
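For the second pattern, a minimal sketch of generating a static page from xarray’s rich HTML repr (the store URL is hypothetical):

import xarray as xr

# Render the Dataset's rich HTML repr to a static catalog page.
ds = xr.open_zarr("https://example.com/archive.zarr", consolidated=True)
with open("dataset.html", "w") as f:
    f.write(ds._repr_html_())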

In addition, the Planetary Computer effort has special considerations, as it will be exposing both large volumes of EO data formats and “Earth Systems” data as Zarr archives, so providing a uniform discovery endpoint could improve client library maintenance and interoperability.

The intake-esm efforts demonstrate that there is community desire for high-level metadata search over multidimensional data archives. I would be very curious to hear more from “Earth Systems” users about how they want to find and load archive data. Are there common variable names, such that users want all archives containing a given variable (temp)? I’m also assuming that, given the efforts to build intermediate data archives, we may see a proliferation of Zarr archives with smaller spatial dimensions, so users may want to search for archives with coverage in a defined area, as we often do in the EO world.

I’m also curious about whether effort should be focused on modeling archives in STAC or with continued focus on intake-esm. One notable advantage of STAC is that intake-esm is focused purely on Python users, while STAC could enable search and discovery tooling for a broader ecosystem as the Zarr protocol is implemented in other languages.

Pressing forward with modeling multidimensional archives in STAC, I think the “Proposed Work Items” capture a good path forward, and continuing to refine and extend the datacube extension is a good entry point. At the specification / technical level, there are a few considerations to investigate:

  1. There has been previous discussion about whether Zarr archives are best semantically represented as STAC Items or Collections. I think there are pros and cons to both approaches, but a few things to consider: the STAC specification is focused on alignment with the OGC Features API. I’ll let someone with more OGC knowledge weigh in here, but my understanding is that there is no entry point for search across Collections, which might limit the functionality we could implement if archives are modeled as Collections.

  2. The stac-api-spec is moving towards adopting the OGC Features CQL spec for filtering and search. I have done some cursory research, but one outstanding question is whether properties in the datacube extension’s nested dictionary structure can be exposed as queryables, and how the CQL predicates might support filtering on these nested structures.

At a high level I think it is important to clarify that we are focusing on capturing metadata for Zarr archives to facilitate archive discovery and loading. In the EO world, due to the nature of our traditional data organization and formats, we leverage STAC metadata to facilitate byte-level access. The lighter, self-describing nature of Zarr archives makes this unnecessary in most cases (I imagine there are edge cases where consolidated_metadata latency is very large, but this seems less prevalent).

🙂 Apologies, this is my first post on Discourse, so the number of links I could post is restricted.

2 Likes

Thanks for bringing up intake-esm. I wonder if @andersy005 or others could chime in on what features it has that would be missed based on what I’ve outlined here. One that comes to mind immediately is (in STAC terms) collection-level search:

In [5]: col_subset = col.search(
   ...:     experiment_id=["historical", "ssp585"],
   ...:     table_id="Oyr",
   ...:     variable_id="o2",
   ...:     grid_label="gn",
   ...: )

So you have a large collection of many related datasets, and filter them down according to some properties of each dataset. IIRC, collection-level search is an eventual goal of the STAC API, but isn’t really being worked on yet. I think in the meantime, libraries like intake-stac can do the search / filtering client side (like intake-esm).
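As a sketch of what that client-side filtering could look like with a recent pystac (the catalog URL and summary keys are hypothetical, and this assumes collections publish those properties in their summaries):

import pystac

catalog = pystac.Catalog.from_file("https://example.com/cmip6/catalog.json")

def search_collections(catalog, **query):
    # Keep collections whose summaries contain every requested value.
    for collection in catalog.get_collections():
        summaries = collection.summaries.to_dict()
        if all(v in summaries.get(k, []) for k, v in query.items()):
            yield collection

matches = list(search_collections(catalog, experiment_id="historical", variable_id="o2"))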

Anything else?

2 Likes

Intake-ESM is a relative newcomer in this area. The real reference point for Earth System data search is the ESGF search API: The ESGF Search RESTful API — esg-search v4.7.19 documentation

I’m hoping some ESGF folks can chime about their search use cases.

1 Like

Hi everyone, I am just joining this thread so apologies if I miss some important points from earlier in the discussion. I am coming at this from an ESGF community perspective. ESGF is a distributed infrastructure which serves climate model output from CMIP5, CMIP6 and CORDEX amongst others.

As part of a wider initiative to re-engineer the software that makes up the federation, we want to move the search API towards a wider agreed community standard. We’ve looked at STAC and like it, but it doesn’t quite fit our needs. If we can extend it to do so, that would be great, so we want to support such efforts. The link @rabernat pointed out above describes the existing API. We require faceted search capability based on a set of controlled vocabularies, including, for example with CMIP6, things like institution and experiment IDs and variable names.

We need little beyond that and would be wary of trying to include too much in the STAC metadata when it can be extracted from the data itself. The search API has worked with a traditional model of HTTP file serving of netCDF data, but we are interested in the use case for Zarr.

2 Likes

Another really useful feature, missing from STAC at the moment, is free-text search (i.e. q=). As I understand it, you are required to know the facet names and select one of those. Along with that, something which is imperative for any UI building on top of the STAC API is facet discovery (e.g. an OpenSearch Description Document). From our perspective at CEDA as a data archive, we have many different datasets and collections, each of which has its own vocabulary and facets. A generic netCDF-like spec is a good start, but different communities will want to be able to search using their own vocabulary. To be really useful as a search, the STAC API should be able to expose the relevant search facets and, ideally, a list of available values. I was looking at the query-spec as a place where this information could be provided.

I appreciate that there is a difference between STAC catalogs (static) and STAC APIs (dynamic). Your comment on dynamically generated STAC items is interesting. I have always come at it from the point of view that you generate and index your STAC items and they are static, requiring some further processing if you want to subset a time series, for example.

1 Like

Thanks for sharing that feedback. I’m a bit behind on the STAC API side of things, but I’ll keep that in mind as I get up to speed. I’ve been approaching this with the simplest possible dataset in mind, and I’ll make sure considerations like free-text search stay on the radar as we try to catalog larger and more complex datasets.

This is another thing I’m trying to figure out, and I would appreciate input from the STAC community on it. What are the best practices around “duplicating” information like this between files and STAC metadata? At least for COGs, we do include things like proj:shape and proj:transform in the STAC metadata, even when they can be obtained from the file. This is extremely useful, since libraries like stackstac can build DataArrays from metadata alone, without having to open individual files (which is slow).
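For example, here’s a hedged sketch of that metadata-only workflow (the endpoint and collection ID follow the Planetary Computer, but treat the details as illustrative):

import pystac_client
import stackstac

catalog = pystac_client.Client.open(
    "https://planetarycomputer.microsoft.com/api/stac/v1"
)
items = catalog.search(
    collections=["landsat-8-c2-l2"],
    bbox=[-124.7, 45.5, -123.9, 46.3],
    datetime="2021-01-01/2021-06-30",
).get_all_items()

# stackstac assembles a lazy DataArray purely from the proj:* metadata,
# without opening any of the underlying COGs.
da = stackstac.stack(items)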

Contrast this situation with Zarr, where the metadata are just another lightweight JSON file for the whole cube!

Short update here:


One point of discussion on the item: My current plan is to expose the URL of the Zarr dataset as a collection-level asset. Looking at STAC Browser, you’ll see a few assets like “Daily Hawaii Daymet HTTPS Zarr root”. The idea is to point to the root of the Zarr store. By convention, we can get various tools to understand the asset roles. In that example I have roles like ["data", "zarr", "https"], and ["data", "zarr", "abfs"] for the Azure Blob File System URL that would be used by adlfs. My next work item is to update intake-stac to look for collection-level assets with those roles. Then we get a workflow like

In [1]: import intake

In [2]: cat = intake.open_stac_catalog("https://raw.githubusercontent.com/TomAugspurger/xstac/main/examples/daymet/catalog.json")  # connect to top-level Catalog

In [3]: hi = cat['daymet-daily-hi']  # browse to specific Collection

In [4]: ds = hi.to_xarray(asset="https")

Edit: prototype for adding collection-level assets to intake-stac is at [WIP]: Load collection-level assets into xarray by TomAugspurger · Pull Request #90 · intake/intake-stac · GitHub

5 Likes