STAC and Earth Systems datasets

Hi all,

I’ve been thinking about STAC and Earth Systems datasets (primarily stored in
Zarr) a bit and wanted to share thoughts. Sorry for the length, but my recommendations are that:

  1. Pangeo should adopt (and extend) STAC to describe our datasets as STAC
    Collections (i.e. pangeo-forge should produce STAC collections of the data it
    processes).
  2. We should explore dynamically generating STAC Items for our datasets, if we
    can identify compelling use cases for item-level operations.

Just Enough STAC Background for Earth Systems folks

For those new to STAC, it’s a specification for how geospatial asset metadata is
structured and queried. Roughly speaking, there are “Collections”, which
describe an entire dataset like Landsat 8 Collection 2 Level-2, and “Items”,
which describe a single “snapshot” of that dataset at a specific spatial region
and time (e.g. this landsat item).

A single STAC Item will include assets, which are links to the actual data files
themselves. These are often COGs, but STAC is agnostic to the actual file format
being cataloged (which is good news for us).

Another important point: STAC is extensible. Different domains define
extensions for representing their data (e.g., electro-optical, SAR, pointcloud,
datacube).

The STAC Website does a good job laying out what STAC is and why it’s being
worked on.

Just enough “Earth Systems” for STAC folks

Earth Systems datasets typically include several multi-dimensional, labeled
arrays. The variables will often share dimensions. For example, you might have the variables
prcp, tmin, tmax, all of which have dimensions (time, y, x). You might also have something
like uncertainty measurements on each of those, which would be indexed by (time, y, x, nv), where nv is a dimension for something like (lower_bound, upper_bound).

The data model is probably best described by the NetCDF format. The
datasets often include metadata following the CF Conventions.

(pangeo folks, correct me if I got any of that wrong)
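For concreteness, here is a minimal xarray sketch of that kind of dataset (the variable names follow the example above; the sizes, the uncertainty-variable name, and the zero-filled data are purely illustrative):

import numpy as np
import pandas as pd
import xarray as xr

# Illustrative only: three variables share the (time, y, x) dimensions, and an
# uncertainty variable carries an extra "nv" dimension for (lower_bound, upper_bound).
time, y, x, nv = 10, 4, 5, 2
ds = xr.Dataset(
    {
        "prcp": (("time", "y", "x"), np.zeros((time, y, x))),
        "tmin": (("time", "y", "x"), np.zeros((time, y, x))),
        "tmax": (("time", "y", "x"), np.zeros((time, y, x))),
        "tmax_uncertainty": (("time", "y", "x", "nv"), np.zeros((time, y, x, nv))),
    },
    coords={"time": pd.date_range("2000-01-01", periods=time, freq="D")},
)
print(ds)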

Earth Systems Data and STAC Collections

I think there’s a natural parallel between these Earth Systems datasets and STAC
Collections (or Catalogs?). They cover some region of space and time. And STAC is
flexible enough to model the additional fields that show up (things like
additional dimensions, coordinates, multiple variables, etc.).

I think this would address the use cases covered by the current Pangeo Catalog,
which is based on intake. Namely, it allows for data providers to expose their
available datasets, and users can browse a data provider’s catalog. By hitching
to STAC, we (pangeo) get to benefit from work done by the larger STAC community
(we wouldn’t have to maintain our own static site generator, for example). And
because intake is so flexible, users could continue to use intake as a Python
API to the data exposed through a STAC catalog.

Earth Systems Data and STAC Items

Whether (and how) to expose STAC items is less clear. What do people do with
STAC items?

  1. Find all items matching some query
    >>> stac = pystac_client.Client.open(...)
    >>> items = stac.search(bbox=bbox, datetime=datetime)
    >>> print(items.matched())
    500
    >>> items = list(items.items())  # ItemCollection

This is used to great effect in libraries like stackstac to build an xarray
DataArray based just on STAC metadata, which avoids opening a bunch of files
just to read some metadata.

At least for Zarr (with consolidated metadata), this use case seems less
compelling to me. Users can already open the entire Zarr store quickly. That
said, it might be worth exploring, to enable workflows that build on multiple
datasets. For example, a user might be loading data from Landsat (stored in
COGs) and Daymet (stored in Zarr), and might want a single API for loading the
data from those two datasets at some region (a rough sketch follows after this list).

  2. Browse specific “scenes”: https://planet.stac.cloud/item/5k3UqPNLpDJMxoAfw1YUV9y9QsbZpgkBacBWwUJ9/3MxsQZbdxjScFVpNqiHrDSMjKgPQo9Uq1JYtn2CAwxwSj9F/sMSJpYrw6qjYkCm1EJhRCK1hhCMRyJhV8spzrYVRwuZmjssZuCJ9hGo9QriS4uMo?si=2&t=preview#11/29.567842/-95.911077
  3. Others, surely?
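Picking up the multi-dataset example from above, here is a rough sketch of what a combined workflow might look like. This is hypothetical: the STAC endpoint, collection id, and Zarr URL are placeholders, and it assumes pystac-client plus stackstac on the COG side.

import fsspec
import pystac_client
import stackstac
import xarray as xr

bbox = [-156.3, 18.8, -154.7, 20.3]  # illustrative region

# Landsat (COGs): item-level search, then a lazy DataArray built from STAC metadata.
catalog = pystac_client.Client.open("https://example.com/stac")  # placeholder endpoint
items = catalog.search(collections=["landsat-c2-l2"], bbox=bbox).get_all_items()
landsat = stackstac.stack(items, bounds_latlon=bbox)

# Daymet (Zarr): a single collection-level asset, opened directly with xarray.
daymet = xr.open_zarr(
    fsspec.get_mapper("https://example.com/daymet-hi.zarr"),  # placeholder URL
    consolidated=True,
)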

Dynamically generate STAC Items

This post is growing too long, so I’m going to skip this section. But I’ll note
that I think in theory we can dynamically generate STAC items in response to
user queries. The query would return a single STAC `Item` whose `assets` field
includes a URL that, when requested, returns the dataset filtered down to just
the data originally queried (something like xpublish). This is complicated, but
doable technically (I think).

That said, we don’t have to do this. You can have STAC collections without any items,
which I think would cover all the use cases handled by the current pangeo catalog.

Proposed Work Items

miscellaneous links

10 Likes

Thanks Tom for taking the time to write this up. Your overview is correct IMO. And I endorse your proposed work items.

FWIW, @charlesbluca put together some prototypical STAC collections for our legacy catalog here:

One thing that was always unclear to me was how best to link to Zarr data from STAC. We settled on the concept of “collection level assets”, but this always felt weird because the Zarr store is not a file but a directory. In the case of consolidated metadata, do you link to the consolidated metadata file itself, or to the top level directory of the group? These issues need to be fleshed out.
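For reference, the two options would look roughly like this as asset definitions (the hrefs are placeholders, and the roles shown are just a possible convention, not anything standardized):

# Option 1: point at the top-level directory (the root of the Zarr group).
asset_root = {
    "href": "https://example.com/daymet.zarr",
    "roles": ["data", "zarr"],
}

# Option 2: point at the consolidated-metadata file inside the store.
asset_consolidated = {
    "href": "https://example.com/daymet.zarr/.zmetadata",
    "roles": ["metadata"],
}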

2 Likes

Thanks for the great writeup @TomAugspurger . Regarding the intake-STAC-Zarr connection see also Adding support for Zarr datasets · Issue #70 · intake/intake-stac · GitHub . Intake-STAC could use some additional hands on deck to continue being useful. I think of it as a very general STAC → Xarray connector, but likely needs some refactoring to make the most of new xarray plugins and clever libraries like stackstac.

2 Likes

Great point Scott. I would also note that intake-stac is a key layer if we care about backwards compatibility with our existing legacy intake catalog: we should be able to just swap the old catalog for the new, stac-based one with minimal changes to user code.

1 Like

One thing that was always unclear to me was how best to link to Zarr data from STAC. We settled on the concept of “collection level assets”, but this always felt weird because the Zarr store is not a file but a directory. In the case of consolidated metadata, do you link to the consolidated metadata file itself, or to the top level directory of the group?

That came up briefly at Add cube:variables definition by TomAugspurger · Pull Request #6 · stac-extensions/datacube (github.com). I think it’s just up to us to define URL paths and roles that work for our needs. This will require some iteration and coordination between the extensions, data providers, and client libraries (like intake-stac). For now, I recommend a system that supports converting a URL to the xarray Dataset that the STAC collection is describing.
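As a strawman for that "URL → xarray Dataset" convention, the kind of helper I have in mind might look like this (the role name, the consolidated-metadata assumption, and the collection URL are all placeholders, not an agreed convention):

import fsspec
import pystac
import xarray as xr

def open_collection_zarr(collection_url: str, role: str = "data") -> xr.Dataset:
    """Open the Zarr asset of a STAC Collection, picking the asset by role."""
    collection = pystac.read_file(collection_url)
    asset = next(a for a in collection.assets.values() if role in (a.roles or []))
    return xr.open_zarr(fsspec.get_mapper(asset.href), consolidated=True)

# ds = open_collection_zarr("https://example.com/daymet-daily-hi/collection.json")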

Intake-STAC could use some additional hands on deck to continue being useful.

Yep, I’ll be sure to update intake-stac once Add cube:variables definition by TomAugspurger · Pull Request #6 · stac-extensions/datacube (github.com) is in.

2 Likes

Thanks for getting the discussion rolling on this Tom. As someone coming from the EO/STAC background I have been trying to better understand high level metadata usage and data access patterns for the “Earth Systems” user community. Currently, I see two main metadata access patterns that could be well served by using STAC to describe zarr archives.

  1. Allowing users a consistent search experience to discover and load archives based on Dimensions and Data Variables. The intake-esm motivation statement provides a good summary of this requirement.

  2. Generating static catalog pages with the equivalent of xarray.Dataset html display to facilitate data discovery.
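A minimal sketch of that second pattern (the store URL is a placeholder): one simple option would be to open the archive and embed the rich HTML repr xarray already produces for a Dataset.

import fsspec
import xarray as xr

# Placeholder URL; in practice this would come from the catalog entry being rendered.
store = fsspec.get_mapper("https://example.com/archive.zarr")
ds = xr.open_zarr(store, consolidated=True)
html = ds._repr_html_()  # the snippet a static catalog page could embed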

In addition, the Planetary Computer effort has special considerations, as it will be exposing both large volumes of EO data formats and “Earth Systems” data as Zarr archives, so providing a uniform discovery endpoint could improve client library maintenance and interoperability.

The intake-esm efforts demonstrate that there is community desire for high level metadata search for multidimensional data archives. I would be very curious to hear more from “Earth Systems” users about how they want to find and load archive data. Are there common variable names where users want all archives with this variable name (temp)? I’m also assuming that given efforts for building intermediate data archives that we may see a proliferation of zarr archives with smaller spatial dimensions so that users may want to search for archives with coverage in a defined area as we often do in the EO world.

I’m also curious about whether effort should be focused on modeling archives in STAC or on continuing to focus on intake-esm. One notable advantage of STAC is that intake-esm is focused purely on Python users, while STAC could enable search and discovery tooling for a broader ecosystem as the Zarr protocol is implemented in other languages.

Pressing forward with modeling multidimensional archives in STAC, I think the “Proposed Work Items” capture a good path forward and that continuing to refine and extend the datacube extension is a good entry point. At the specification / technical level there are a few considerations to investigate:

  1. There has been previous discussion about whether Zarr archives are best semantically represented as STAC Items or Collections. I think there are pros and cons to both approaches, but a few things to consider: the STAC specification is focused on alignment with the OGC Features API. I’ll let someone with more OGC knowledge weigh in here, but my understanding is that there is no entry point for search across Collections, which might limit the functionality we could implement if archives are modeled as Collections.

  2. The stac-api-spec is moving towards adopting the OGC Features CQL spec for filtering and search. I have done some cursory research, but one outstanding question is whether properties in the datacube's nested dictionary structure can be exposed as queryables and how the CQL predicates might support filtering using these nested structures.

At a high level I think it is important to clarify that we are focusing on capturing metadata for Zarr archives to facilitate archive discovery and loading. In the EO world, due to the nature of our traditional data organization and formats, we leverage STAC metadata to facilitate byte-level access. The lighter, self-describing nature of Zarr archives makes this unnecessary in most cases (I imagine there are edge cases where consolidated_metadata latency is very large, but this seems less prevalent).

🙂 Apologies, this is my first post on Discourse, so the number of links I could post is restricted.

2 Likes

Thanks for bringing up intake-esm. I wonder if @andersy005 or others could chime in on what features it has that would be missed based on what I’ve outlined here. One that comes to mind immediately is (in STAC terms) collection-level search:

In [5]: col_subset = col.search(
   ...:     experiment_id=["historical", "ssp585"],
   ...:     table_id="Oyr",
   ...:     variable_id="o2",
   ...:     grid_label="gn",
   ...: )

So you have a large collection of many related datasets, and filter them down according to some properties of each dataset. IIRC, collection-level search is an eventual goal of the STAC API, but isn’t really being worked on yet. I think in the meantime, libraries like intake-stac can do the search / filtering client side (like intake-esm).
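A sketch of what that client-side filtering could look like with pystac-client (the endpoint and the summary field names are assumptions, and it only works if providers populate those summaries):

import pystac_client

client = pystac_client.Client.open("https://example.com/stac")  # placeholder endpoint

# Filter collections by their summaries, analogous to intake-esm's col.search().
matches = [
    c
    for c in client.get_collections()
    if "o2" in (c.summaries.get_list("variable_id") or [])
    and "historical" in (c.summaries.get_list("experiment_id") or [])
]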

Anything else?

2 Likes

Intake-ESM is a relative newcomer in this area. The real reference point for Earth System data search is the ESGF search API: The ESGF Search RESTful API — esg-search v4.7.19 documentation

I’m hoping some ESGF folks can chime about their search use cases.

1 Like

Hi everyone, I am just joining this thread so apologies if I miss some important points from earlier in the discussion. I am coming at this from an ESGF community perspective. ESGF is a distributed infrastructure which serves climate model output from CMIP5, CMIP6 and CORDEX amongst others.

As part of a wider initiative to re-engineer the software that makes up the federation, we want to move the search API towards a wider agreed community standard. We’ve looked at STAC and like it, but it doesn’t quite fit our needs. If we can extend it to do so, that would be great, so we want to support such efforts. The link @rabernat pointed out above describes the existing API. We require faceted search capability based on a set of controlled vocabularies, including, for example with CMIP6, things like institution and experiment IDs and variable names.

We need little beyond that and would be wary of trying to include too much in the STAC metadata itself when it can be extracted from the data. The search API has worked with a traditional model of HTTP file serving of netCDF data, but we are interested in the use case for Zarr.

2 Likes

Another really useful feature, missing from STAC at the moment, is free-text search i.e. q=. As I understand it, you are required to know the facet names and select one of those. Along with that, something which is imperative for any UI building on top of the STAC API is facet discovery (e.g. Opensearch Description Document). From our perspective at CEDA as a data archive, we have many different datasets and collections which each have their own vocabulary and facets. A generic netCDF-like spec is a good start but different communities will want to be able to search using their own vocabulary. To be really useful as a search, the STAC API should be able to expose the relevant search facets and, ideally, a list of available values. I was looking at the query-spec as a place where this information could be provided.

I appreciate that there is a difference between STAC catalogs (static) and STAC API (dynamic). Your comment on dynamically generated STAC items is interesting. I have always come at it from the point of view that you generate and index your STAC items and they are static; requiring some further processing if you want to subset a time-series, for example.

1 Like

Thanks for sharing that feedback. I’m a bit behind on the STAC API side of things, but as I get up to speed I’ll keep that in mind. I’ve been approaching this with the simplest possible dataset in mind. I’ll keep things like free-text search in mind as we try to catalog larger and more complex datasets.

This is another thing I’m trying to figure out, and would appreciate input from the STAC community on. What are the best practices around “duplicating” information like this between files and STAC metadata? At least for COGs, we do include things like the proj:shape, proj:transform in the STAC metadata, even when it can be obtained from the file. This is extremely useful, since libraries like stackstac can build DataArrays from metadata, without having to open individual files (which is slow).

Contrast this situation with Zarr, where the metadata are just another lightweight json file for the whole cube!

Short update here:


One point of discussion on the item: my current plan is to expose the URL of the Zarr dataset as a collection-level asset. Looking at STAC Browser, you’ll see a few assets like “Daily Hawaii Daymet HTTPS Zarr root”. The idea is to point to the root of the Zarr store. By convention, we can get various tools to understand the asset roles. In that example I have roles like ["data", "zarr", "https"] (and ["data", "zarr", "abfs"] for the Azure Blob FileSystem URL that would be used by adlfs). My next work item is to update intake-stac to look for collection-level assets with those roles. Then we get a workflow like

In [1]: import intake

In [2]: cat = intake.open_stac_catalog("https://raw.githubusercontent.com/TomAugspurger/xstac/main/examples/daymet/catalog.json")  # connect to top-level Catalog

In [3]: hi = cat['daymet-daily-hi']  # browse to specific Collection

In [4]: ds = hi.to_xarray(asset="https")

Edit: prototype for adding collection-level assets to intake-stac is at [WIP]: Load collection-level assets into xarray by TomAugspurger · Pull Request #90 · intake/intake-stac · GitHub

5 Likes

At a “CMIP6 Cloud Discussion” with various ESGF participants this past Friday, some of the CEDA group presented on their latest thinking and coding around STAC. I noted that much/most of this has not been registered on this thread, and thought it worth bringing up, in an effort to coordinate between these efforts and @TomAugspurger’s ongoing work.

(In addition to being a messenger, my interest here is in trying to figure out how we in the overlapping LDEO / Pangeo Forge communities can best support, contribute to, and make use of these evolving tools. As posts on this subject tend to be, this is a longish one, so thanks in advance to all for your patient consideration and feedback.)

Before touching on the three repos CEDA presented, I’d like to highlight a point I noted from @agstephens, who (paraphrasing here) mentioned: “At CEDA, a [STAC] Item has come to be seen as an individual meaningful object that a scientist would want to find and use. This could be a single satellite image in the EO context, but in the ESM context, it could be a Zarr store.” (Ag, please correct me if I’ve misrepresented.)

I bring up this point in particular because it does seem to diverge from the view of a Zarr store as a “Collection-level asset”, as that term is to be used, e.g., by Tom in Collection Search · Issue #145 · radiantearth/stac-api-spec · GitHub. This notion of a Zarr store as a “Collection-level asset” carries through to, if I understand correctly, Tom’s choice to type a Zarr store as a pystac.Collection in xstac/_xstac.py at main · TomAugspurger/xstac · GitHub.

I’m new to all of this terminology, so very much welcome anyone to correct me, if these views are not in fact diverging, or perhaps if the divergence is less consequential than I’m imagining. As I currently see it, however, aligning around whether or not Zarr stores are Items or Collection-level assets seems significant.

This is perhaps a good segue to the repos presented by @Richard_Smith, the first of which was GitHub - cedadev/item-generator (docs here: Item Generator — item_generator documentation). If I understand correctly (a common refrain for me on the subject of STAC), this aims to be a high-level abstraction for generating STAC objects. Perhaps eventually something like Tom’s xstac could even be plugged into it as a backend processor?

I do note that the lingering Item/Collection naming divergence seems to carry through to the name of this package. Richard, is the intention of this package to only generate STAC Items, or could any STAC object (including a Collection) be created with it?

The other two CEDA packages are:

To be honest, I am not clear what specific problems these two packages aim to solve and would greatly appreciate further clarification on that from Richard, Ag, or @philipkershaw. I sense it has something to do with faceted search, but my understanding doesn’t yet go much beyond that. If that’s correct, how do these packages relate (if at all) to Tom’s above-linked STAC feature request in Collection Search · Issue #145 · radiantearth/stac-api-spec · GitHub?

I’ll pause here to give others a chance to weigh in. Looking forward to getting up to speed on all of this and making further contributions once I do.

cc @rabernat

1 Like

Thanks for update from the CEDA side and the link to GitHub - cedadev/item-generator. Based on a quick glance, it and xstac do seem to have similar goals (though I don’t immediately see what the input to item-generator is; in the case of xstac it’s an xarray-readable dataset). And the output of course differs, whether it’s using an Item or a Collection.

I suppose that exposing a Zarr dataset as either a collection-level or an item-level asset is fine (here I’m assuming that cedadev/item-generator is making Items with Assets pointing to the Zarr dataset). In my case I gravitated towards a Collection, since my primary use-case was generating an HTML catalog, and these Zarr datasets will sit at a similar level to our current datasets. If item-level search were the primary use-case, then I could see an Item being a natural object to use. And you could wrap that single Item in a Collection and get both (we’re doing that with some upcoming datasets, I think).

I don’t have a sense for which is better to use though, if either. It’d be good to hear from some STAC experts on this (maybe @sharkinsspatial has thoughts?). One potential downside of putting a whole Zarr dataset in an Item is that you can’t divide it any further. But I still haven’t come up with a compelling use-case for making a collection of STAC items from a Zarr dataset (like an item per variable per chunk), so that might not really be a downside.

This will all be a bit easier to talk about when these datasets are publicly released and available through our STAC API. I’ll be sure to follow up then.

2 Likes

A quick update that might be of interest: GitHub - stac-extensions/xarray-assets is a small STAC extension to facilitate going from a STAC Asset (essentially a link + media type) to an xarray Dataset.

This gives data providers a place to store keywords that are required or recommended to access the dataset. stac-extensions/xarray-assets (github.com) has an example of a required keyword (storage_account when creating the Zarr mapper) and a recommended keyword (consolidated=True to xarray.open_zarr).
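Concretely, an asset using the extension might carry fields along these lines (the href and values are illustrative; only the xarray:* keys come from the extension):

asset = {
    "href": "abfs://daymet-zarr/daily/hi.zarr",  # illustrative href
    "type": "application/vnd+zarr",
    "roles": ["data", "zarr", "abfs"],
    "xarray:storage_options": {"account_name": "example-account"},
    "xarray:open_kwargs": {"consolidated": True},
}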

I’m hopeful that when this is combined with intake-xarray, users will be able to just browse (or search?) a catalog, navigate to the data they want, and load it into xarray without needing any details beyond what that asset’s name is.

1 Like

Thanks @cisaacstern. Given that we have started talking about this to other people, I will aim to put together a blog post that will go into the thinking about some of this stuff and share it here when it is ready. For now, here are my thoughts. (turns out this is looking more and more like the eventual blog post!)

CEDA Software Packages

At CEDA we are coming from the perspective of building an operational system that works with ingest. We currently process ~200-300k files/day into the CEDA archive. To serve various applications, we have an Elasticsearch index of all the files in the archive, where they are, and some basic metadata about them. These are indexed using RabbitMQ to provide a stream of file paths and a modular indexer which can be scaled up on Kubernetes, or make use of our Lotus batch cluster, scaling to 100s of indexers if we need to do a big run. This is the background to this work.

We are very much experimenting and seeing what sticks at the moment, so things could change, but the idea is that we will be dealing with a stream of data object identifiers. These could be traditional file paths on POSIX or object store URLs or even referring to data held on tape storage. All of these need to be processed to produce assets, items and collections.

We started thinking about top-down approaches where you could summarize the data in a location (tape, disk, object store) but this doesn’t work well with a stream of incoming data and the need to constantly update this information, so the approach switched to bottom-up.

The asset-extractor aims to just gather basic information about files and objects which could be useful when presenting to the user (e.g. size, checksum, location). The item-generator aims to use the same file path/object URIs (from now on called URIs) as the asset-extractor but tries to populate the properties field and other attributes of a STAC item. The item-generator is currently focused on items, but we haven’t yet thought about the mechanism for generating collections. It is likely that this would remain top-down, as an aggregation of the items based on some gathering rule (yet to be conceived). Both of these libraries are designed to work on atomic files and expect a file path and will output a dictionary. The asset-scanner provides the framework for these other two libraries to operate in, allowing you to write a configuration file to define the inputs and outputs. The input plugins provide a stream of URIs and the outputs decide what to do with the dictionary which comes out the other end.

In our use case, we have many different types, sources, formats and structures of data, so we needed a way to define a bespoke processing chain for the different groups of common data. Enter Item-descriptions. These YAML files allow you to define a specific processing chain for URIs which match against them hierarchically.

i.e. given the URI /badc/cmip6/data/CMIP6/CDRMIP/CSIRO/ACCESS-ESM1-5/esm-pi-CO2pulse/r1i1p1f1/fx/sftlf/gn/files/d20191119/sftlf_fx_ACCESS-ESM1-5_esm-pi-CO2pulse_r1i1p1f1_gn.nc, processors defined in a YAML file with the following content would match:

datasets:
 - /badc/cmip6

The framework is written to be extensible and allow for more processors to be added to generate the required content from the files. The item-description also defines the facets which are important when constructing a STAC item. These facets should be present and extractable from all assets you wish to group together. These facets are then used to generate a unique ID for assets containing those facets. The receiving application for the output from this process needs to handle the aggregation of responses with the same ID into STAC items.

STAC in General

The ability to sub-divide the assets will depend greatly on the aggregated object. As @TomAugspurger mentioned, adding the Zarr store as a Collection-level Asset would technically allow you to create lower-level, searchable Items, although what their Assets would be I am not sure (Zarr noob, but I don’t think linking to a Zarr sub-object as a standalone file would work? Might be off-base here). As Zarr and netCDF both allow lazy loading and subsetting with their clients, I am not sure there would be the need; or you could pass the Asset list to a WPS service to provide you a subset (gradually becoming a substitute for OPeNDAP). As mentioned, we are of the opinion that the STAC types break down into:

  • Collection - A group of objects which can be described by a common set of “facets”. e.g. CMIP5, Sentinel 3, FAAM

    “The STAC Collection Specification defines a set of common fields to describe a group of Items that share properties and metadata.” - collection spec

  • Item - “At CEDA, a [STAC] Item has come to be seen as an individual meaningful object that a scientist would want to find and use. This could be a single satellite image in the EO context, but in the ESM context, it could be a Zarr store.”

  • Asset - “An Asset is an object that contains a URI to data associated with the Item that can be downloaded or streamed. It is allowed to add additional fields.” - Asset Spec

This approach works nicely with item-search as you could search for your datasets with the current API specification. As collections provide an aggregation of item properties within the “summaries” attribute, you could see that collection search could be really useful to find parents of similar data and could massively reduce the search result clutter when querying 100,000s or 1,000,000s of items.

This is a lot of information but hopefully clarifies our thinking and helps the discussion. I will have a go at expanding on this and post the link here, once it is ready.

2 Likes

Here is the post I promised: Search Futures | CEDA Developer Blog

I have essentially expanded on the post above to include more background, links to external information (was limited to only 2 in the above post) as well as plans for the future.

2 Likes

Thanks for writing that up. I’ve shared it with some folks from the STAC community to get their thoughts.

Defining an Item as

Item - An individual meaningful object that a scientist would want to find and use

is really interesting. I’ll need to think on that a bit more.

A time-series made up of a lot of individual netCDF files might make sense as Items, with an aggregated object at the Collection level. Each file is standalone. Changing to use Zarr-formatted data (Zarr newbie, so correct me if I am wrong, but…) the individual Zarr objects cannot be used on their own and require their parent Zarr store object to be useful.

I’ve been struggling with this too, but I think you’re essentially correct. A single Zarr chunk can be read from just the URL to the store + the chunk key, but that just gets you an ndarray. For the useful stuff, you’re also going to want to read the dimensions from the metadata and the coordinates from other arrays. With something like a COG, all that information is in a single file.
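To make that concrete, a small sketch (placeholder store URL) of what reading a single chunk by key gets you:

import fsspec

# Placeholder store URL; "prcp/0.0.0" is the key of one chunk of a "prcp" array.
store = fsspec.get_mapper("https://example.com/daymet.zarr")
raw = store["prcp/0.0.0"]  # compressed bytes for one chunk
# Decoding those bytes yields a bare ndarray; the dimension names and coordinate
# values still have to come from .zmetadata / the coordinate arrays.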

The datasets that motivated this are now cataloged on our site. A few links:

  1. STAC Collection for Hawaii Daymet at daily frequency: https://planetarycomputer.microsoft.com/api/stac/v1/collections/daymet-daily-hi
  2. The HTML summary, generated from that STAC collection: Planetary Computer
  3. We did the same for TerraClimate (STAC, HTML).

A few things to note:

  • This models the datasets as Collections. My primary use case was generating STAC objects that could be cataloged at the same level as Landsat, Sentinel, etc. at Planetary Computer
  • The STAC collections were generated using xstac which needs to be moved out of my personal GitHub, but should be somewhat functional
  • The STAC collection includes (duplicates) much of the data available in the Zarr store, like the dimensions, coordinates, shapes, and chunks. This is all used to build the HTML summary
  • We noticed a need for communicating fsspec / xarray-specific things, which we’ve written up as a small STAC extension called xarray-assets. The usage from a typical pangeo stack would be something like
>>> import fsspec, xarray, pystac
>>> collection = pystac.read_file("examples/collection.json")
>>> asset = collection.assets["example"]
>>> asset.media_type  # use this to choose `xr.open_zarr` vs. `xr.open_dataset`
'application/vnd+zarr'
>>> store = fsspec.get_mapper(asset.href, **asset.properties["xarray:storage_options"])
>>> ds = xarray.open_zarr(store, **asset.properties["xarray:open_kwargs"])
>>> ds

(or people would just use intake-stac, which would use this pattern internally). The main point is you can go from STAC → DataArray without having to know anything other than the URL to the Collection and the name of the asset.

4 Likes