Metadata duplication on STAC zarr collections

Hi everyone,

I am creating a STAC catalog to hold a number of datasets stored in zarr format, and I’ve been following examples that I find in Microsoft Planetary Computer’s data catalog to guide my decisions. I’ve noticed some redundancy in metadata fields (they are listed in two different locations for a dataset/collection), and I was wondering if any of you had any insights into the reasoning behind these choices - or if you have created your own zarr STAC collections and made different choices. I’ve been looking quite closely at this example because it matches the dataset I am working with in many aspects of its structure and metadata:

Here are some of my specific questions:

  • It appears as though the long_name field is stored in both (1) the standard ‘description’ field and (2) a custom field ‘attrs/long_name’ for variables/dimensions in the datacube extension
  • Similarly, the units field is stored in (1) the standard ‘unit’ field and (2) a custom field ‘attrs/units’ for variables/dimensions in the datacube extension
  • The projection information seems to be present in 3 locations (in sometimes varying forms).
    • The first is a variable called ‘Lambert_conformal_conic’
    • The second is in ‘attrs/grid_mapping’ as ‘lambert_conformal_conic’ for every datacube variable
    • The third is in the ‘reference_system’ field of the datacube dimension (for x and y) as a projjson

For each of these fields, I am curious to learn about the potential benefits of including these redundancies. And if I wanted to minimize duplicated data, is there a preferred location to store it as a default?
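To make the duplication concrete, here is a minimal sketch of what I mean. The variable name, values, and the helper `find_duplicates` are made up for illustration and are not copied from any real collection; only the field names (‘description’, ‘unit’, ‘attrs’) come from what I described above.

```python
# Hypothetical excerpt of a "cube:variables" entry. The values are invented
# for illustration; only the field names mirror the pattern described above.
cube_variables = {
    "prcp": {
        "type": "data",
        "description": "daily total precipitation",
        "unit": "mm/day",
        "dimensions": ["time", "y", "x"],
        "attrs": {
            "long_name": "daily total precipitation",
            "units": "mm/day",
            "grid_mapping": "lambert_conformal_conic",
        },
    }
}

# Pairs of (standard datacube field, attrs key) that may carry the same value.
MIRRORED_FIELDS = [("description", "long_name"), ("unit", "units")]


def find_duplicates(variables):
    """Return (variable, field, attrs key) triples whose values are identical."""
    duplicates = []
    for name, var in variables.items():
        attrs = var.get("attrs", {})
        for field, attr_key in MIRRORED_FIELDS:
            if field in var and var[field] == attrs.get(attr_key):
                duplicates.append((name, field, attr_key))
    return duplicates


duplicates = find_duplicates(cube_variables)
for name, field, attr_key in duplicates:
    print(f"{name}: '{field}' duplicates 'attrs/{attr_key}'")
```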

Thanks!
Amelia

Great post Amelia.

After working with both STAC and Zarr for several years, this challenge really resonates with me. I don’t have a good answer to your specific questions (that’s probably @TomAugspurger :wink: ) but I do have some general thoughts.

I think we have been thinking about Zarr and STAC all wrong.

STAC was designed to be a catalog for individual files (COGs or whatever), the “assets” attached to STAC items. This works very well for the use case for which it was intended.

In the meantime, in Pangeo, we started experimenting with putting Zarr into object storage. The CMIP6 cloud dataset is a good example:

In this project, we created hundreds of thousands of individual Zarr “stores” as we called them. Each one roughly corresponded to a netCDF file and was designed to be opened with Xarray.

So when we started thinking about integrating STAC and Zarr, we naturally thought of Zarr as a file format, akin to COG. I think this was a mistake.

Zarr is not a file format. Zarr IS a catalog! Zarr is much more analogous to STAC itself than it is to COG. Think about it. Zarr is basically an infinitely nestable hierarchy of arrays, with metadata at every level. A Zarr group is like a STAC collection. A Zarr chunk is like a STAC item or asset. Yes, there are many differences in the details, but at a structural level, this is true.

If I were to create the CMIP6 cloud dataset over again, I would have put the entire thing into a single, deeply nested Zarr group.

Then the question would be: how can we create a shim to make a Zarr group act like a STAC catalog? Could we define a metadata standard to map Zarr directly to STAC? With Zarr V3 and its extension process, this becomes feasible.

IMO this is the fundamental way to resolve the problem of duplicated metadata. As long as STAC is unaware of the existence of Zarr metadata, this sort of duplication will be necessary. The only way around it is a deeper integration.

Not an easy problem to solve, but I believe it’s the right way forward.
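As a very rough sketch of what such a shim might do, the toy example below walks a nested description of groups and arrays and emits a STAC-like record for each group. Everything here is an assumption for illustration: the hierarchy is a plain Python dict rather than a real Zarr store, `group_to_stac_like` is a hypothetical helper, and the output is not valid STAC (real STAC expresses children through links, not a nested ‘children’ key).

```python
# A plain-dict stand-in for a nested Zarr hierarchy (not a real Zarr store).
hierarchy = {
    "attrs": {"title": "Example archive"},
    "groups": {
        "model_a": {
            "attrs": {"title": "Model A output"},
            "arrays": {
                "tas": {"dims": ["time", "lat", "lon"], "attrs": {"units": "K"}},
            },
        },
    },
}


def group_to_stac_like(name, node):
    """Recursively map a group description to a STAC-like dict (toy format)."""
    children = node.get("groups", {})
    arrays = node.get("arrays", {})
    record = {
        "id": name,
        # Groups with sub-groups act like catalogs; leaf groups like collections.
        "type": "Catalog" if children else "Collection",
        "description": node.get("attrs", {}).get("title", name),
    }
    if children:
        record["children"] = [
            group_to_stac_like(child_name, child)
            for child_name, child in children.items()
        ]
    if arrays:
        # Array metadata is read from the hierarchy, not stored a second time.
        record["cube:variables"] = {
            array_name: {
                "dimensions": array["dims"],
                "attrs": array.get("attrs", {}),
            }
            for array_name, array in arrays.items()
        }
    return record


catalog = group_to_stac_like("root", hierarchy)
```

The point of the sketch is only that the STAC-like view is derived from the hierarchy on demand, so there is a single source of truth for the metadata.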

I don’t know that there’s a “right” answer to this. In general I’d lean towards using the most specific spot possible, which is why I lifted things like the long_name from the attrs to the description (since I didn’t have anything else appropriate for a description).

Given that decision to use the most specific spot possible, I decided to keep the information in attrs (and so duplicating it) rather than deleting it so that clients could reconstruct the full Zarr / NetCDF metadata just from the STAC response. I’m not aware of anyone using this, but it seemed potentially useful.
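Roughly, the idea looks like the sketch below. This is not the actual xstac code; `lift_attrs` and `reconstruct_attrs` are hypothetical helpers written just to illustrate lifting values into the specific fields while keeping the original attrs so they can be recovered.

```python
def lift_attrs(attrs):
    """Build a datacube-style variable entry, keeping the original attrs intact."""
    entry = {"type": "data", "attrs": dict(attrs)}
    # Copy values into the most specific standard field available.
    if "long_name" in attrs:
        entry["description"] = attrs["long_name"]
    if "units" in attrs:
        entry["unit"] = attrs["units"]
    return entry


def reconstruct_attrs(entry):
    """Recover the original Zarr / NetCDF attributes from the STAC entry."""
    return dict(entry.get("attrs", {}))


# Made-up attributes for illustration.
original = {
    "long_name": "daily total precipitation",
    "units": "mm/day",
    "grid_mapping": "lambert_conformal_conic",
}
entry = lift_attrs(original)
roundtrip = reconstruct_attrs(entry)
```

If the attrs were dropped after lifting, keys with no dedicated STAC field (like grid_mapping here) could not be recovered from the STAC response alone.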

As for the projection information, it’s possible I made some mistakes, but the dimension object accepts a reference_system field, which explains two of the three. (I’m not sure why it’s put in both “horizontal” dimensions; presumably the projection in the x and y directions must always be the same?) The lambert_conformal_conic one is a regular Variable that catalogs that “array” (a scalar) in the dataset, similar to precipitation, temp, etc.

In case it’s helpful, these were generated with xstac, specifically https://github.com/stac-utils/xstac/blob/main/examples/daymet/generate.py.

I think that kind of shim would essentially look like what we have today? Maybe a bit more like https://planetarycomputer.microsoft.com/dataset/cil-gdpcir-cc0, which has multiple items. You’d define some “level” of the nested Zarr tree to be a collection (perhaps multiple levels) with items below it? I think this is worth exploring.

I may be wrong, but I think the duplication of metadata may be unavoidable. For COGs, that’s clear: you have many individual assets you want to search over, so you need to consolidate that metadata into a queryable database. I’m guessing it’s similar for many Zarr datasets. We saw that with the many NetCDF files in Planetary Computer. If you want to efficiently query on a property, you kind of need to put that metadata into a format that can be queried.

If you think of the role of STAC as being primarily about data discovery and zarr as being primarily about using the data then I think this duplication makes more sense.

It doesn’t have to be a bad thing as long as the STAC metadata is generated in a repeatable way. Tom, I get the sense that this is what xstac aspires to be, right? The way to get STAC metadata from Zarr?

Hi @rabernat. We might also consider that Zarr is a file format, but one whose nestable hierarchy facilitates fusion with a STAC catalog. At each level of the hierarchy, alongside the .zgroup object, a .stac object could simply provide the corresponding STAC catalog.
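A minimal sketch of that layout, using plain directories and JSON files rather than a real Zarr library. The `.stac` object is only the idea proposed here, not part of any Zarr or STAC specification, and `write_group` and `find_stac_sidecars` are hypothetical helpers.

```python
import json
import os
import tempfile
from pathlib import Path


def write_group(path, stac_id, description):
    """Create a Zarr v2 style group directory with a hypothetical .stac sidecar."""
    path.mkdir(parents=True, exist_ok=True)
    # Zarr v2 marks a group with a .zgroup JSON object.
    (path / ".zgroup").write_text(json.dumps({"zarr_format": 2}))
    # The .stac sidecar is the proposal above, not an existing convention.
    stac = {
        "type": "Catalog",
        "stac_version": "1.0.0",
        "id": stac_id,
        "description": description,
        "links": [],
    }
    (path / ".stac").write_text(json.dumps(stac))


def find_stac_sidecars(root):
    """List groups (relative to root) that hold both .zgroup and .stac objects."""
    found = []
    for dirpath, _dirnames, filenames in os.walk(root):
        if ".zgroup" in filenames and ".stac" in filenames:
            found.append(Path(dirpath).relative_to(root).as_posix())
    return sorted(found)


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / "store.zarr"
    write_group(root, "root", "Top-level group")
    write_group(root / "model_a", "model_a", "Nested group")
    sidecar_groups = find_stac_sidecars(root)
    root_stac = json.loads((root / ".stac").read_text())
```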

I am poking around to see if anyone is aware of any other work for generating a STAC collection from Zarr stores that I can read through. @TomAugspurger, I am looking through your xstac repository (stac-utils/xstac, “STAC from xarray”).

That’s what xstac is trying to achieve, so let me know if you run into any issues! The cil-gdpcir example (xstac/examples/cil-gdpcir on the main branch of stac-utils/xstac) should be an example of making STAC items for a bunch of separate Zarr datasets (one per dataset), and the terraclimate example (xstac/examples/terraclimate in the same repository) is an example of adding a Zarr dataset as a collection-level asset.