Metadata duplication on STAC zarr collections

Great post, Amelia.

After working with both STAC and Zarr for several years, this challenge really resonates with me. I don’t have a good answer to your specific questions (that’s probably @TomAugspurger :wink: ) but I do have some general thoughts.

I think we have been thinking about Zarr and STAC all wrong.

STAC was designed to be a catalog for individual files (COGs or whatever), the “assets” attached to STAC items. This works very well for the use case for which it was intended.

In the meantime, in Pangeo, we started experimenting with putting Zarr into object storage. The CMIP6 cloud dataset is a good example:

In this project, we created hundreds of thousands of individual Zarr “stores” as we called them. Each one roughly corresponded to a netCDF file and was designed to be opened with Xarray.

So when we started thinking about integrating STAC and Zarr, we naturally thought of Zarr as a file format, akin to COG. I think this was a mistake.

Zarr is not a file format. Zarr IS a catalog! Zarr is much more analogous to STAC itself than it is to COG. Think about it. Zarr is basically an infinitely nestable hierarchy of groups and arrays, with metadata at every level. A Zarr group is like a STAC collection. A Zarr chunk is like a STAC item or asset. Yes, there are many differences in the details, but at a structural level, this is true.
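To make the "hierarchy with metadata at every level" point concrete, here is a minimal sketch in plain Python. It does not use the zarr library. It assumes a simplified in-memory stand-in for the `zarr.json` documents of a Zarr V3 hierarchy, keeping only `node_type` and `attributes`. The paths and attribute values are made up for illustration, and `children` and `describe` are hypothetical helpers.

```python
# Simplified, hypothetical stand-in for the zarr.json documents of a Zarr V3
# hierarchy, keyed by node path ("" is the root group). Real array metadata
# has more required fields (shape, data type, chunk grid, codecs, ...).
hierarchy = {
    "": {"node_type": "group", "attributes": {"title": "CMIP6 (illustrative)"}},
    "CMIP": {"node_type": "group", "attributes": {"activity_id": "CMIP"}},
    "CMIP/historical": {"node_type": "group", "attributes": {"experiment_id": "historical"}},
    "CMIP/historical/tas": {"node_type": "array", "attributes": {"units": "K"}},
    "CMIP/historical/pr": {"node_type": "array", "attributes": {"units": "kg m-2 s-1"}},
}


def children(hierarchy, parent):
    """Return the sorted paths of the direct children of `parent`."""
    prefix = parent + "/" if parent else ""
    return sorted(
        path
        for path in hierarchy
        if path and path.startswith(prefix) and "/" not in path[len(prefix):]
    )


def describe(hierarchy, path="", depth=0):
    """Return one indented line per node, showing its type and attributes."""
    node = hierarchy[path]
    name = path.rsplit("/", 1)[-1] or "/"
    lines = [f"{'  ' * depth}{name} [{node['node_type']}] {node['attributes']}"]
    for child in children(hierarchy, path):
        lines.extend(describe(hierarchy, child, depth + 1))
    return lines


print("\n".join(describe(hierarchy)))
```

Every node, group or array, carries its own attributes, which is the same shape as a catalog tree whose entries each carry their own metadata.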

If I were to create the CMIP6 cloud dataset over again, I would have put the entire thing into a single, deeply nested Zarr group.

Then the question would be: how can we create a shim to make a Zarr group act like a STAC catalog? Could we define a metadata standard to map Zarr directly to STAC? With Zarr V3 and its extension process, this becomes feasible.
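As a rough illustration of what such a shim might look like, here is a sketch in plain Python. It is not an existing library or an agreed standard. It assumes the same simplified stand-in for Zarr V3 `zarr.json` documents (only `node_type` and `attributes`), and `group_to_catalog` is a hypothetical helper. Mapping sub-groups to `child` links and arrays to `item` links is my assumption, and the output is only shaped like a STAC Catalog, not validated against the STAC spec.

```python
import json

# Simplified, hypothetical stand-in for the zarr.json documents of a Zarr V3
# hierarchy, keyed by node path ("" is the root group).
hierarchy = {
    "": {"node_type": "group", "attributes": {"description": "CMIP6 (illustrative)"}},
    "CMIP": {"node_type": "group", "attributes": {"description": "CMIP activity"}},
    "CMIP/historical": {"node_type": "group", "attributes": {"description": "historical runs"}},
    "CMIP/historical/tas": {"node_type": "array", "attributes": {"units": "K"}},
    "CMIP/historical/pr": {"node_type": "array", "attributes": {"units": "kg m-2 s-1"}},
}


def group_to_catalog(hierarchy, path=""):
    """Build a STAC-Catalog-shaped dict from one group, reading its attributes in place."""
    node = hierarchy[path]
    if node["node_type"] != "group":
        raise ValueError(f"{path!r} is not a group")
    attrs = node.get("attributes", {})
    prefix = path + "/" if path else ""
    links = []
    for child in sorted(hierarchy):
        # Keep only direct children of this group.
        if not child or not child.startswith(prefix) or "/" in child[len(prefix):]:
            continue
        # Assumption: sub-groups become "child" links and arrays "item" links.
        rel = "child" if hierarchy[child]["node_type"] == "group" else "item"
        links.append({"rel": rel, "href": child, "title": child[len(prefix):]})
    return {
        "type": "Catalog",
        "stac_version": "1.0.0",
        "id": path.replace("/", "-") or "root",
        # The description comes straight from the Zarr attributes, not a second copy.
        "description": attrs.get("description", ""),
        "links": links,
    }


print(json.dumps(group_to_catalog(hierarchy, "CMIP/historical"), indent=2))
```

The point of the sketch is that the catalog view is derived from the Zarr metadata on demand, so nothing has to be written down twice.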

IMO this is the fundamental way to resolve the problem of duplicated metadata. As long as STAC is unaware of the existence of Zarr metadata, this sort of duplication will be necessary. The only way around it is a deeper integration.

Not an easy problem to solve, but I believe it’s the right way forward.
