Generating COGs and STAC items from DataArrays

Hi all,

I’m looking for guidance / best practices on writing an xarray object to (a collection of) COGs. Let’s start with a common case of a DataArray that’s indexed by (time, band, y, x). Let’s also assume that it’s a chunked DataArray, with a chunksize of 1 for time and band, and it might be chunked along y and x as well.

My high-level questions:

  1. Does rioxarray’s .rio.to_raster(path, driver="COG") have the right defaults? Is there anything special we should do to make sure we write “good” COGs for a single chunk?
  2. Is there an established convention for organizing a directory of COG files that represent a 4-d datacube? Maybe some Open Data Cube folks have run into this?

I’m particularly interested in item 2, since I’d like to easily generate STAC Items for this collection of COGs. With a well-structured directory convention, we can easily group multiple STAC Assets (one per COG) that cover the same area in space and time into a STAC item. My proposed naming convention is

<prefix>/time=<time>/band=<band>-y=<y-coord>-x=<x-coord>.tif

This works well for xarray: we have coordinate information available when writing the chunk, so we can safely generate a unique name for a chunk using the (time, band, y, x) coordinates of, say, the top-left value in the chunk.

To make things concrete, here’s a small proof of concept. Full code is at https://nbviewer.org/gist/TomAugspurger/06ea84bd63b19f86cd5560d9c6a8ff9d.

Suppose we have some DataArray with dimensions (time, band, y, x) and enough metadata for rioxarray to know the CRS, extent, etc.

xcog has a little utility for writing that DataArray out to a collection of COGs (one per chunk):

from pathlib import Path
import xcog


dst = Path("/tmp/cogs/")
dst.mkdir(parents=True, exist_ok=True)

items = xcog.to_cog_and_stac(data, prefix=str(dst), storage_options=dict(auto_mkdir=True))

That writes out these files:

$ tree /tmp/cogs/
/tmp/cogs/
├── time=2021-01-01T17:07:19.024000
│   ├── band=B02-y=4745100.0-x=399960.0.tif
│   ├── band=B02-y=4745100.0-x=454860.0.tif
│   ├── band=B02-y=4800000.0-x=399960.0.tif
│   ├── band=B02-y=4800000.0-x=454860.0.tif
│   ├── band=B03-y=4745100.0-x=399960.0.tif
│   ├── band=B03-y=4745100.0-x=454860.0.tif
│   ├── band=B03-y=4800000.0-x=399960.0.tif
│   ├── band=B03-y=4800000.0-x=454860.0.tif
│   ├── band=B04-y=4745100.0-x=399960.0.tif
│   ├── band=B04-y=4745100.0-x=454860.0.tif
│   ├── band=B04-y=4800000.0-x=399960.0.tif
│   └── band=B04-y=4800000.0-x=454860.0.tif
└── time=2021-01-04T17:17:19.024000
    ├── band=B02-y=4745100.0-x=399960.0.tif
    ├── band=B02-y=4745100.0-x=454860.0.tif
    ├── band=B02-y=4800000.0-x=399960.0.tif
    ├── band=B02-y=4800000.0-x=454860.0.tif
    ├── band=B03-y=4745100.0-x=399960.0.tif
    ├── band=B03-y=4745100.0-x=454860.0.tif
    ├── band=B03-y=4800000.0-x=399960.0.tif
    ├── band=B03-y=4800000.0-x=454860.0.tif
    ├── band=B04-y=4745100.0-x=399960.0.tif
    ├── band=B04-y=4745100.0-x=454860.0.tif
    ├── band=B04-y=4800000.0-x=399960.0.tif
    └── band=B04-y=4800000.0-x=454860.0.tif

2 directories, 24 files

It also returned a list of STAC items with one STAC item per (time, y, x) chunk. Notably, the separate bands have been merged into a single STAC item.

>>> items[0].assets
{'B02': <Asset href=/tmp/cogs/time=2021-01-01T17:07:19.024000/band=B02-y=4745100.0-x=399960.0.tif>,
 'B03': <Asset href=/tmp/cogs/time=2021-01-01T17:07:19.024000/band=B03-y=4745100.0-x=399960.0.tif>,
 'B04': <Asset href=/tmp/cogs/time=2021-01-01T17:07:19.024000/band=B04-y=4745100.0-x=399960.0.tif>}

Thanks to rioxarray and xarray’s coordinates, this has all the proj information necessary for tools like stackstac, so kind of a round-trip from DataArray → {STAC + COG} → DataArray is possible:

(I noticed that I’m off by one on the size there).


Hopefully that example makes the use-case clear. Looking forward to any thoughts from people who have dealt with this before. Thanks!

8 Likes

Naming conventions are hard to design, especially for generic case. On one hand you want readability and compactness on the other hand you want to include enough information to have unique paths and possibly filenames too, you also want that information to be “glance-able by a human” (hence compactness is a plus).

I think for library/tool design you are better off starting with some sort of “path construction DSL”, then come up with reasonable default for that. This DSL could be as simple as Python format string with a bunch of properties made available to it, things like band name and various kinds of date/time representations and who knows what else, like keys from STAC properties.

Regarding specific examples in your post, I must say I’m not a fan :frowning:, sorry.

  1. I doubt a lot of people have a good idea what UTM coordinate -y=4800000.0-x=454860.0 corresponds to, also without knowing which UTM zone that is it’s not even possible to tell.
  2. Other problem with raw coordinates is that they can be negative, and in this case you end up with two kinds of meaning for - character: a separator and a negative sign. This also ruins column alignment as negative coordinates have extra character to them, you might want to balance that with a + for positive values, but + is problematic for URLs so is not a good idea to include that. And then there are lon/lat which require a lot of precision. I’d say coordinates in a file path are just not practical.
  3. Band is a primary property, so maybe replace band={band} with just {band}
  4. Using correct amount of precision for timestamps would be good, but might be hard without out-of-band information provided by an operator.

I personally prefer notation where file name is just {band}.tif and all other identifying information is in the directory path. The path is then either {time}/{space} or {space}/{time} with various levels of nesting for both {time} and {space} components. Problem with that notation however is that file paths are not unique so you can’t easily just download a bunch into a folder for debugging without renaming as you go.

Regarding “off by one” error: stackstac defaults to using top-left corner for your x/y coordinates, but rio-xarray probably expects pixel centers. So you might end up with off by half a pixel transform written to TIFF, then when stackstac loads it back it does “pixel grid snapping” and so wants to offset everything by half a pixel. I wish stackstac had xy_coords="center" as a default instead. I haven’t actually ran any of that it’s just a guess…

3 Likes

Also see this issue in stackstac

Thanks for the feedback!

On one hand you want readability and compactness on the other hand you want to include enough information to have unique paths and possibly filenames too, you also want that information to be “glance-able by a human”

This is a really important point. One additional thing to consider: now that we have STAC, maybe the balance should swing more towards uniqueness? In my mind, this pile of COGs would be easiest to explore / debug if you had STAC items for them. An initial prototype of this just used dask.base.tokenize on each chunk, which generated a unique string but was completely unreadable by humans. So the combination of time/band, y, x was a way to try to get uniqueness in a more human-friendly way.

My workflow was assuming blob storage, so I didn’t put much thought into nested directories vs. a flat namespace.

What is “human-friendly” depends on a workflow I suppose. One extreme is to just use UUID or hash and nothing else, this makes it super clear that filename is not to be interpreted directly. Other extreme is to include almost everything in the filename in a somewhat readable form, this only works for specific collections, see Landsat or Sentinel paths for examples of that. Mid point would be something like {band}-{place}-{time}-{hash} or {band}-{time}-{place}-{hash}, depending on what you value more “grouping by time” or “grouping by place”.

{time} is sort of easy, just need to figure out right level of precision, and support ranges and not just points in time. But {place} is much harder to do generically, it really only works for tiled data, and even then you need a naming scheme for those tiles. I’d say {place} will have to come from the user, maybe an attribute you can set on an xarray? And if it’s not provided by the user just omit it altogether and rely on {hash} for uniqueness. I guess you could encode a bounding box of the image into a filename but this will be even less readable and longer than a hash and harder to tab-complete too. If some sort of {hash} is too “unhuman”, one can substitute with some random words, like what docker does for container names.

I guess one could use one of those fancy things like H3 to encode {place} in a generic way, but that’s getting complicated quick. I’d say better option is to allow user to override those things via attributes or encoding properties of the xarray.

In my experience there is still benefit in avoiding “directories” with too many files even when storage medium is some sort of blob store. For example reading COG from a blob store with thousands of “sibling” blobs can be done quickly in software if you configure GDAL right, but sometimes you want to open it in QGIS and there is no obvious way to tell QGIS not to bother looking for side-car files, so QGIS will struggle at open time even though image is COG.