Hi all,
I’m looking for guidance / best practices on writing an xarray object to (a collection of) COGs. Let’s start with a common case of a DataArray that’s indexed by (time, band, y, x)
. Let’s also assume that it’s a chunked DataArray, with a chunksize of 1 for time
and band
, and it might be chunked along y
and x
as well.
My high-level questions:
- Does rioxarray’s
.rio.to_raster(path, driver="COG")
have the right defaults? Is there anything special we should do to make sure we write “good” COGs for a single chunk? - Is there an established convention for organizing a directory of COG files that represent a 4-d datacube? Maybe some Open Data Cube folks have run into this?
I’m particularly interested in item 2, since I’d like to easily generate STAC Items for this collection of COGs. With a well-structured directory convention, we can easily group multiple STAC Assets (one per COG) that cover the same area in space and time into a STAC item. My proposed naming convention is
<prefix>/time=<time>/band=<band>-y=<y-coord>-x=<x-coord>.tif
This works well for xarray: we have coordinate information available when writing the chunk, so we can safely generate a unique name for a chunk using the (time, band, y, x)
coordinates of, say, the top-left value in the chunk.
To make things concrete, here’s a small proof of concept. Full code is at https://nbviewer.org/gist/TomAugspurger/06ea84bd63b19f86cd5560d9c6a8ff9d.
Suppose we have some DataArray with dimensions (time, band, y, x)
and enough metadata for rioxarray to know the CRS, extent, etc.
xcog
has a little utility for writing that DataArray out to a collection of COGs (one per chunk):
from pathlib import Path
import xcog
dst = Path("/tmp/cogs/")
dst.mkdir(parents=True, exist_ok=True)
items = xcog.to_cog_and_stac(data, prefix=str(dst), storage_options=dict(auto_mkdir=True))
That writes out these files:
$ tree /tmp/cogs/
/tmp/cogs/
├── time=2021-01-01T17:07:19.024000
│ ├── band=B02-y=4745100.0-x=399960.0.tif
│ ├── band=B02-y=4745100.0-x=454860.0.tif
│ ├── band=B02-y=4800000.0-x=399960.0.tif
│ ├── band=B02-y=4800000.0-x=454860.0.tif
│ ├── band=B03-y=4745100.0-x=399960.0.tif
│ ├── band=B03-y=4745100.0-x=454860.0.tif
│ ├── band=B03-y=4800000.0-x=399960.0.tif
│ ├── band=B03-y=4800000.0-x=454860.0.tif
│ ├── band=B04-y=4745100.0-x=399960.0.tif
│ ├── band=B04-y=4745100.0-x=454860.0.tif
│ ├── band=B04-y=4800000.0-x=399960.0.tif
│ └── band=B04-y=4800000.0-x=454860.0.tif
└── time=2021-01-04T17:17:19.024000
├── band=B02-y=4745100.0-x=399960.0.tif
├── band=B02-y=4745100.0-x=454860.0.tif
├── band=B02-y=4800000.0-x=399960.0.tif
├── band=B02-y=4800000.0-x=454860.0.tif
├── band=B03-y=4745100.0-x=399960.0.tif
├── band=B03-y=4745100.0-x=454860.0.tif
├── band=B03-y=4800000.0-x=399960.0.tif
├── band=B03-y=4800000.0-x=454860.0.tif
├── band=B04-y=4745100.0-x=399960.0.tif
├── band=B04-y=4745100.0-x=454860.0.tif
├── band=B04-y=4800000.0-x=399960.0.tif
└── band=B04-y=4800000.0-x=454860.0.tif
2 directories, 24 files
It also returned a list of STAC items with one STAC item per (time, y, x)
chunk. Notably, the separate bands have been merged into a single STAC item.
>>> items[0].assets
{'B02': <Asset href=/tmp/cogs/time=2021-01-01T17:07:19.024000/band=B02-y=4745100.0-x=399960.0.tif>,
'B03': <Asset href=/tmp/cogs/time=2021-01-01T17:07:19.024000/band=B03-y=4745100.0-x=399960.0.tif>,
'B04': <Asset href=/tmp/cogs/time=2021-01-01T17:07:19.024000/band=B04-y=4745100.0-x=399960.0.tif>}
Thanks to rioxarray and xarray’s coordinates, this has all the proj
information necessary for tools like stackstac, so kind of a round-trip from DataArray → {STAC + COG} → DataArray is possible:
(I noticed that I’m off by one on the size there).
Hopefully that example makes the use-case clear. Looking forward to any thoughts from people who have dealt with this before. Thanks!