How do you store multi-dimensional arrays in a single tile in a COG?

@jsignell and Jarrett Keifer published a really great and thought-provoking blog post on the similarities, differences, and synergies between COG and Zarr. :heart_exclamation: at the message that people should focus on how to work together and leverage the value of all file formats :heart_exclamation:

A minor point that underpins one of the messages, that COG and Zarr are more interchangeable than commonly acknowledged, is that “COG does support multiple dimensions within an array as long as all have the same size and data type, and it can support multiple unrelated arrays”. My understanding of the dimensionality difference between these formats was that COG has chunk-equivalents that span 1 (strips) or 2 (tiles) dimensions, such that you couldn’t do the equivalent of this using COGs:

import xarray as xr

# open_zarr (rather than open_mfdataset) for a single Zarr store
ds: xr.Dataset = xr.open_zarr("ERA5.zarr")
# Rechunk the data
ds = ds.chunk({"time": 50, "latitude": 50, "longitude": 50})
# Save to Zarr v3 (zarr_format replaced the deprecated zarr_version kwarg)
ds.to_zarr("rechunked_ERA5.zarr", zarr_format=3)

Does COG support multidimensional arrays where a single tile spans multiple indexes along more than 2 dimensions? If so, does anyone have an example of how to produce such a COG?

cc a few of our resident COG experts @Vincent_Sarago @Michael_Sumner @jsignell

4 Likes

I would say “3D”, not multi-dimensional.

A typical example is RGB: if we open a tiled dataset, access the first tile, and read at the dataset level, we get all bands in one array. The complementary support is there for write or update, because that is exactly what the internal RasterIO C++ function is: a generic handler with a consistent set of args.

dsn <- "https://raw.githubusercontent.com/OSGeo/gdal/refs/heads/master/frmts/wms/frmt_wms_bluemarble_s3_tms.xml"


library(reticulate)
py_require("gdal")
gdal <- import("osgeo.gdal")
ds <- gdal$Open(dsn)
ds$GetRasterBand(1L)$GetBlockSize()
## [1] 256 256
## at the dataset level we get an array read 
dim(ds$ReadAsArray(0L, 0L, 256L, 256L))
#[1]   3 256 256

(I’m using R there because it’s easier for me tooling-wise, but happy to explore this at depth with Python. rasterio has a more “array-like” focus, but the fact is GDAL also has that: if you use Read on a dataset (rather than on a band) you get all bands, or the ones you specify.)

How that data is interleaved in the source depends on the format specifics and options at creation time. I feel it’s a little bit slippery to say “GeoTIFF supports arrays”, but it really does, in the sense of how the bytes are oriented in storage. So, yeah - good question! (within caveats)

The fact that pixel interleaving is possible is also why per-band access has led STAC to work with single-band tiffs for the most part (imo). Using bands as a proxy for time or depth, with the pixels interleaved as is normal for RGB, is a performance killer when you read per-band.
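To make that cost concrete, here’s a toy byte-count sketch. The tile size, band count, and function name are all mine (purely illustrative), and it ignores compression, which in practice makes the pixel-interleaved case even worse because you must decompress the whole tile regardless.

```python
# Toy comparison (hypothetical layout arithmetic, not a real TIFF reader):
# how many bytes must be fetched to read ONE band of one tile under each
# interleave mode.

TILE_W, TILE_H = 256, 256   # tile size in pixels
N_BANDS = 3                 # e.g. R, G, B
DTYPE_SIZE = 1              # bytes per sample (uint8)

def bytes_fetched_for_one_band(interleave: str) -> int:
    """Bytes a reader must pull from storage to recover one band of one tile."""
    tile_pixels = TILE_W * TILE_H
    if interleave == "band":
        # BSQ: each band's tile is a contiguous run; one band = one clean range.
        return tile_pixels * DTYPE_SIZE
    if interleave == "pixel":
        # BIP: samples for all bands alternate per pixel, so the whole
        # interleaved tile must be fetched even though only 1/N_BANDS is wanted.
        return tile_pixels * N_BANDS * DTYPE_SIZE
    raise ValueError(interleave)

print(bytes_fetched_for_one_band("band"))   # 65536
print(bytes_fetched_for_one_band("pixel"))  # 196608
```

So with pixels interleaved, a per-band read pulls N_BANDS times the bytes it needs; using bands as a time axis multiplies that penalty by the length of the time series.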

Here’s the signature for ReadAsArray from dataset level:

ReadAsArray(xoff = 0L, yoff = 0L, xsize = NULL, ysize = NULL, 
buf_obj = NULL, buf_xsize = NULL, buf_ysize = NULL, buf_type = NULL, 
resample_alg = 0L, callback = NULL, callback_data = NULL, interleave = "band", 
band_list = NULL)

Here it is at the band level:

ds$GetRasterBand(1L)$ReadAsArray
<bound method Band.ReadAsArray of <osgeo.gdal.Band; proxy of <Swig Object of type 'GDALRasterBandShadow *' at 0x7fd7883c93e0> >>
 signature: (
   xoff=0, 
   yoff=0, 
   win_xsize=None, 
   win_ysize=None, 
   buf_xsize=None, 
   buf_ysize=None, 
   buf_type=None, 
   buf_obj=None, 
   resample_alg=0, 
   callback=None, 
   callback_data=None
)

awesome blog post btw, that’s fantastic - and that example comparing byte for byte is really helpful!

Here’s code to take the ERA5 Zarr subset to a COG, interleaved by pixel. You can’t really open such a large dataset (too many bands) with GDAL classic without using a special ZARR driver syntax to limit the bands. So I open as multidim, subset a view, and cast down.

#export GS_NO_SIGN_REQUEST=YES
#import os
#os.environ["GS_NO_SIGN_REQUEST"] ="YES"

from osgeo import gdal
gdal.UseExceptions()
dsn =  "/vsigs/gcp-public-data-arco-era5/ar/full_37-1h-0p25deg-chunk-1.zarr-v3"
ds =  gdal.OpenEx(dsn, gdal.OF_MULTIDIM_RASTER)
## these aren't ordered in the way we expect because they are root group level not per-MDArray  
[x.GetSize() for x in ds.GetRootGroup().GetDimensions()]
##[721, 37, 1440, 1323648]

anarray = ds.GetRootGroup().GetMDArrayNames()[4]
anarray
# '100m_u_component_of_wind'

mdarr = ds.GetRootGroup().OpenMDArray(anarray)
## now we get an expected ordered set of dims
[x.GetSize() for x in mdarr.GetDimensions()]
##[1323648, 721, 1440]

##  time steps, all lat, all lon
view = mdarr.GetView("[0:4,:,:]")
cds = view.AsClassicDataset(2, 1) ## we want x, y
gdal.Translate("my.tif", cds, format = "COG", creationOptions=["BLOCKSIZE=128"])

## notice how this is interleaved by pixel, so each 128x128 block is 0:4 values first pixel, then second pixel etc. 
#gdalinfo my.tif

#Image Structure Metadata:
#  LAYOUT=COG
#COMPRESSION=LZW
#INTERLEAVE=PIXEL
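A quick sketch of that interleaved order: my own index arithmetic, modelling the order of values after decompression (not the LZW bytes on disk), and assuming float32 samples, which is my assumption about the ERA5 data type.

```python
# Sketch: where a given sample lives inside one pixel-interleaved 128x128
# block of the 4-band (time 0:4) COG created above, once decompressed.

BLOCK = 128
N_TIME = 4          # the 4 time steps written as bands
DTYPE_SIZE = 4      # float32 (assumed)

def sample_offset(row: int, col: int, t: int) -> int:
    """Byte offset of time step `t` at pixel (row, col) within one block."""
    pixel_index = row * BLOCK + col
    return (pixel_index * N_TIME + t) * DTYPE_SIZE

# the first pixel's 4 time steps are adjacent...
print([sample_offset(0, 0, t) for t in range(N_TIME)])  # [0, 4, 8, 12]
# ...then the second pixel's 4 time steps follow immediately:
print(sample_offset(0, 1, 0))  # 16
```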

I’m not absolutely sure about the internal storage, but I’ll keep exploring. When you Read() or ReadAsArray() with GDAL it unpacks the internal orientation, but in the API you can read raw compressed chunks (not exposed in the SWIG Python bindings, afaik).

There are several ways to represent this in COG/TIFF. For these examples let’s assume a 50x50-pixel image with TILE_WIDTH = 10 and TILE_HEIGHT = 10 (creating an Nx5x5 grid of tiles).

  • A single 50 band COG with either band or pixel interleaving depending on the access pattern. Each tile in the image contains all time dimensions, for a tile size of (50, 10, 10). Any additional IFDs are used to hold overviews.
  • Multiple COGs, each containing bands representing a slice of time. For example, if each COG holds 10 time slices our dataset would turn into 5 COGs with tile size of (10, 10, 10). Additional IFDs are used for overviews.
  • A single GeoTIFF (multi-page tiff) which stores each time slice as an IFD. GDAL describes these as “subdatasets”. For example, a single image with 5 subdatasets and a tile size of (10, 10, 10).
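For concreteness, here’s a toy tally of those three layouts. The 50x50-pixel image size is my assumption (it’s what makes the 5x5 tile grid work out); the numbers are just arithmetic, not derived from any actual file.

```python
# Sketch of the three layouts above for a hypothetical (time=50, y=50, x=50)
# array with TILE_WIDTH = TILE_HEIGHT = 10.

T, H, W = 50, 50, 50
TILE = 10
tiles_per_slice = (H // TILE) * (W // TILE)  # the 5x5 grid of tiles

# 1. one 50-band COG: every tile carries all time steps
single_cog = {"files": 1, "tile_shape": (T, TILE, TILE),
              "tiles": tiles_per_slice}

# 2. five COGs of 10 time slices each
per_file_t = 10
multi_cog = {"files": T // per_file_t, "tile_shape": (per_file_t, TILE, TILE),
             "tiles": tiles_per_slice}

# 3. one multi-page TIFF with one IFD ("subdataset") per chunk of 10 slices
subdatasets = {"files": 1, "ifds": T // per_file_t,
               "tile_shape": (per_file_t, TILE, TILE)}

print(single_cog, multi_cog, subdatasets, sep="\n")
```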

It seems to me that, as GDAL sees it, GeoTIFF supports multi-dimensional arrays through the use of “subdatasets”, where IFDs are used to chunk the array across a 3rd dimension. However, most COG readers will not support this, as IFDs are reserved for representing overviews. It seems GDAL did add support for subdatasets + overviews through the use of the SubIFD TIFF tag; I wonder how widely used that is:

[see here] Starting with GDAL 3.2, read-only access to subdataset overviews and masks is possible provided that they are referenced by their parent IFD through the TIFFTAG_SUBIFD tag.

This is a great read on SubIFD - TIFF: IFD and SubIFD | Danny Berger

Noodling on a few different configurations, this first one uses each SubIFD to represent a chunk of the time dimension. All data within each time slice is stored in the same place on disk, such that a reader can fetch {"time": 10, "latitude": 10, "longitude": 50} in a single range request.

I have a few more examples to share but the new user restrictions are preventing me from attaching multiple images to a post. I tried to create multiple posts but am also limited to 3 replies per topic :frowning:

1 Like

This is all such great information, thank you both! I’m going to dive into this more in a bit.

I have a few more examples to share but the new user restrictions are preventing me from attaching multiple images to a post. I tried to create multiple posts but am also limited to 3 replies per topic :frowning:

I just increased your trust level manually, please let me know if you run into more problems. Unfortunately we do get spam so the restrictions are helpful since we’re not 24/7 moderators but admins can always manually elevate people as needed.

I’m not across the actual structures so it’s overdue that I do that, appreciate the details! I remember long ago encountering overviews stored as subdatasets, before overviews were supported (this was in Landscape Mapper in the 2000s; that was a weird file variant that the author hoped would be quietly ignored going forward). I assumed that overviews were just tacked on the end of the chunks that store the highest detail, but I realise I have no idea how that’s done.

I don’t think it’s a really great idea to bend the model too much, but certainly you could store any shape within one flat dimension that you index externally. A GeoTIFF could even be created empty, and use its latent tiles to refer to external chunks that only represent the first two dims, indexed by tile row,col. But then, what does that add to a simple set of numbers that define the chunking and (some latent) coordinates? I think a core problem here is that the underlying grid logic, chunks plus (extent) or (coordinates), is an extremely simple and powerful abstraction that’s a bit too buried in Zarr, in GeoTIFF, and in software generally.

I’m really appreciating the discussions going on, and sorry (please humour me a bit) for thinking out loud here. In GeoZarr I think the abstraction is a bit lost in the data-interpretation efforts (just as important, but completely separable), and it keeps being reinvented as almost-standalones (griffine, rasterix, odc-geo, titiler, R’s raster/terra, there are many).

1 Like

My takeaway from your comment @geospatialjeff is that COG does not natively support multi-dimensional arrays at all. You seem to be saying that the only ways to coerce a 3+ dimensional array into COG are to abuse one of COG/TIFF’s other features (bands, IFDs, or overviews), thereby colliding with them; or to distribute the array as chunks over multiple COGs, at which point you’ve gone outside the capabilities of a single COG and your collection starts to look like a poor man’s Zarr store. Note also that none of your suggestions allow arbitrary choice of chunking along that 3rd dimension.

1 Like

This is correct. While it is supported by (Geo)TIFF, the COG spec does not provide enough detail on how to navigate the differences between the representation of overviews and multi-dimensionality for a reader to make an informed decision on how to interpret the data. I’d argue this is a good thing: COG is a specification that defines a cloud-friendly way to organize TIFF internals; it doesn’t dive into specific use cases. We are missing a flavor of TIFF (like OME) which describes how to store multi-dimensional array data inside a TIFF, or across several TIFFs, and which would be interpreted along with the COG spec to provide cloud-friendly multi-dimensional arrays via TIFF.

The layout in my first diagram could chunk along the 3rd dimension by splitting the time series across multiple COGs. The first COG representing time slices 0-10, second representing time slices 10-20 etc. This type of virtualization is encouraged even in the GDAL ecosystem with things like the WarpedVRT which allows multiple COGs to be interpreted as a single one.

That being said, “within COG” chunking along the 3rd dimension is achievable by organizing the SubIFDs along a different dimension, such that all chunks for the third dimension are organized (row major) within each tile. This approach would likely require something similar to Zarr’s “consolidated metadata” to reduce the overhead of having so many SubIFDs.
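A toy sketch of what I mean by that row-major organization. The indexing scheme and names here are mine, not from the TIFF spec; the point is just that all time chunks for a given spatial tile land at contiguous SubIFD indices.

```python
# Sketch (hypothetical scheme): flatten a 3-D chunk grid
# (tile_row, tile_col, time_chunk) into row-major SubIFD indices, with time
# varying fastest, so a reader can address a time chunk without scanning.

N_TIME_CHUNKS = 5
N_TILE_ROWS = 5
N_TILE_COLS = 5

def subifd_index(tile_row: int, tile_col: int, time_chunk: int) -> int:
    """Row-major index with time chunks contiguous within each spatial tile."""
    return (tile_row * N_TILE_COLS + tile_col) * N_TIME_CHUNKS + time_chunk

# all 5 time chunks of spatial tile (0, 0) sit at indices 0..4
print([subifd_index(0, 0, t) for t in range(N_TIME_CHUNKS)])  # [0, 1, 2, 3, 4]
# tile (0, 1) starts right after
print(subifd_index(0, 1, 0))  # 5
```

With 25 spatial tiles times 5 time chunks that’s already 125 SubIFDs for this tiny example, which is why something like consolidated metadata would be needed to avoid paying a header fetch per SubIFD.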

Thanks for the detailed explanations @geospatialjeff - your explanations of TIFF subtleties are interesting.

But if you did this you couldn’t choose to have no chunking (i.e. only 1 chunk) along that 3rd dimension right? So it’s still not general. And that’s something you might well want, especially for temporal coordinates.

We are missing a flavor of TIFF (like OME) which describes how to store multi-dimensional array data inside a TIFF, or across several TIFFs. Which would be interpreted along with COG spec

This approach would likely require something similar to Zarr

Or you could just… you know… :laughing:

2 Likes

Hahah it’s true, this does quickly start to look like a “poor man’s Zarr” (as you say). I have personally never run into a use case where, in practice, TIFF’s lack of support for n-dimensional data was an issue for me; in those cases I just use Zarr instead. This is highlighted by the fact that I consider myself a COG expert but had never considered how SubIFDs play into the format until this thread. Multi-dimensional COG is just not a use case that has come up frequently for me, and that’s not because I have never worked with n-dimensional data.

Maybe the “COG as Zarr” idea presented in the E84 blog post is a more direct path to implement n-dimensionality in COGs compared to building a community extension like OME. COG as the storage format for each chunk, and Zarr to orchestrate the chunking across multiple COGs. The long-standing interoperability of TIFF (within the larger GIS ecosystem) paired with all the nice things provided by Zarr; like interoperability with xarray/dask/icechunk etc.

I guess the major blocker for broader adoption of Zarr as the n-dimensional data format for explicitly geospatial datasets is the GeoZarr specification. This thread, in my opinion, does a great job highlighting what I would argue is the main issue with GeoZarr’s current approach of defining an abstract data model that wraps all of Zarr / CF / CDM / TIFF etc. These things are inherently different; creating a single abstract data model that works for everything will be difficult and involves too much bike shedding.

2 Likes

What they did in the blog post was very clever, but IIUC it’s also basically what VirtualiZarr does. The only real difference is they avoided duplicating the (relatively tiny) chunk index at the cost of more manual effort, and a less generalizable approach that wouldn’t work for other formats that are not already cloud-optimized.

GeoZarr specification

I don’t want to speak to the GeoZarr efforts because I haven’t been following the details, but I do think your suggestion at CNG of “just copy the GeoTIFF spec straight into Zarr because we know that’s useful” has some merit.

enjoying this discussion and agree very much.

I do think that pixel-interleaving is “more like” Zarr than band-interleaving, and you could craft cases that make sense for efficient access to time-aligned (or depth-aligned) data, but that 3rd-dimension chunk would always be a singleton (all bands, or 1 band) in a single file. I think that’s worth mentioning, in a pedantic way, because it affects access time so drastically (it was the default for multiband for a long time, and really should only be used for truly coupled values like R,G,B, or u,v).

A minor point: WarpedVRT is a special variant, and I would point to VRT per se (and potentially GTI and STAC-GeoParquet) as cases that can reference seamless mosaics as well. They also allow disparate-CRS latent sources like WarpedVRT does (in a regular parent grid), and both transcend the inability of VRT XML/Dataset to scale well for many, many sources.

That’s a good point - for this reason I’m not sure that band-interleaved is even virtualizable in the current Zarr model (or maybe it could be, but read throughput would then have to be very inefficient). @maxrjones might be able to confirm that though.

you mean “pixel-interleaved” is not virtualizable in the current Zarr model? (I think, at least that’s my understanding from our convo). There are other limitations, like:

if you have a GeoTIFF file using JPEG compression, with the quantization tables stored in the TIFF JpegTables tag (i.e. shared for all tiles), which is the formulation that GDAL would use by default on creation, then I don’t see how Kerchunk can deal with that, since that would be 2 distinct chunks in the file, and the recombination is slightly more complicated than just appending them together before passing them to a JPEG codec.

I really need to find time to understand all the amazing components of this conversation :sweat_smile:

but just on the latest part about virtualization potential, both Chunky and Planar configurations (the raw TIFF spec version of pixel vs band interleaved IIUC) are implemented in GitHub - virtual-zarr/virtual-tiff: Produce and explore virtual Zarr with TIFFs. That library will be ready when VirtualiZarr v2.0 is released :laughing:

Chunky configuration did require writing a new Zarr codec though which I have yet to propose as a formal Zarr extension.
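For anyone curious what such a codec has to do, here’s a toy pure-Python deinterleave. This is not the actual virtual-tiff codec, just a minimal sketch of the reordering it performs: taking a Chunky (pixel-interleaved) chunk’s raw samples and regrouping them band-major so they line up with Zarr’s expected (band, y, x) ordering.

```python
# Toy sketch of a "deinterleave" step for TIFF's Chunky configuration:
# reorder [p0b0, p0b1, p0b2, p1b0, ...] into [all b0][all b1][all b2].

def deinterleave(raw: bytes, n_bands: int, sample_size: int) -> bytes:
    """Regroup pixel-interleaved samples into band-major order."""
    # split the buffer into fixed-size samples
    samples = [raw[i:i + sample_size] for i in range(0, len(raw), sample_size)]
    out = bytearray()
    for b in range(n_bands):
        # band b of pixel k lives at sample index k * n_bands + b
        for p in range(b, len(samples), n_bands):
            out += samples[p]
    return bytes(out)

# two pixels, three uint8 bands, interleaved: R0 G0 B0 R1 G1 B1
chunky = bytes([10, 20, 30, 11, 21, 31])
print(list(deinterleave(chunky, n_bands=3, sample_size=1)))
# [10, 11, 20, 21, 30, 31]
```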

JPEG tables aren’t yet supported or compatible with the Zarr spec, but that could change with an extension. It seems like more of an edge case though, so I’m not motivated to spend time on it unless someone comes up with a compelling use-case.

1 Like