Example which highlights the limitations of NetCDF-style coordinates for large geospatial rasters

Context

We have been working on the GeoZarr spec for quite some time. In the process, I have been confronted with the very different ways that the NetCDF world and the geospatial raster world (GeoTIFF) store the coordinates of data. For more background, see this issue.

To summarize, here’s how it mostly works today:

  • NetCDF-style data (including Zarr data written by Xarray) stores all coordinates explicitly, i.e. longitude=[-180., -179.5, -179., ...]
  • Most GeoTIFF data uses the AffineGeoTransform from the GDAL data model.

Although Xarray originally emerged for working with weather / climate data stored in NetCDF, it has since become the de-facto standard container for gridded geospatial data in Python, including traditional geospatial rasters.

The bridge between the geospatial raster / GDAL world and Xarray is RioXarray (which wraps rasterio and interfaces it with the Xarray data model). Rasterio implements the actual logic needed to bridge affine-transform-based coordinates with explicit NetCDF-style coordinates, by simply converting the geometric affine transform to explicit numpy arrays.
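Conceptually, that conversion amounts to the following (a minimal sketch, not rasterio's actual code; the transform values here are arbitrary):

import numpy as np
from affine import Affine

# affine transform: 0.5-degree pixels, upper-left corner at (-180, 90)
transform = Affine(0.5, 0.0, -180.0, 0.0, -0.5, 90.0)
ncols, nrows = 720, 360

# materialize explicit 1D coordinate arrays at pixel centers
longitude = transform.c + transform.a * (np.arange(ncols) + 0.5)
latitude = transform.f + transform.e * (np.arange(nrows) + 0.5)

Once the coordinates are materialized like this, the exactness of the original transform is gone; every downstream comparison operates on rounded float64 values.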

For the past year, I’ve been hearing folks like @Michael_Sumner and David Blodgett of USGS explain that this approach is insufficient, as the explicit representation suffers from fundamental inaccuracy issues related to the finite precision of floating-point data types.

Here I’ve developed an example to convince myself that this is indeed a serious problem that needs fixing deep down in the software stack.

The Example - Big Tiled Data Cube

A pretty common use case in our community is to take Landsat / Sentinel data and build some sort of harmonized spatio-temporal datacube on the planetary, continental, or country scale. Because this is generally too big to put in a single GeoTIFF, folks have two options:

  1. Break the cube up into many smaller GeoTIFFs
  2. Store the cube as one big Zarr Array with reasonably sized chunks

This example is relevant to option 2.

I am going to use odc.geo to define both the big data cube and the smaller sub-tiles.

from odc.geo.geobox import GeoBox, GeoboxTiles
from odc.geo.xr import xr_zeros
import zarr
import xarray as xr
import warnings
warnings.filterwarnings('ignore')

dx = 1 / 3600  # 1 arcsecond (~30 m) resolution
epsg = 4326
crs = f"epsg:{epsg}"
big_bounds = (-82, -56, -34, 13)  # South America
big_box = GeoBox.from_bbox(big_bounds, crs=crs, resolution=dx)
chunk_shape = 3600, 3600
big_ds = xr_zeros(big_box, chunks=chunk_shape).rename("big")
big_ds

Now let’s grab a sub-tile.

chunk_shape = 3600, 3600
tiles = GeoboxTiles(big_box, chunk_shape)

tile_idx = (0, 1)
tile_box = tiles[tile_idx]
tile_ds = xr_zeros(tile_box, chunks=-1).rename("small")
tile_ds

This dataset should slot right into the parent array, right?

Wrong!

The coordinates don’t align exactly

try:
    big_ds.sel(latitude=tile_ds.latitude, longitude=tile_ds.longitude)
except KeyError:
    # expected since values are not numerically identical
    pass

# does work, but inexact
big_ds.sel(latitude=tile_ds.latitude, longitude=tile_ds.longitude, method="nearest", tolerance=1e-13)

# attempting to align fills NaNs into the dataset
big_ds_aligned, _ = xr.align(big_ds, tile_ds, join="right")
big_ds_aligned.load().plot()

[figure: plot of big_ds_aligned, showing NaN fill where the coordinates failed to match]

Why does this matter? If I create a Zarr dataset from big_ds and then attempt to write tile_ds into it via region="auto", it fails:

memstore = zarr.MemoryStore()
big_ds.to_zarr(memstore, compute=False, consolidated=False)
tile_ds.to_zarr(memstore, region="auto")

# --> KeyError: "Not all values of coordinate 'longitude' in the new array were found in the original store. Writing to a zarr region slice requires that no dimensions or metadata are changed by the write."

Workarounds

There are many places in our stack where we could work around this problem (without solving it at its core). Two that come to mind are:

  1. We could use logical (rather than physical) coordinates, e.g. tile_ds.to_zarr(memstore, region={"latitude": slice(0,3600), "longitude": slice(3600, 7200)}), or use the low-level Zarr API to write the data (see the sketch after this list). This works today, but it requires more context: the information about these logical coordinates is not part of the Xarray dataset itself.
  2. odc.geo.geobox.GeoboxTiles could try to generate the coordinates more carefully, such that they exactly line up with the parent dataset.
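For concreteness, workaround 1 applied to the example above looks roughly like this (a sketch; it assumes tile (0, 1) occupies rows 0:3600 and columns 3600:7200 of the parent, renames the tile to match the stored variable, and drops the indexed coordinates so the tile’s inexact values don’t overwrite the parent’s):

# address the region by integer position instead of coordinate values
tile_ds.rename("big").drop_vars(["latitude", "longitude"]).to_zarr(
    memstore,
    region={"latitude": slice(0, 3600), "longitude": slice(3600, 7200)},
)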

Both workarounds ignore the fundamental problem: explicitly materializing these coordinates as floating-point data is both inefficient and inaccurate.

A Possible Solution

A true solution would involve the following elements:

  • Xarray implements a RangeIndex which is generated from essentially three parameters: start, stop, step. This index would support alignment, subsetting, etc. in a consistent way.
  • We define a simple way to serialize these indexes that only involves storing a few numbers (rather than a big materialized array), and implement the necessary encoding / decoding pathways in Xarray.
    • Eventually this could be proposed as a CF convention for implicit coordinates. However, a working prototype should come first.

For an MVP, we could limit this to rectilinear affine transforms, which are simpler and separable dimension by dimension.
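To make that concrete, the serialized form could be as small as this (a hypothetical sketch; neither the metadata layout nor the function name is an existing Xarray or CF convention):

import numpy as np

# hypothetical on-disk metadata for one implicit rectilinear coordinate:
# pixel-center start, step, and length, matching the example's longitude axis
meta = {"start": -82.0 + 0.5 / 3600, "step": 1 / 3600, "size": 172800}

def decode_range(meta):
    # materialize explicit values only when a consumer truly needs them
    return meta["start"] + meta["step"] * np.arange(meta["size"])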

I’d be very eager to get feedback from @Alex_Leith and @kirill.kzb on this proposal. It seems like solving this problem would resolve a lot of the limitations in using Xarray (and by extension, Zarr) for large geospatial rasters.


ENH: Generalize RangeIndex to support floats · Issue #46484 · pandas-dev/pandas · GitHub is the upstream issue in pandas to support floating-point start, stop, and step. I wonder whether that will be sufficient, or whether we need Decimal-style, arbitrary-precision support.

    • Eventually this could be proposed as a CF convention for implicit coordinates. However, a working prototype should come first.

In Represent (coordinate) variables "symbolically" · Issue #361 · fsspec/kerchunk · GitHub @dcherian pointed me to a CF convention which may already be what we need. I read it a few times and I think Deepak is right (not that I doubted him), but it’s a bit more complicated than just storing start, stop, step in the metadata.


Tom, thanks for sharing! It looks like that pandas issue is two years old? What are the odds of it moving forward?

Also relevant Xarray issue

Looks like @benbovy has been working on this.

Now that I look at it again, I don’t think it supports an arbitrary step. We’d have to save an array filled with the constant step value (which does compress really well, so maybe that’s not bad?)

EDIT: That’s not right. step is just the slope for the linear interpolator, so maybe it all works?

Hey @rabernat thanks for the shoutout!

I think @kirill.kzb is much better placed than me on this stuff; I struggle with the details a bit :)

Shoot me an email if you’d like me to facilitate something with the ODC community though, I can bring a few Australians along for a conversation.

This issue comes up every now and then:

It’s important to understand that GeoBox is recomputed from coordinates; that’s needed to support slicing into geo-referenced data, and also to support data constructed by other libraries. BUT there is no guarantee that this recomputed GeoBox will produce exactly the same coordinates when used to create a new array. Essentially, GeoBox -> coords -> GeoBox is not guaranteed to be lossless. It can only be lossless when both the resolution and translation components of the Affine matrix are basically integers. Sentinel-2 has that property, for example: scale = ±10, tx,ty = 10*N where N is an integer.

Not sure what a proper solution for this should be. We can keep track of the original resolution in an attribute of the coordinate, and use exactly that value when extracting a GeoBox from coords (with a check to deal with xx[::10, ::10]-style slicing). I guess we can also keep track of the original translation, and only recompute it if the array has been sliced. I’ll probably implement that for the next version of odc-geo, actually.

The problem of creating a sub-geobox that can produce exactly the same coords as the original geobox in the sliced section is much harder to address. We want this invariant:

gbox[roi].coords == gbox.coords[roi]

That’s not possible unless gbox[roi] retains the parent and the slice, and essentially returns self.parent.coords[self.roi]. Or we implement some kind of “rounding to some fraction of a pixel” logic.
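Reusing big_ds and tiles from the example at the top of the thread, the loss of this invariant is easy to observe (with a non-integer resolution like 1/3600, bitwise equality can fail):

import numpy as np

# longitudes of tile (0, 1) generated from its own GeoBox...
sub_lon = xr_zeros(tiles[(0, 1)], chunks=-1).longitude.values
# ...versus the same pixels sliced out of the parent's coordinates
parent_lon = big_ds.longitude.values[3600:7200]
print(np.array_equal(parent_lon, sub_lon))  # may be False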

I am suggesting we make a rather deep change to Xarray in which we no longer store coords explicitly as an array of floating-point values. Basically, we would keep the native GeoBox representation attached to the dataset as a sort of virtual coordinate, rather than requiring explicit coercion of the transform to a materialized array.

Could you clarify what you mean here? When does the “recomputing” happen? And what do you mean by “recomputed”?

It’s recomputed when you access it via xx.odc.geobox or xx.rio.transform(), as far as I can tell. That’s also what happens when saving to COG.

It would be great if it were easy to provide xarray with a “compute the mapping from pixel index to physical coords of the pixel center on the fly” hook; for geospatial data it would need to handle a 2d->2d mapping.

I am suggesting we make a rather deep change to Xarray in which we no longer store coords explicitly as an array of floating-point values. Basically, we would keep the native GeoBox representation attached to the dataset as a sort of virtual coordinate, rather than requiring explicit coercion of the transform to a materialized array.

That would be great, but the fundamental issue of figuring out the pixel correspondence between two different rasters with very similar, but not quite identical, coordinate systems will remain.


I’m delighted to read this, and appreciate all comments. I have a lot to say on this and have been working on materials to tell a few stories, and this makes that effort a lot easier to contend with. I’ll share some more thoughts soon.

I’d love to be involved in ongoing discussions and I will at least be a heavy tester of changes.

This is what I think would be helpful in xarray to support the GeoBox use-case.

Say you have a 4d array with dimensions TYXB, an RGBA datacube with several timestamps. The Y and X dimensions (1 and 2) form a pixel plane for which a mapping from pixel coordinates to world coordinates exists, possibly defined by a linear mapping (Affine matrix [1]), or maybe GCPs [2], RPCs [3], or reduced-resolution location arrays.

The only way to handle this data in the generic case (anything but axis-aligned imagery) is to construct two 2d coordinate arrays; this adds 16 bytes per pixel, which is a huge overhead for what is typically 2-byte data to begin with.

Instead, if I could tell xarray that:

  1. Dimensions 1,2 are “linked”
  2. Given an index into the linked dimensions Tuple[int,...], one can compute coordinates Tuple[float, ...] using a provided Callable (see the sketch after this list).
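A sketch of what such a Callable could look like for the affine case (hypothetical; a GCP- or RPC-based dataset would supply a different implementation behind the same signature):

from typing import Tuple
from affine import Affine

# a rotated affine mapping: the pixel plane is deliberately not axis-aligned
transform = Affine.rotation(30) * Affine.scale(10, -10)

def pix2world(iy: int, ix: int) -> Tuple[float, float]:
    # world coordinates of the center of pixel (row iy, col ix)
    x, y = transform * (ix + 0.5, iy + 0.5)
    return (y, x)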

A lot of data providers ensure that the mapping from the pixel plane to the world plane can be done independently for the X and Y coordinates, and that pixels are square; it’s easier to reason about data that way. But that doesn’t work for data distributed without post-processing, i.e. the raw image as seen by the sensor, accompanied by a set of pixel locations with known spatial coordinates (GCPs).

But even for data processing tasks, picking a pixel plane that is not aligned with the world plane can be helpful; see [4] for an example where rendering into a rotated plane significantly reduces the memory required for the result. Too bad these are too hard to work with under the current assumptions of the ecosystem of tools.

References:

[1] GeoBox Model — odc-geo 0.4.3 documentation
[2] Raster Data Model — GDAL documentation
[3] RFC 22: RPC Georeferencing — GDAL documentation
[4] Generating Rotated Images to Save Space · opendatacube/odc-stac Wiki · GitHub

I am suggesting we make a rather deep change to Xarray in which we no longer store coords explicitly as an array of floating-point values. Basically, we would keep the native GeoBox representation attached to the dataset as a sort of virtual coordinate, rather than requiring explicit coercion of the transform to a materialized array.

It would be great if it were easy to provide xarray with a “compute the mapping from pixel index to physical coords of the pixel center on the fly” hook; for geospatial data it would need to handle a 2d->2d mapping.

Instead, if I could tell xarray that: 1. Dimensions 1,2 are “linked” 2. Given an index into the linked dimensions Tuple[int,...], one can compute coordinates Tuple[float, ...] using a provided Callable.

I haven’t read this thread in detail yet, but this is something that would (kind of) work today without requiring a deep change to Xarray.

This notebook provides an example that might be relevant. It implements an Xarray 2D WCSIndex that wraps an astropy.wcs.WCS object (thereby keeping track of the world coordinate parameters), which is attached to lazy (virtual) coordinates that can be generated on the fly. WCSIndex also supports selecting data using world coordinate labels via the astropy WCS object, such that the result has consistent world and pixel coordinate values. There is a related discussion here.

I can imagine a similar index wrapping a GeoBox.

Xarray still has some limitations and doesn’t work properly with lazy indexed coordinates, but hopefully this will be addressed soon (https://github.com/pydata/xarray/pull/8124).

EDIT: once I start working on Xarray indexes again (shortly), I’m happy to provide further guidance on how to implement such an index.


the fundamental issue of figuring out the pixel correspondence between two different rasters with very similar, but not quite identical, coordinate systems will remain.

Xarray indexes may also implement custom logic for aligning / re-indexing different datasets, which makes this possible too, I think.

I double, triple second the case for analytic/lazy coordinates in xarray, whether astro WCS-style or geo - these are essentially the same for affine transforms. I would hope we can support these and others (medical??).

It’s been a long time coming! All the code to align, slice, reproject and otherwise manipulate such coordinates already exists in various places.

(in response to @kirill.kzb)

The only way to handle this data in the generic case (anything but axis-aligned imagery)

That’s interesting. I think xarray should not try to “handle” the generic case but simply read what is there: if it’s a geotransform, read those six numbers (or represent them as four explicitly in the simplest case), ready for materializing regular 1D coord arrays; if it’s GCPs/RPCs/geolocation arrays, read those too, but as-is. They are for the warper API (or ESMF or similar) to use. It seems like you’re suggesting xarray should build in geospatial-resolving workflows, but I would say that is definitely out of scope. With the geolocation numbers or arrays, just represent them and hand them along when the time comes.

Realizing a potential gotcha in what I’m saying: yes, the geotransform case is always automatically handled now, with labelled coord arrays, but that’s the same divide in GDAL itself between Translate and Warp. It’s always georeferenced/axis-aligned in the first instance, but a dataset must stream through the warper for all other geolocation methods: Translate will simply write the array as-is, un-georeferenced, and copy the geolocation arrays to the target (somewhat format-dependent), while the warper will always resolve to an axis-aligned target - in longlat by default - or to a target specified by the user from extent/dim/resolution/crs, with missing elements of that spec inferred by the suggested-warp heuristics.

As part of this, I think xarray should really have a look at some unnecessary extra layers between it and the GDAL library (not a downstream package) and see what consolidation is desirable.
(That’s part of what I’ll be exploring as input to this effort).

1 Like

Could you expand on this? Be more specific? What layers are you referring to?

The stack as I see it is xarray → rioxarray → rasterio → gdal.

Yes - are you suggesting xarray have its own GDAL API as such, given GDAL is a much broader superset?

In terms of consolidating the package landscape, Ryan said something similar here:

https://discourse.pangeo.io/t/comparing-odc-stac-load-and-stackstac-for-raster-composite-workflow/

and I expect many opendatacube folks would describe the stack as

xarray → odc → rasterio → gdal

As far as I understand it, rioxarray didn’t exist when odc was being born, and after some initial confusion (as an R user in this space) I realized why they both exist.

The other main thing is that rasterio is not GDAL, and GDAL ships with entirely sufficient Python bindings of its own in osgeo.gdal/ogr/osr (and others). A similar problematic overlap exists in R, and R does not yet have development-ready bindings to the library (each significant project has built its own bindings as needed, and the community is essentially forked into at least two, arguably more, camps - though there is a hopeful contender for real API access via the {gdalraster} package). In my Python journey I haven’t found a use for rasterio; I get what I need from the osgeo bundle. This comes after a journey in R from high-level “GIS-like” packages down to the actual core library, and that has significant benefits in efficiency and control compared with going through a higher-level layer.

To talk about a couple of specifics: in terms of generating coordinates from a regular grid specification, rasterio is overkill and GDAL is overkill; it’s very simple arithmetic to generate these, and there’s a whole family of grid-logic tools that I would say belong in this space as well. Affine would be a sensible place to start, and even it overlaps with GeoBox, so maybe that’s something to consider too: GitHub - rasterio/affine: Affine transformation matrices
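For the axis-aligned case, the arithmetic in question really is minimal (a sketch using only the six GDAL GeoTransform numbers and numpy):

import numpy as np

def grid_coords(gt, nrows, ncols):
    # gt = (x0, dx, 0, y0, 0, dy): an axis-aligned GDAL GeoTransform
    x0, dx, _, y0, _, dy = gt
    x = x0 + dx * (np.arange(ncols) + 0.5)  # pixel-center x coordinates
    y = y0 + dy * (np.arange(nrows) + 0.5)  # pixel-center y coordinates
    return y, x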

Also, there is the multi-dim model in GDAL, which would be a really excellent project for xarray to leverage; afaik rasterio doesn’t do anything in this space (and fair enough, its scope is already wide enough). Multi-dimensional support · Issue #1759 · rasterio/rasterio · GitHub

(Also, please be assured that I’m definitely not criticizing any packages or choices made here; I know that at every step difficult choices were made and hard work was done. I’m just riffing on the lovely vibe I see generally in Python of scanning the landscape and consolidating where possible. I’m super impressed by so much in the Python and xarray world, and I think it’s very valuable to compare the choices that have been made in R, Python, and other languages.)

Abstract tools for grid logic are really powerful and need to be championed way more, IMO (I’m not across the Python space enough yet; maybe they exist). Cell indexes, treating pixel position abstractly by index or row/column, and helper functions for that are simple: functions of dimension and extent (or dimension and transform for the more general cases). I have my own versions in R, and it was very valuable to separate that logic out of the geospatial package so it’s not bound to a format or data at all. In my dreams I would separate this code out of GDAL itself and have it as a nice standalone library that GDAL perhaps includes.


Absolutely - I would wish that these API bindings were also shared across languages; note that SWIG is used in GDAL itself to generate bindings for Python, C#, Perl, and others. Arrow has clearly shown the value of consolidation across languages. I have no idea, technically, whether R could have its own osgeo.gdal, but that would be my first choice there. ({gdalraster} is looking pretty good; I otherwise can’t do my own work without it or without my own crafted bindings - the high-level packages just preclude some of the best features of the library.)