Example which highlights the limitations of NetCDF-style coordinates for large geospatial rasters

This is now being done in Add raster index by benbovy · Pull Request #846 · corteva/rioxarray · GitHub with a custom RasterIndex. Feedback welcome!

(I’m not exactly sure where this should be implemented ultimately as most of the logic could also be useful in other libraries such as odc-geo).

3 Likes

It’s been a while, but I’ve spent time exploring the issues and now have a clear opinion on the best way to move forward.
In the GDAL data model, the GeoTransform defines an affine transformation—typically representing regular grids with optional rotation—that provides implicit coordinates for raster pixels.

GDAL can converts coordinates defined as an affine transformations into explicit coordinate arrays (NetCDF-style). However the transformation can lead to inaccuracy issues due to the finite precision of floating-point types. Moreover, it appears sometimes optimal to encode coordinates as an affine transformation (aka implicit coordinates), allowing libraries to compute indices using this function rather than iterating over a large array of values.

CF (and thus NetCDF) supports several grid types, always encoded as an array of the coordinates values (explicit coordinates):

  • Regular (rectilinear) grid: coordinates have constant spacing along each axis.
  • Rectilinear grid: axes are straight and aligned with coordinate axes but may have non-uniform spacing.
  • Curvilinear grid: coordinates vary along both axes and form curved, structured grids defined by 2D coordinate arrays.
  • Unstructured grid (more advanced case): uses 1D or 2D coordinate arrays with an explicit connectivity (see mesh topology) to describe irregular spatial relationships.

“Affine" transformations (including rotation, skew, etc.) are not explicitly part of the CF model; coordinate systems can imply affine transformations, but CF does not support general affine transforms in the grid definition. Note that CF conventions enable the implicit coordinates encoding through the Subsampling, but only for regular grids.

While it’s clear that the GeoZarr data model should include affine transformation encoding, there are multiple ways of encoding these implicit coordinates (derived from affine transformation) in Zarr:

  • GeoTransform : affine transform (six coefficients) from GDAL data model
  • ModelTiepointTag & ModelPixelScaleTag : defines pixel size and origin from GeoTiff
  • ModelTransformationTag: defines a full 4×4 transformation matrix from GeoTiff
  • pandas.RangeIndex : start, step, count from xarray/pandas
  • Coordinate Subsamplng from CF (with possible extension)
  • Using geographic coordinates and resolution from OGC API Coverage (OGC API - Coverages - Part 1: Core)
  • GridFunctions from OGC GML

While each approach warrants deeper investigation, the pragmatic and efficient convention used by GDAL when converting GeoTIFF to NetCDF or Zarr is preferred: the CRS is translated into a standard CF grid_mapping variable, and the GeoTransform is included as an attribute of that variable.

To support multidimensional data (3D+), the CF conventions should be extended to allow a zero-dimensional coordinate variable with a grid_mapping and standard_name, (and probably an coordinate type attribute) enabling libraries such as GDAL or xarray to compute coordinates implicitly from the affine transformation, avoiding the need for explicit coordinate variable. Note that the dimension shapes are already provided in the variable model.

Example below:

dimensions:
  x = 1000 ;
  y = 1000 ;

variables:
  float temperature(y, x) ;
    temperature:units = "K" ;
    temperature:standard_name = "sea_surface_temperature" ;
    temperature:grid_mapping = "crs" ;

  float x(x) ;
    x:standard_name = "projection_x_coordinate" ;
    x:units = "m" ;
    x:grid_mapping = "crs" ;
    x:coordinate_type = "GeoTransform" ;

  float y(y) ;
    y:standard_name = "projection_y_coordinate" ;
    y:units = "m" ;
    y:grid_mapping = "crs" ;
    y:coordinate_type = "GeoTransform" ;

  char crs ;
    crs:grid_mapping_name = "transverse_mercator" ;
    crs:GeoTransform = "500000 30 0 0 0 -30" ; // origin_x, pixel_width, rotation_x, origin_y, rotation_y, pixel_height
   crs_wkt : PROJCS["WGS 84 / UTM zone 30N",GEOGCS["WGS 84",DATUM["World Geodetic System 1984",SPHEROID["WGS 84",6378137,298.257223563]],PRIMEM["Greenwich",0],UNIT["degree",0.0174532925199433,AUTHORITY["EPSG","9122"]]],PROJECTION["Transverse_Mercator"],PARAMETER["latitude_of_origin",0],PARAMETER["central_meridian",-3],PARAMETER["scale_factor",0.9996],PARAMETER["false_easting",500000],PARAMETER["false_northing",0],UNIT["metre",1,AUTHORITY["EPSG","9001"]],AXIS["Easting",EAST],AXIS["Northing",NORTH]]

global attributes:
  title = "SeaSat Sea Surface Temperature" ;
  Conventions = "CF-1.8" ;

Edit: I removed the CRS expanded attributes. The expanded form allows to describe some projection types (sinusoidal, rotated_latitude_longitude, …) with only a few attributes. But to be discuss to make mandatory the GeoTransform (or equivalent) attribute and WKT in the GeoZarr spec of such implicit coordinates.

Looking forward your feedback.

1 Like

Let me just mention the thread on GeoZarr: Understanding concerns with CF encoding of CRS · Issue #20 · zarr-developers/geozarr-spec · GitHub

1 Like

Thanks for a detailed proposal. This looks all fine to me, for the exception of the location of GeoTransaform attribute: I’m not sure it belongs with the grid_mapping coordinate. If one were to store multiple variables with different resolutions (e.g. 15m, 30m, 60m bands of Landsat), one would have to define multiple versions of crs, each with the same grid_mapping_name and crs_wkt, but with different GeoTransform attributes. My, admittedly limited, understanding of grid_mapping in CF convention is that it describes SRS only, and not relation between pixel and world coordinates. I assume there is data out there that references the same grid_mapping coordinate/variable for rasters with varying resolutions.

I would put GeoTransfrom attribute on the x coordinate, and possibly repeat it on the y coordinate.

  float x(x) ;
    x:standard_name = "projection_x_coordinate" ;
    x:units = "m" ;
    x:grid_mapping = "crs" ;
    x:GeoTransform = "500000 30 0 0 0 -30" ; // GDAL GeoTransform

  float y(y) ;
    y:standard_name = "projection_y_coordinate" ;
    y:units = "m" ;
    y:grid_mapping = "crs" ;
    y:GeoTransform = "500000 30 0 0 0 -30" ; // GDAL GeoTransform

  char crs ;
    crs:grid_mapping_name = "transverse_mercator" ;
    crs:crs_wkt = 'PROJCS["WGS 84 / UTM zone 30N",GEOGCS["WGS 84",DATUM["World Geodetic System 1984",SPHEROID["WGS 84",6378137,298.257223563]],PRIMEM["Greenwich",0],UNIT["degree",0.0174532925199433,AUTHORITY["EPSG","9122"]]],PROJECTION["Transverse_Mercator"],PARAMETER["latitude_of_origin",0],PARAMETER["central_meridian",-3],PARAMETER["scale_factor",0.9996],PARAMETER["false_easting",500000],PARAMETER["false_northing",0],UNIT["metre",1,AUTHORITY["EPSG","9001"]],AXIS["Easting",EAST],AXIS["Northing",NORTH]]'

To me this makes more sense, as we are replacing fully rendered coordinate x(x) and y(y) with “lazy” versions of the same, so storing needed attributes on the x,y coords makes most sense to me.

for the exception of the location of GeoTransaform attribute: I’m not sure it belongs with the grid_mapping coordinate.

@kirill.kzb FYI in the geozarr thread mentioned by @Christophe_Noel there are proposals of encoding the affine transform in its own variable (from Understanding concerns with CF encoding of CRS · Issue #20 · zarr-developers/geozarr-spec · GitHub on). Later on there has been some new discussion on how to reuse CF’s tie point coordinates: CF Coordinate Subsampling (section 8.3) and affine transformations · cf-convention · Discussion #411 · GitHub.

as we are replacing fully rendered coordinate x(x) and y(y) with “lazy” versions of the same

This is exactly what we are currently doing in GitHub - dcherian/rasterix: raster tools for xarray via a custom Xarray RasterIndex! Any chance of coordinating the work on this?

4 Likes

Alright check this out. A group of us (@benbovy, @scottyhq,@maxrjones, and I) have put together a new RasterIndex that allows indexing using the affine transform. Let’s try it.

Here is ryan’s code from earlier (collapsed). It creates big[latitude: 248400, longitude: 172800] and small[latitude: 3600, longitude: 3600].

import warnings

import xarray as xr
import zarr
from odc.geo.geobox import GeoBox, GeoboxTiles
from odc.geo.xr import xr_zeros

warnings.filterwarnings("ignore")

dx = 1 / 3600  # 30m resolution
epsg = 4326
crs = f"epsg:{epsg}"
big_bounds = (-82, -56, -34, 13)  # South America
big_box = GeoBox.from_bbox(big_bounds, crs=crs, resolution=dx)
chunk_shape = 3600, 3600
big_ds = xr_zeros(big_box, chunks=chunk_shape).rename("big")


chunk_shape = 3600, 3600
tiles = GeoboxTiles(big_box, chunk_shape)

tile_idx = (0, 1)
tile_box = tiles[tile_idx]
tile_ds = xr_zeros(tile_box, chunks=-1).rename("small")

Now with rasterix.RasterIndex.

import rasterix

bigi = rasterix.assign_index(big_ds)
tilei = rasterix.assign_index(tile_ds)

This will auto-detect an affine transform and create lazy coordinate variables. Importantly, alignment is done using the transform.

>>> xr.align(bigi, tilei, join="right")
(<xarray.DataArray 'big' (latitude: 3600, longitude: 3600)> Size: 104MB
 dask.array<getitem, shape=(3600, 3600), dtype=float64, chunksize=(3600, 3600), chunktype=numpy.ndarray>
 Coordinates:
   * longitude    (longitude) float64 29kB -81.0 -81.0 -81.0 ... -80.0 -80.0
   * latitude     (latitude) float64 29kB 13.0 13.0 13.0 13.0 ... 12.0 12.0 12.0
     spatial_ref  int32 4B 4326
 Indexes:
   ┌ longitude  RasterIndex (crs=None)
   └ latitude,
 <xarray.DataArray 'small' (latitude: 3600, longitude: 3600)> Size: 104MB
 dask.array<zeros_like, shape=(3600, 3600), dtype=float64, chunksize=(3600, 3600), chunktype=numpy.ndarray>
 Coordinates:
   * longitude    (longitude) float64 29kB -81.0 -81.0 -81.0 ... -80.0 -80.0
   * latitude     (latitude) float64 29kB 13.0 13.0 13.0 13.0 ... 12.0 12.0 12.0
     spatial_ref  int32 4B 4326
 Indexes:
   ┌ longitude  RasterIndex (crs=None)
   └ latitude)

And reindex:

bigi.reindex_like(tilei)

EDIT: I’ll add the disclaimer that this stuff is lightly tested and alpha-quality, but it’s looking quite good so far!

EDIT 2: A very cool aspect is that this Index is built on a CoordinateTransformIndex class provided by Xarray, with the coordinate transform described by a CoordinateTransform class. As you can see, Xarray’s goal here is to make it easy to build such indexes :).

9 Likes

That’s so freaking cool! Xarray flexible indexes FTW!