Example which highlights the limitations of NetCDF-style coordinates for large geospatial rasters

Think I saw something about removing Perl bindings?

In terms of users of ODC - that would be true for people at this end of things, but the general users might see it as ‘xarray is this thing I have to play around with when doing work in the ODC tool people have set up for me’?

And for more general users interfacing - osgeo [ugly] - rasterio [bit less ugly] - rioxarray [etc]

Uses for rasterio are generally - environment wrapping, minimal installs - e.g. micromamba-ing some cloud deployment into exists or it already exists on something not changeable and some prior vrt shenanigans.

1 Like

@Michael_Sumner xarray library is more like numpy, but at higher level of abstraction. It’s way more generic than geospatial application you are thinking about. And it’s a good thing.

All these other geospatial libraries (datacube including) leverage xarray machinery to provide a convenient interface to raster data that DOES have geo-referencing information present. Just like xarray itself leverages machinery provided by numpy and dask that sit at a lower level of abstraction.

Problem is that there is no fully established convention across those libraries as to how represent geospatial metadata for an xarray.DataArray object in Python itself, nothing to do with on-disk format. We don’t even have an agreed upon goto module for representing CRS. Under the hood it all goes to PROJ C code, but some get there via rasterio.crs, some go via pyproj, some via gdal bindings, some via custom Cython code.

GDAL is great and essential piece of software that IS used by a great deal of geospatial python modules, but usually NOT via GDAL python bindings. It’s not because GDAL bindings are not Pythonic enough, it’s because installation is such a pain. GDAL python module does not ship any binary wheels, so you need to compile it, AND you need to compile it against exactly that version of GDAL libraries. Sure, it’s easy enough with conda, but you can’t only support conda based workflows. I get why GDAL only offers source distribution, nothing wrong with that. But this is why most choose to depend on rasterio when they need to use a tiny part of what GDAL has on offer. Because Sean of rasterio did the hard work of setting up binary distribution of GDAL C++ library and also provided a convenient Python bindings to a subset of GDAL functionality that deals with rasters.

Compare rasterio vs GDAL on pypi.

1 Like

:point_up: this strikes me as a very important and very solvable problem. Several years ago, there was some hope that geoxarray could become this. Recently geoxarray development has picked back up again. @djhoese - do you have any thoughts on how geoxarray might fit in here?

In the meantime, I think odc.geo and its GeoBox has emerged as a great solution. I have been using it more recently, and I’m really impressed. Excellent API and user experience.

1 Like

Thanks for bringing me in Ryan. I’ve been struggling with these same issues. I come from a place of using pyresample for many years for my geo-related use cases. In pyresample we have a single object representing all geo “stuff” and it generally comes down to two types: an area and a swath. An area can be thought of affine + CRS information. A swath handles our cases for atmospheric polar-orbiting satellites where pixels are not uniform but we refer to them by their center points in lon and lat 2D arrays.

Geoxarray was created to kind of take my pyresample use cases and make them (more) accessible from an xarray interface (a .geo accessor). Development has been slow not only because pyresample’s existing objects are good enough for me, but because:

  1. I need to support pure numpy arrays in my pyresample code too so it seems foolish to have two code paths (one for xarray and one for numpy arrays) that have the same end result.
  2. I struggle with putting the different parts of my geolocation understanding into an xarray object. I’ve always thought of it as these separate “area”-like objects so splitting it and hiding it into coordinates (spatial_ref/CRS + x + y spatial coordinates) inside another object (DataArray/Dataset) seems unhelpful.
  3. The biggest benefit for something like geoxarray is likely to incorporate indexing with Index objects, but I haven’t done that because I’m not up to date on the upstream xarray differences and other projects have done so much more work on finalizing this functionality.
  4. There is no consistent on-disk format for all our use cases that also maps to the Xarray model or isn’t needlessly over-complicated for my use cases.

I do think geoxarray is still in a good place to be a dumping ground for all dimension and coordinate stuff when it comes to geolocation, even if it isn’t be used that way yet.

As far as I know pyproj and rasterio CRS objects are pretty interoperable at this point. That is, their __init__ or .from_user_input methods can convert CRS objects from the other library. If you depend on rasterio/GDAL then it makes sense to me to use rasterio’s CRS object, but otherwise I think it makes more sense to depend on pyproj given there is no GDAL dependency.

2 Likes

It’s good that they are construction time interoperable, but that’s only good enough when you treat CRS as opaque property to carry through processing pipeline: load | transform | save. For that purpose simple WKT string is good enough also. What if one wanted more, something common, like “are these two CRS equivalent”, which is surprisingly non-trivial.

Wish we just settled on pyproj.CRS as the primary class for exposing CRS info in a richer format than just text data.

It’s also worth mentioning that with pypi based installs you are likely to end-up with multiple versions of PROJ library shipped by multiple modules. So construction time compatible property might not hold for some of the less used projections.

1 Like

I have proposed a very concrete task that could be used to advance this issue in this discussion:

Try to modify RioXarray’s generate_spatial_coords to create a custom Range index from the affine parameters instead of a explicit x, y coordinate variables.

Is anyone here susceptible to nerd swiping? :nerd_face:

2 Likes

Thanks Ryan, I’m going to try to follow this. I have been trying to summarize the space between “geospatial” and “xarray” thinking for a while, and I realized a way to couch the issues - my intention is to illustrate each of these scenarios from real-world data sets, but it’s going to take a while to do that properly. I want to put this framing in front of the community because it’s not as simple as lazy-fying the affine transform to xarray coords (because, sometimes the data is stored in a funky way and we want the user to be able to fix that in a virtual way).

A framing, illustrations to come, or on request. :pray:

Some grids are not actually rectilinear or curvilinear.

  • a grid can present as rectilinear because of numeric precision
  • a grid can present as curvilinear because of explicit storage of longlat arrays, although the data is a regular projected crs grid
  • other situations are possible, and the inexactness or inability to determine that coordinates are regular (or intended to be) causes confusion and technical problems

NetCDF and xarray atm will by requirement present as rectilinear by expanding a regular to explicit centre-point coord arrays, I call this degenerate rectilinear and I think it’s an accurate description, and apt. (But if there’s a better terming I’m glad to hear it. Degenerate in terms of redundancy, I see it as an entropy-increasing problem).

I think that we can identify a family of cases, valid and desirable, but a bit tough to separate access intentionally atm.

  • (A) a non-georeferenced array (this rightly sits in [x: 0,ncol, y: 0,nrow] or [x: 0,1, y:0,1] for the relevant axes)
  • (B) a georeferenced array in standard geospatial affine (either corner-aligned bbox, or actual offset/scale/shear)
  • (C) a degenerate rectilinear form of (B), we store xs and ys explicitly as centre-coords (but they are a monotonic regular-increasing sequence (or decreasing, still equivalent))
  • (D) a degenerate rectilinear form of (B), but untransformed from native projected to longitude and latitude (i.e. old Aviso Mercator stored in longlat rectilinear netcdf)
  • (E) a degenerate curvilinear form of (B), we store every longitude and latitude untransformed from the projected transform
  • (F) a degenerate curvilinear form of (C), we store every native X and Y from the projected transform (this never happens, though native longlat stored this way is not out of the question, equivalent but less compact than (C))
  • (G) truly actual rectilinear (but note cases of (D))
  • (H) truly actual curvilinear (but note cases of (E)), and all the complex variants above (Arakawa schemes etc)

Note (D) can work because longlat and Mercator are (mostly) isomorphic, it’s not generally possible and sometimes you see this done but totally in error.

I think the user-space needs to be able to reach in and lightly change the intepretation when they know this is the case (there’s VRT, kerchunk, and other virtualizations maybe there’s a way to consolidate).

GDAL can resolve all of these cases (within limits of course, relevant to the data itself and what it represents) by driving the interpretation through the warper, the coordinate arrays present as “geolocation arrays” in metadata, or can be assigned as such.

I’m not saying this in criticism the rioxarray Range proposal, that sounds good to me and I’m keen to explore how it sits in this framing. I want to know if others think these cases are important or if I’ve missed anything, thanks!

I thought of another case, an affine transform with actual non zero shear terms, in xarray now this would be curvilinear - it’s an example of (E) but not by unprotecting to longlat, and it’s certainly reducible to generator params but it must be by a coupled x,y coord materializer.

@Michael_Sumner I think this framework is extremely useful. Your categorization reminds me of the VTK data model.


[via https://www.kitware.com/easy-data-conversion-to-vtk-with-python/]

Using your lexicon, I’d say what I am proposing with the rioxarray task is to go from the current situation – (C) - to a better place - (B). If the affine transform has non-zero shear terms, then it would be from (E) to (B).

Baby steps! If we can solve this, then it will light the way towards solving the more complex cases. A good general principle goal be to eliminate all of the degenerate representation and replace them with the more compact, exact ones.

2 Likes

Excellent, I’m really enjoying writing up examples in Python now.

Thinking of very simple cases, the ungeoreferenced matrix itself has a bit to be said about it here - and seeing those vector vtk illustrations reminds me of intermediate representations, which I realize now is lazy evaluation inherently.

The extreme case of converting pixels to polygons, there are multiple steps available before full realization - edge points instead of centres, and indexed edges (or quads or triangles) present varying compaction and application opportunity.

I.e only generate unique pixel corners, it’s simple templating and grid logic to index those by geometric primitives, and primitives are easier to bend across coordinate systems, can be materialized incrementally - a rich family of these could help bridge pixel grids to meshes generally and to global grids too.

I can’t stop thinking now about how this is about intermediate forms generally. (And how raster and vector are two ends of a continuum, not actually disparate types).

I’m writing in a few examples for a framing, and yeah baby steps but big dreams :smile:

Very true. I think I inherently bring this perspective coming from the world of climate modeling, where we think very explicitly about finite grid cell geometry. The word “mesh” is very apt here.

1 Like

I have spent quite some time trying to address Ryan’s second call of action and would like to propose a solution.

What I have come to the conclusion is that we need to store the following implicit minimum metadata information which allows us to calculate the explicit coordinates:

  • GeoTransform
  • Number of lat/lon
  • For chunks, chunk reference index

With the option to store additional information around how the coordinate expresses its place in a grid cell (i.e. center, or elsewhere, right now I see lots of hard-coded assumptions that assume center). In CF world there is the optional bounds parameter that would allow you also to derive this.

Below I show I can accurately derive the big dataset explicit coordinates using just the GeoTransform and knowing the number of lats/lons.

from odc.geo.geobox import GeoBox, GeoboxTiles
from odc.geo.xr import xr_zeros
import xarray as xr
import numpy as np
import warnings
warnings.filterwarnings('ignore')

dx = 10 / 3600  
epsg = 4326
crs = f"epsg:{epsg}"
big_bounds = (-82, -56, -34, 13)  # South America
big_box = GeoBox.from_bbox(big_bounds, crs=crs, resolution=dx)
chunk_shape = 3600, 3600
big_ds = xr_zeros(big_box, chunks=chunk_shape).rename("big")


chunk_shape = 30, 30
tiles = GeoboxTiles(big_box, chunk_shape)

# Choosing a more middle tile
tile_idx = (2, 2)
tile_box = tiles[tile_idx]
tile_ds  = xr_zeros(tile_box, chunks=-1).rename("small")

# Extract geotransform properties
# Critical metadata we need to save
big_geotransform = big_box.affine
small_geotransform = tile_box.affine

# Going to make the argument that only the big box affine needs to be saved
# along with number of lat/lon and tile index

# Extract the affine transformation parameters for the big box
# Critical metadata to have the geotransform saved
a, b, c, d, e, f = big_geotransform[0:6]

# Calculate the center of the first pixel
# Critical metadata we need predescribed, assuming it's the center
# In CF-conventions can derive this from lat_bounds and lon_bounds
pixel_center_x = c + a / 2 # big_ds_lon_start
pixel_center_y = f + e / 2 # big_ds_lat_start

# Generate latitude and longitude arrays representing the center of the pixels
# Critical metadata data needed to know the number of lat/lon
big_ds_derived_latitudes = np.array([pixel_center_y + e * i for i in range(len(big_ds.latitude))])
big_ds_derived_longitudes = np.array([pixel_center_x + a * i for i in range(len(big_ds.longitude))])

# Confirm these match
assert np.array_equal(big_ds_derived_latitudes, big_ds.latitude.values), "Arrays are not exactly equal"
assert np.array_equal(big_ds_derived_longitudes, big_ds.longitude.values), "Arrays are not exactly equal"

This should prove my above point. The issue indeed comes when I try to place a tile back into the original big_ds as Ryan tries.

# Attempting to align causes nans to be filled into the dataset

big_ds_aligned, _ = xr.align(big_ds, tile_ds, join="right")

big_ds_aligned.load().plot()

# Let's actually look at how much they are off... 

# Calculating the offsets in terms of pixels for the tile extracted
# using the small geotransform, but we don't need to do this in this exact way
# if I have the tile index information
x_offset = int((small_geotransform.c - big_geotransform.c) / big_geotransform.a)
y_offset = int((big_geotransform.f - small_geotransform.f) / big_geotransform.e)

assert np.array_equal(big_ds.latitude.values[y_offset:y_offset+chunk_shape[0]], tile_ds.latitude.values), "Arrays are not exactly equal"
assert np.array_equal(big_ds.longitude.values[x_offset:x_offset+chunk_shape[1]], tile_ds.longitude.values), "Arrays are not exactly equal"


---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
Cell In[95], line 9
      6 x_offset = int((small_geotransform.c - big_geotransform.c) / big_geotransform.a)
      7 y_offset = int((big_geotransform.f - small_geotransform.f) / big_geotransform.e)
----> 9 assert np.array_equal(big_ds.latitude.values[y_offset:y_offset+chunk_shape[0]], tile_ds.latitude.values), "Arrays are not exactly equal"
     10 assert np.array_equal(big_ds.longitude.values[x_offset:x_offset+chunk_shape[1]], tile_ds.longitude.values), "Arrays are not exactly equal"

AssertionError: Arrays are not exactly equal

They are not equal… but by how much

not_equal_indices = np.where(np.logical_not(np.equal(big_ds.longitude.values[x_offset:x_offset+chunk_shape[1]], tile_ds.longitude.values)))[0]

big_ds.longitude.values[x_offset:x_offset+chunk_shape[1]][not_equal_indices[0]] - tile_ds.longitude.values[not_equal_indices[0]]

-1.4210854715202004e-14

I think what we are skimming over is this part that DOES work, but has extremely small floating point inaccuracies:

# They appear to match 
# This does work, but inexact
selection_of_big_ds = big_ds.sel(latitude=tile_ds.latitude, longitude=tile_ds.longitude, method="nearest", tolerance=1e-13)

To me, this is something that the geospatial world has often ignored by implementing the concept of method = “nearest”.

So my question - can we move forward with accepting the three implicit metadata fields I outline above as a start to find “the simple way that only involves storing a few numbers (rather than a big materialized array)”?

One last edit:

To resolve this specifically, you would always reference big_ds GeoTransform and the offsets which are implied with the tile indexes and chunk size.

5 Likes

Sorry for taking so long to reply to this great post Brianna.

I think can definitely move forward with this. I have come to see that my example mixes up two different issues which are kind of separable:

  1. How to compactly store the “GeoBox” of a single geospatial raster image in Xarray. Here is seems quite obvious that we should just support storing the GeoTransform and building custom indexes / serializers for it.
  2. How to correctly compute the coordinates for “tiled” rasters given finite floating-point precision. Here it seems inevitable that reconstructing an individual tile’s exact position within a tile grid without any information about the broader grid embedded in the tile will always be problematic. This is true regardless of whether you are using explicit coordinates or geotransform-based coordinates.

So I think we should move forward on 1 without requiring that we also solve 2.

2 Likes