I have spent quite some time trying to address Ryan’s second call to action and would like to propose a solution.
My conclusion is that we need to store the following minimal implicit metadata, which allows us to calculate the explicit coordinates:
- GeoTransform
- Number of lats/lons (i.e. the grid shape)
- For chunks, chunk reference index
Optionally, we could also store information about where the coordinate sits within a grid cell (i.e. center or elsewhere; right now I see lots of hard-coded assumptions of center). In the CF world, the optional bounds parameter would also let you derive this.
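As an aside on the CF bounds route, here is a minimal numpy sketch (the names `lon_bnds` and `lon_center`, and the 5-cell grid, are hypothetical, not from this example) of how stored bounds let you recover the coordinate's place in the cell instead of hard-coding center:

```python
import numpy as np

# Hypothetical CF-style 1-D longitude bounds: 5 cells of width dx,
# starting at -82 (illustrative numbers only).
dx = 10 / 3600
edges = -82 + dx * np.arange(6)
lon_bnds = np.stack([edges[:-1], edges[1:]], axis=1)  # shape (5, 2)

# With bounds stored, no "center" assumption is needed: a
# center-registered coordinate is just the midpoint of each cell.
lon_center = lon_bnds.mean(axis=1)
```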
Below I show that I can accurately derive the big dataset's explicit coordinates using just the GeoTransform and the number of lats/lons.
from odc.geo.geobox import GeoBox, GeoboxTiles
from odc.geo.xr import xr_zeros
import xarray as xr
import numpy as np
import warnings
warnings.filterwarnings('ignore')
dx = 10 / 3600  # 10 arc-second pixel size, in degrees
epsg = 4326
crs = f"epsg:{epsg}"
big_bounds = (-82, -56, -34, 13) # South America
big_box = GeoBox.from_bbox(big_bounds, crs=crs, resolution=dx)
chunk_shape = 3600, 3600  # dask chunking for the big dataset
big_ds = xr_zeros(big_box, chunks=chunk_shape).rename("big")
chunk_shape = 30, 30  # reused here as the tile shape
tiles = GeoboxTiles(big_box, chunk_shape)
# Choosing an interior (non-edge) tile
tile_idx = (2, 2)
tile_box = tiles[tile_idx]
tile_ds = xr_zeros(tile_box, chunks=-1).rename("small")
# Extract geotransform properties
# Critical metadata we need to save
big_geotransform = big_box.affine
small_geotransform = tile_box.affine
# Going to make the argument that only the big box affine needs to be saved,
# along with the number of lats/lons and the tile index
# Extract the affine transformation parameters for the big box
# Critical metadata to have the geotransform saved
a, b, c, d, e, f = big_geotransform[0:6]
# Calculate the center of the first pixel
# Critical metadata that must be prescribed; here we assume pixel centers
# In CF-conventions can derive this from lat_bounds and lon_bounds
pixel_center_x = c + a / 2 # big_ds_lon_start
pixel_center_y = f + e / 2 # big_ds_lat_start
# Generate latitude and longitude arrays representing the center of the pixels
# Critical metadata: the number of lats/lons must be known
big_ds_derived_latitudes = pixel_center_y + e * np.arange(len(big_ds.latitude))
big_ds_derived_longitudes = pixel_center_x + a * np.arange(len(big_ds.longitude))
# Confirm these match
assert np.array_equal(big_ds_derived_latitudes, big_ds.latitude.values), "Arrays are not exactly equal"
assert np.array_equal(big_ds_derived_longitudes, big_ds.longitude.values), "Arrays are not exactly equal"
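Worth noting before relying on exact equality: the derivation above is just the standard GDAL-style GeoTransform evaluated at pixel centers, and two algebraically identical ways of writing it already need not agree bitwise. A self-contained numpy sketch (hardcoded stand-ins for `big_box.affine`, not the actual odc-geo objects):

```python
import numpy as np

# Hardcoded stand-ins for big_box.affine on a north-up grid (a = dx,
# e = -dx; c, f are the top-left corner) -- not the actual odc-geo objects.
dx = 10 / 3600
a, c = dx, -82.0
e, f = -dx, 13.0
cols = np.arange(4)
rows = np.arange(4)

# Two algebraically identical ways to place pixel centers:
# (corner + half step) + step * index  vs  corner + step * (index + 0.5)
lon_v1 = (c + a / 2) + a * cols
lon_v2 = c + a * (cols + 0.5)
lat_v1 = (f + e / 2) + e * rows
lat_v2 = f + e * (rows + 0.5)
# Equal to within a ULP or two, but not guaranteed bitwise-identical.
```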
This should prove my point above. The issue indeed arises when I try to place a tile back into the original big_ds, as Ryan does.
# Attempting to align causes NaNs to be filled into the dataset
big_ds_aligned, _ = xr.align(big_ds, tile_ds, join="right")
big_ds_aligned.load().plot()
# Let's look at how far off they actually are.
# Calculate the offsets, in pixels, for the extracted tile using the
# small geotransform -- though with the tile index information we
# wouldn't need to do it this exact way
x_offset = int((small_geotransform.c - big_geotransform.c) / big_geotransform.a)
y_offset = int((big_geotransform.f - small_geotransform.f) / big_geotransform.e)
assert np.array_equal(big_ds.latitude.values[y_offset:y_offset+chunk_shape[0]], tile_ds.latitude.values), "Arrays are not exactly equal"
assert np.array_equal(big_ds.longitude.values[x_offset:x_offset+chunk_shape[1]], tile_ds.longitude.values), "Arrays are not exactly equal"
---------------------------------------------------------------------------
AssertionError Traceback (most recent call last)
Cell In[95], line 9
6 x_offset = int((small_geotransform.c - big_geotransform.c) / big_geotransform.a)
7 y_offset = int((big_geotransform.f - small_geotransform.f) / big_geotransform.e)
----> 9 assert np.array_equal(big_ds.latitude.values[y_offset:y_offset+chunk_shape[0]], tile_ds.latitude.values), "Arrays are not exactly equal"
10 assert np.array_equal(big_ds.longitude.values[x_offset:x_offset+chunk_shape[1]], tile_ds.longitude.values), "Arrays are not exactly equal"
AssertionError: Arrays are not exactly equal
They are not equal… but by how much?
not_equal_indices = np.where(big_ds.longitude.values[x_offset:x_offset+chunk_shape[1]] != tile_ds.longitude.values)[0]
big_ds.longitude.values[x_offset:x_offset+chunk_shape[1]][not_equal_indices[0]] - tile_ds.longitude.values[not_equal_indices[0]]
-1.4210854715202004e-14
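That difference is not an arbitrary small number: it is exactly one ULP (unit in the last place) for a double of magnitude ~70. A quick numpy check (not from the original notebook):

```python
import numpy as np

# Doubles in [64, 128) are spaced 2**-46 apart, which is exactly the
# 1.4210854715202004e-14 discrepancy seen above for longitudes near -70.
ulp = np.abs(np.spacing(-70.0))
print(ulp)  # 1.4210854715202004e-14
```

In other words, the two coordinate arrays disagree by the smallest step a float64 can represent at that longitude.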
I think what we are skimming over is this part, which DOES work but has extremely small floating-point inaccuracies:
# They appear to match
# This does work, but inexact
selection_of_big_ds = big_ds.sel(latitude=tile_ds.latitude, longitude=tile_ds.longitude, method="nearest", tolerance=1e-13)
To me, this is something the geospatial world has often papered over by reaching for method="nearest".
So my question: can we move forward with accepting the three implicit metadata fields outlined above as a start toward “the simple way that only involves storing a few numbers (rather than a big materialized array)”?
One last edit:
To resolve this specifically, you would always reference the big_ds GeoTransform and the offsets implied by the tile indices and chunk size.
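To sketch that concretely (pure numpy, with hardcoded stand-ins for the odc objects; the grid size is the one implied by the example bbox at 10-arc-second resolution): derive the big coordinate arrays once from the GeoTransform, then take the tile's coordinates as a slice at the offsets implied by the tile index and tile shape. Because it is a slice of the one derived array, equality is bitwise by construction and no tolerance is needed.

```python
import numpy as np

# Hardcoded stand-ins for big_box.affine and the grid size implied by
# the (-82, -56, -34, 13) bbox at dx = 10/3600 degrees.
dx = 10 / 3600
a, c = dx, -82.0    # x step, x corner
e, f = -dx, 13.0    # y step, y corner
n_lat, n_lon = 24840, 17280

# Derive the full coordinate arrays once, from the GeoTransform alone.
big_lat = (f + e / 2) + e * np.arange(n_lat)
big_lon = (c + a / 2) + a * np.arange(n_lon)

# Offsets implied by the tile index and tile shape -- no small
# geotransform needed.
tile_idx, tile_shape = (2, 2), (30, 30)
y0 = tile_idx[0] * tile_shape[0]
x0 = tile_idx[1] * tile_shape[1]

# Slices of the single derived array: exact, no "nearest" required.
tile_lat = big_lat[y0:y0 + tile_shape[0]]
tile_lon = big_lon[x0:x0 + tile_shape[1]]
```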