Example which highlights the limitations of NetCDF-style coordinates for large geospatial rasters

I have spent quite some time trying to address Ryan’s second call to action and would like to propose a solution.

The conclusion I have come to is that we need to store the following minimum implicit metadata, which allows us to calculate the explicit coordinates:

  • The GeoTransform
  • The number of lats/lons
  • For chunks, the chunk reference index

There should also be the option to store additional information about where the coordinate sits within a grid cell (e.g. center, or elsewhere; right now I see lots of hard-coded assumptions of center). In the CF world there is the optional bounds parameter that would also allow you to derive this.
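
For concreteness, here is a minimal sketch of what those three fields (plus the optional registration flag) might look like once serialized. The key names are hypothetical, not a proposed standard; the values correspond to the South America example below:

# Hypothetical serialization of the three proposed fields (names illustrative only)
implicit_coord_metadata = {
    # Affine coefficients (a, b, c, d, e, f): pixel sizes, rotation terms, origin
    "geotransform": (10 / 3600, 0.0, -82.0, 0.0, -10 / 3600, 13.0),
    "shape": (24840, 17280),          # number of lats, number of lons
    "chunk_index": (2, 2),            # tile index within the chunk grid
    "pixel_registration": "center",   # optional: where in the cell the coordinate sits
}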

Below I show that I can exactly derive the big dataset’s explicit coordinates using just the GeoTransform and the number of lats/lons.

from odc.geo.geobox import GeoBox, GeoboxTiles
from odc.geo.xr import xr_zeros
import xarray as xr
import numpy as np
import warnings
warnings.filterwarnings('ignore')

dx = 10 / 3600  # 10 arc-second pixel size in degrees
epsg = 4326
crs = f"epsg:{epsg}"
big_bounds = (-82, -56, -34, 13)  # South America
big_box = GeoBox.from_bbox(big_bounds, crs=crs, resolution=dx)
chunk_shape = 3600, 3600  # dask chunking for the big dataset
big_ds = xr_zeros(big_box, chunks=chunk_shape).rename("big")


chunk_shape = 30, 30  # re-used here as the tile shape in pixels
tiles = GeoboxTiles(big_box, chunk_shape)

# Choosing a tile closer to the middle of the grid
tile_idx = (2, 2)
tile_box = tiles[tile_idx]
tile_ds = xr_zeros(tile_box, chunks=-1).rename("small")

# Extract geotransform properties
# Critical metadata we need to save
big_geotransform = big_box.affine
small_geotransform = tile_box.affine

# Going to make the argument that only the big box affine needs to be saved
# along with the number of lats/lons and the tile index

# Extract the affine transformation parameters for the big box
# Critical metadata: the geotransform must be saved
a, b, c, d, e, f = big_geotransform[0:6]

# Calculate the center of the first pixel
# Critical metadata: the registration must be declared up front; here we assume center
# In CF conventions this can be derived from lat_bounds and lon_bounds
pixel_center_x = c + a / 2 # big_ds_lon_start
pixel_center_y = f + e / 2 # big_ds_lat_start

# Generate latitude and longitude arrays representing the center of the pixels
# Critical metadata: the number of lats/lons must be known
big_ds_derived_latitudes = np.array([pixel_center_y + e * i for i in range(len(big_ds.latitude))])
big_ds_derived_longitudes = np.array([pixel_center_x + a * i for i in range(len(big_ds.longitude))])

# Confirm these match
assert np.array_equal(big_ds_derived_latitudes, big_ds.latitude.values), "Arrays are not exactly equal"
assert np.array_equal(big_ds_derived_longitudes, big_ds.longitude.values), "Arrays are not exactly equal"
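
As an aside on the bounds route mentioned above: if CF-style lat_bounds/lon_bounds were stored, the registration never needs to be assumed, since any position within a cell can be derived from its two edges. A small illustrative sketch (the bounds array is constructed on the fly here, not read from a file):

# Sketch: a CF bounds variable has shape (n, 2), holding the two edges of each cell
lon_edges = pixel_center_x - a / 2 + a * np.arange(len(big_ds.longitude))
lon_bounds = np.stack([lon_edges, lon_edges + a], axis=1)

# Cell centers fall out as the mean of the two edges
assert np.allclose(lon_bounds.mean(axis=1), big_ds_derived_longitudes)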

This derivation should prove my point above. The issue indeed comes when I try to place a tile back into the original big_ds, as Ryan does.

# Attempting to align causes NaNs to be filled into the dataset

big_ds_aligned, _ = xr.align(big_ds, tile_ds, join="right")

big_ds_aligned.load().plot()

# Let's actually look at how much they are off... 

# Calculating the offsets in terms of pixels for the tile extracted
# using the small geotransform, but we don't need to do this in this exact way
# if I have the tile index information
x_offset = int((small_geotransform.c - big_geotransform.c) / big_geotransform.a)
y_offset = int((big_geotransform.f - small_geotransform.f) / big_geotransform.e)

assert np.array_equal(big_ds.latitude.values[y_offset:y_offset+chunk_shape[0]], tile_ds.latitude.values), "Arrays are not exactly equal"
assert np.array_equal(big_ds.longitude.values[x_offset:x_offset+chunk_shape[1]], tile_ds.longitude.values), "Arrays are not exactly equal"


---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
Cell In[95], line 9
      6 x_offset = int((small_geotransform.c - big_geotransform.c) / big_geotransform.a)
      7 y_offset = int((big_geotransform.f - small_geotransform.f) / big_geotransform.e)
----> 9 assert np.array_equal(big_ds.latitude.values[y_offset:y_offset+chunk_shape[0]], tile_ds.latitude.values), "Arrays are not exactly equal"
     10 assert np.array_equal(big_ds.longitude.values[x_offset:x_offset+chunk_shape[1]], tile_ds.longitude.values), "Arrays are not exactly equal"

AssertionError: Arrays are not exactly equal

They are not equal… but by how much?

not_equal_indices = np.where(
    big_ds.longitude.values[x_offset:x_offset + chunk_shape[1]] != tile_ds.longitude.values
)[0]

big_ds.longitude.values[x_offset:x_offset+chunk_shape[1]][not_equal_indices[0]] - tile_ds.longitude.values[not_equal_indices[0]]

-1.4210854715202004e-14
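
To confirm that the disagreement is nothing more than floating-point noise across the entire slice (a quick sanity check, not part of the proposal):

# Every element agrees to well below coordinate precision, just not bit-for-bit
np.testing.assert_allclose(
    big_ds.longitude.values[x_offset:x_offset + chunk_shape[1]],
    tile_ds.longitude.values,
    rtol=0, atol=1e-12,
)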

I think what we are skimming over is that this part DOES work, but with extremely small floating-point inaccuracies:

# They appear to match
# This does work, but the match is inexact
selection_of_big_ds = big_ds.sel(latitude=tile_ds.latitude, longitude=tile_ds.longitude, method="nearest", tolerance=1e-13)

To me, this is something that the geospatial world has often papered over by implementing the concept of method = “nearest”.

So my question: can we move forward with accepting the three implicit metadata fields I outline above, as a start toward “the simple way that only involves storing a few numbers (rather than a big materialized array)”?

One last edit:

To resolve this specifically, you would always reference the big_ds GeoTransform plus the offsets implied by the tile index and chunk size.
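
As a sketch of what that looks like in practice (assuming the tile index and tile shape are known, as proposed above), integer offsets into the single set of derived coordinates sidestep the floating-point comparison entirely:

# Offsets implied by the tile index and tile shape -- no per-tile geotransform needed
y_offset = tile_idx[0] * chunk_shape[0]
x_offset = tile_idx[1] * chunk_shape[1]

# Integer-position selection against the one canonical coordinate set
selection = big_ds.isel(
    latitude=slice(y_offset, y_offset + chunk_shape[0]),
    longitude=slice(x_offset, x_offset + chunk_shape[1]),
)

# No nearest-neighbour tolerance needed: both sides come from the same few numbers
assert np.array_equal(selection.latitude.values,
                      big_ds_derived_latitudes[y_offset:y_offset + chunk_shape[0]])

Equality here holds by construction, because the tile coordinates are never re-derived from a second geotransform.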
