Context
We have been working on the GeoZarr spec for quite some time. In the process, I have been confronted with the very different ways that the NetCDF world and the geospatial raster world (GeoTIFF) store the coordinates of data. For more background, see this issue.
To summarize, here’s how it mostly works today:
- NetCDF-style data (including Zarr data written by Xarray) stores all coordinates explicitly, e.g.
longitude=[-180., -179.5, -179., ...]
- Most GeoTIFF data uses the AffineGeoTransform from the GDAL data model; a minimal sketch contrasting the two representations follows this list.
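The values here are hypothetical; Affine is the transform class used by rasterio:
import numpy as np
from affine import Affine

# GeoTIFF / GDAL style: six numbers fully describe the pixel grid
transform = Affine(0.5, 0.0, -180.0,  # pixel width, row rotation, x origin
                   0.0, -0.5, 90.0)   # column rotation, pixel height, y origin

# NetCDF / Xarray style: every coordinate value is materialized explicitly
longitude = np.arange(-180.0, 180.0, 0.5)  # [-180., -179.5, -179., ...]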
Although Xarray originally emerged for working with weather / climate data stored in NetCDF, it has since become the de facto standard container for gridded geospatial data in Python, including traditional geospatial rasters.
The bridge between the geospatial raster / GDAL world and Xarray is RioXarray (which wraps rasterio and interfaces it with the Xarray data model). Rasterio implements the actual logic needed to bridge affine-transform-based coordinates with explicit NetCDF-style coordinates, by simply converting the geometric affine transformation data to explicit numpy arrays.
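Conceptually, the conversion looks something like this (a sketch of the idea, not rioxarray’s exact internals; the helper name is made up):
import numpy as np
from affine import Affine

def transform_to_coords(transform: Affine, width: int, height: int):
    # evaluate the affine transform at each pixel center
    cols = np.arange(width) + 0.5
    rows = np.arange(height) + 0.5
    x = transform.c + cols * transform.a  # assumes no rotation (b == d == 0)
    y = transform.f + rows * transform.e
    return x, y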
For the past year, I’ve been hearing folks like @Michael_Sumner and David Blodgett of USGS explain that this approach is insufficient, because the explicit representation suffers from fundamental inaccuracy issues related to the finite precision of floating-point data types.
Here I’ve developed an example to convince myself that this is indeed a serious problem that needs fixing deep down in the software stack.
The Example - Big Tiled Data Cube
A pretty common use case in our community is to take Landsat / Sentinel data and build some sort of harmonized spatio-temporal datacube on the planetary, continental, or country scale. Because this is generally too big to put in a single GeoTIFF, folks have two options:
- Break the cube up into many smaller GeoTIFFs
- Store the cube as one big Zarr Array with reasonable sized chunks
This example is relevant to option 2.
I am going to use odc.geo to define both the big data cube and the smaller sub-tiles.
from odc.geo.geobox import GeoBox, GeoboxTiles
from odc.geo.xr import xr_zeros
import zarr
import xarray as xr
import warnings
warnings.filterwarnings('ignore')
dx = 1 / 3600  # 1/3600 of a degree, ~30 m resolution at the equator
epsg = 4326
crs = f"epsg:{epsg}"
big_bounds = (-82, -56, -34, 13) # South America
big_box = GeoBox.from_bbox(big_bounds, crs=crs, resolution=dx)
chunk_shape = 3600, 3600
big_ds = xr_zeros(big_box, chunks=chunk_shape).rename("big")
big_ds
Now let’s grab a sub-tile
chunk_shape = 3600, 3600
tiles = GeoboxTiles(big_box, chunk_shape)
tile_idx = (0, 1)
tile_box = tiles[tile_idx]
tile_ds = xr_zeros(tile_box, chunks=-1).rename("small")
tile_ds
This dataset should slot right into the parent array, right?
Wrong!
The coordinates don’t align exactly
try:
    big_ds.sel(latitude=tile_ds.latitude, longitude=tile_ds.longitude)
except KeyError:
    # expected since values are not numerically identical
    pass
# does work, but inexact
big_ds.sel(latitude=tile_ds.latitude, longitude=tile_ds.longitude, method="nearest", tolerance=1e-13)
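To see how small the disagreement actually is, we can compare the overlapping coordinate values directly (a check added here for illustration; tile (0, 1) covers rows 0:3600 and columns 3600:7200 of the parent grid):
import numpy as np
lon_diff = np.abs(big_ds.longitude.values[3600:7200] - tile_ds.longitude.values)
print(lon_diff.max())  # tiny but nonzero, hence the 1e-13 tolerance above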
# attempting to align causes NaNs to be filled into the dataset
big_ds_aligned, _ = xr.align(big_ds, tile_ds, join="right")
big_ds_aligned.load().plot()
Why does this matter? If I create a Zarr dataset from big_ds and then attempt to write tile_ds into it via region="auto", it fails:
memstore = zarr.MemoryStore()
big_ds.to_zarr(memstore, compute=False, consolidated=False)
tile_ds.to_zarr(memstore, region="auto")
# --> KeyError: "Not all values of coordinate 'longitude' in the new array were found in the original store. Writing to a zarr region slice requires that no dimensions or metadata are changed by the write."
Workarounds
There are many places in our stack where we could work around this problem (without solving it at its core). Two that come to mind are:
- We could use logical (rather than physical) coordinates, e.g. tile_ds.to_zarr(memstore, region={"latitude": slice(0, 3600), "longitude": slice(3600, 7200)}), or use the low-level Zarr API to write the data. This works today, but it requires more context; the information about these logical coordinates is not part of the Xarray dataset itself.
- odc.geo.geobox.GeoboxTiles could try to generate the coordinates more carefully, such that they exactly line up with the parent dataset (see the sketch after this list).
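As an illustration of the second workaround (my own sketch, not something odc.geo does today), we can slice the tile’s coordinates directly out of the parent so that the floating-point values are bit-identical:
# replace the tile's recomputed coordinates with exact slices of the parent's
tile_exact = tile_ds.assign_coords(
    latitude=big_ds.latitude[0:3600],
    longitude=big_ds.longitude[3600:7200],
)
tile_exact.to_zarr(memstore, region="auto")  # now succeeds: coordinates match exactly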
Both of these ignore the fundamental problem: explicitly materializing these coordinates as floating-point data is both inefficient and inaccurate.
A Possible Solution
A true solution would involve the following elements:
- Xarray implements a RangeIndex which is generated by essentially three parameters: start, stop, and step. This index would support alignment, subsetting, etc. in a consistent way.
- We define a simple serialization for these indexes that only involves storing a few numbers (rather than a big materialized array), and implement the necessary encoding / decoding pathways in Xarray.
- Eventually, this could be proposed as a CF convention for implicit coordinates. However, a working prototype should come first.
For an MVP, we could limit this to rectilinear affine transforms, which is simpler and separable dimension by dimension.
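To make the idea concrete, here is a hypothetical sketch of what a 1-D RangeIndex could look like (nothing here exists in Xarray today; all names and signatures are illustrative):
from dataclasses import dataclass
import numpy as np

@dataclass(frozen=True)
class RangeIndex:
    # the whole index is three numbers, which is also all we would need to serialize
    start: float
    step: float
    size: int

    def values(self) -> np.ndarray:
        # materialize explicit values only when actually needed
        return self.start + self.step * np.arange(self.size)

    def subset(self, i: int, j: int) -> "RangeIndex":
        # subsetting is exact: only start and size change, step is untouched
        return RangeIndex(self.start + i * self.step, self.step, j - i)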
I’d be very eager to get feedback from @Alex_Leith and @kirill.kzb on this proposal. It seems like solving this problem would resolve a lot of the limitations in using Xarray (and by extension, Zarr) for large geospatial rasters.