I think it is also worth mentioning [xarray-contrib/rasterix](https://github.com/xarray-contrib/rasterix) (raster tools for xarray) for the case where the underlying data is raster data.
In addition to the `chunks=-1` and encoding trick proposed above, you can eliminate the need to write the `x` and `y` coordinates altogether and instead compute them on the fly, which reduces the bytes read from and written to cloud storage.
```python
import dask.array as da
import numpy as np
import rasterix
import rioxarray  # noqa: F401 — registers the .rio accessor
import xarray as xr

w = 2**20
h = 2**20
example = xr.Dataset(
    data_vars={
        "variables": (
            ("band", "x", "y"),
            da.random.random((128, w, h), chunks=-1),
        )
    },
    coords={"x": np.linspace(0, 5, w), "y": np.linspace(0, 5, h)},
)
# rasterix can read the GeoTransform from odc or rioxarray's spatial_ref
# convention, so we make sure to write the transform
example = example.rio.write_transform(transform=example.rio.transform(recalc=True))

store1 = "/tmp/test1.zarr"
example.to_zarr(
    store1,
    compute=False,
    mode="w",
    zarr_format=3,
    consolidated=True,
)

store2 = "/tmp/test2.zarr"
example.drop_vars(("x", "y")).to_zarr(
    store2,
    compute=False,
    mode="w",
    zarr_format=3,
    consolidated=True,
)

rt = xr.open_zarr(store1)
rt_wo_xy = xr.open_zarr(store2)
if "spatial_ref" in rt_wo_xy.data_vars:
    # sometimes this coord gets written as a variable
    rt_wo_xy = rt_wo_xy.set_coords("spatial_ref")
rt_wo_xy = rt_wo_xy.pipe(rasterix.assign_index)

# note the exact equality here that we don't get with compression tricks
np.testing.assert_equal(rt_wo_xy["x"].values, rt["x"].values)
np.testing.assert_equal(rt_wo_xy["y"].values, rt["y"].values)
```
A motivating use case is when the `x` and `y` coordinates grow large and we don't want to use the compression trick mentioned above (and accept its slight precision errors).
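To illustrate the kind of precision error involved, here is a minimal sketch, assuming the encoding trick stores the coordinates at reduced precision (float32 used here purely for illustration):

```python
import numpy as np

w = 2**20
x = np.linspace(0, 5, w)

# round-trip through float32, as a lossy coordinate encoding might do
x_lossy = x.astype("float32").astype("float64")

# the error is tiny but nonzero, so exact comparisons against the
# original coordinates fail
max_err = np.abs(x - x_lossy).max()
assert not np.array_equal(x, x_lossy)
```

Computing the coordinates from the GeoTransform instead reproduces them exactly.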
This runs quickly locally, but as soon as you have to write and read many MB of coordinate data to/from cloud storage, storing the `x` and `y` coordinates becomes noticeably slower.
The caveat is that any consumer of this data must compute the `x`/`y` coordinates from the GeoTransform, and rasterix makes this very convenient (along with other niceties!).
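For consumers that can't take on the rasterix dependency, the computation itself is small. A hedged sketch of deriving pixel-center coordinates along one axis of an axis-aligned GDAL-style GeoTransform (the helper name is mine, not part of any library):

```python
import numpy as np

def centers_from_geotransform(origin: float, res: float, n: int) -> np.ndarray:
    """Pixel-center coordinates along one axis, given the axis origin and
    resolution from an axis-aligned GeoTransform. Hypothetical helper."""
    return origin + (np.arange(n) + 0.5) * res

# e.g. a 4-pixel axis starting at 0.0 with 1-unit pixels
centers_from_geotransform(0.0, 1.0, 4)  # array([0.5, 1.5, 2.5, 3.5])
```

rasterix's `assign_index` does this for you (handling rotation, CRS metadata, and index creation) straight from the `spatial_ref` attributes.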