Resampling Zarr-backed Xarray wth VRTs and GDAL support for Zarr

Howdy howdy :cowboy_hat_face:

We lean heavily on the vrt approach when building large-ish Xarray datasets. We create one vrt per date/band from a list of tiffs/cogs.

We often have the need to reproject multiple Zarr-backed Xarray datasets to a common grid so they can be merged into one dataset. I have two approaches in mind with the constraint of using GDAL’s resampling mechanisms:

  1. We write tiffs/cogs for each chunk/band/date, build vrts with gdalbuildvrt from list of tiffs (with our new bounds and resolution) for each band/date and use rioxarray.open_dataset(path_to_vrt, chunks=(...)). We do this in an adhoc fashion today and it works, but we have to dump tiffs, which feels not-so-ideal.

  2. We use gdalbuildvrt and pass a Zarr object using GDAL’s Zarr reference scheme e.g. 'ZARR:"/vsigs/path_to_zarr/on/gcs":/variable_1:1' for variable_1 and the first date. Unfortunately, GDAL complains about mixed types (x,y in float64; variable in float32) or that NS is in the wrong order (dset.rio.resolution must both be negative to get things to work). One work around is something like gdalwarp -of VRT 'ZARR:"/vsigs/path_to_zarr/on/gcs":/variable_1:1' warped.vrt, but the transform does not seem to be calculated correctly and gdalwarp does not seem to take extent/resolution inputs to specify the transform.

Anyways, I am not looking for GDAL support here, but rather aim to see if anyone else has taken a similar approach and to put it out there for others with a similar curiosity. Any thoughts :pray: ?

1 Like

what’s doing the common gridding? I would hit the zarr band with GDAL 2D and push it through the warper, it can handle multiple disparate inputs and all the memory and write to any grid spec (extent, dimension, crs)

both the warper and zarr have had very recent fixes, some parts that just haven’t been exercised much I think, I’d be keen to look at the gdalwarp and Zarr issue you mention - you can use VRT to fix the input geotransform, and in recent versions do that with vrt://{src}?a_ullr without generating the VRT at all (gdalwarp especially doesn’t have those a_ssign args because that’s a translate job)

1 Like

The common gridding comes from gdalbuildvrt with its extent and resolution arguments, but I am now realizing gdalwarp does not have those because as you say it’s a translate job. In my case, the crs is consistent between input and outputs and by common gridding I am only referring to using a resampling op only.

I guess my question simplifies to given a set Zarr files in the same crs, how can I resample to a common grid creating a vrt or something I can use with rioxarray.open_rasterio to use chunks and therefore Dask.

I think you are saying that I can then push those vrts through gdalwarp in the event the crs is different, yeah?

I’ll look into specifying vrt without building it!

I still think there is a weird issue with Zarr and gdalbuildvrt. I’ll try to build up an example.

It looks like something like:

rioxarray.open_rasterio(
    "vrt://ZARR:/vsigs/path/to/store/group:/variable_1:1?a_ullr=-180,90,180,-90", 
    chunks="auto"
)

might actually work and avoid the need to even bypass making the vrt. Last thing to check is setting the target resolution, which would look something like:

rioxarray.open_rasterio(
    "vrt://ZARR:/vsigs/path/to/store/group:/variable_1:1?a_ullr=-180,90,180,-90&tr=.1,.1", 
    chunks="auto"
)

but option tr is not recognized. I am going to try to update rasterio + gdal since I am on 3.7.0 and some of these options seem to come in 3.8

GDAL 3.8 is not available via conda from what I can tell nor do I see a 3.8 docker image to test easily.

After having dug a bit further, I see you @Michael_Sumner VRT 'tr', 'r', 'srcwin', 'a_gt' for vrt:// connection by mdsumner · Pull Request #8330 · OSGeo/gdal · GitHub. :pray:

Any ideas on easy ways for me to test this without going down the make-from-source path. Or any ideas on time until 3.8 shows up in docker or conda? I see open issue in feedstock so probably not anytime soon.

It actually looks like this works.

gdalbuildvrt -tr .1 .1 test.vrt "vrt://ZARR:/vsigs/store/group:/variable_1:1?a_ullr=-180,90,180,-90"

where I noticed dropping a_ullr results in a Warning 1: gdalbuildvrt does not support positive NS resolution. Skipping.

Then, I can open with

rioxarray.open_rasterio("test.vrt", chunks="auto")

This might do the trick! I’ll try on a large-ish dataset and check the data and report back.

It would be sick if the same thing worked for multidimensional vrt and we only need to make one “vrt” for the entire xarray dataset, but I dead end with rioxarray saying the file format is not supported.

I don’t understand why not use xarray to change the resolution?

And, why doesn’t the Zarr have correct georeferencing? I hope the producers can fix it, and avoid this downstream churn.

It seems like GDAL’s involvement here is kind of unnecessary and I wonder if you should just avoid it - xarray surely doesn’t need GDAL for Zarr subsampling?

I would like to hear feedback about what one is supposed to do when coord arrays have the wrong values for a regular grid (it’s extremely common sadly) and I think the best strategy is to figure out how to fix that in a more Zarr/xarry hative way (exploding Zarr into band slices is a way to leverage the 2D GDAL engine but is a pain).

You are right about needing 3.8.0 for -tr in the vrt:// protocol and it is not yet released.

Leonard, only thing I have ever done in this space really is:

dst_crs = CRS.from_epsg(4326)

xres = (right - left) / dst_width
yres = (top - bottom) / dst_height
dst_transform = affine.Affine(xres, 0.0, left,
                              0.0, -yres, top)

vrt_options = {
    'resampling': Resampling.nearest,
    'crs': dst_crs,
    'transform': dst_transform,
    'height': dst_height,
    'width': dst_width,
}

f = r'thermal_endmembers.vrt'
epsg_to = 4326

with rasterio.open(f) as src:
    print('Source CRS:' +str(src.crs))
    with WarpedVRT(src, **vrt_options) as vrt:
        print('Destination CRS:' +str(vrt.crs))
        with ProgressBar():
            thermal_match = rioxarray.open_rasterio(vrt, chunks=(1, 8000,8000))
            thermal_match.rio.to_raster(r'J:\test.tif')
2 Likes

notebook version for reference

1 Like