Since there appears to be a sufficient number of people interested, I’m opening a separate topic on regridding (Ryan’s topic is focused on conservative regridding, which is important but certainly not the only regridding method).
(I have something like this working in https://github.com/IAOCEA/xarray-healpy, where the name is not that descriptive… it really is just general regridding using a tree and numba to do bilinear interpolation)
IMO this would be a great project for an institution to take ownership of - it requires sustained effort to flesh out, but would certainly be extremely widely used. But anyone interested in that should probably go to the thread linked above.
I will note that while it does work in a few very specific cases, there’s a lot to fix / work on, and I haven’t gotten around to actually verifying that the interpolation does what it should (besides closely looking at plots of the original and the result).
The idea is that most (if not all) interpolation methods can be split into several steps:
1. find the neighbors of a target cell in the source grid (this tends to be pretty expensive in the general case)
2. compute the weights and put them into sparse matrix form
3. apply the weights by performing a sparse matrix multiplication
Steps 1 and 2 are usually combined, but I believe that exposing step 1 separately is useful beyond just interpolation / regridding.
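To make the split concrete, here’s a minimal sketch of steps 2 and 3 using scipy.sparse (the names and the normalization are illustrative, not from any existing library; step 1 would be whatever neighbour search fits the grid):

```python
import numpy as np
import scipy.sparse


def build_weights(rows, cols, raw, n_target, n_source):
    # step 2: assemble the raw per-neighbour weights into a sparse matrix
    # (rows index target cells, cols index the source cells found in step 1)
    weights = scipy.sparse.coo_matrix((raw, (rows, cols)), shape=(n_target, n_source))
    # normalize so the weights for each target cell sum to 1
    norm = np.asarray(weights.sum(axis=1)).ravel()
    return (scipy.sparse.diags(1 / norm) @ weights).tocsr()


def apply_weights(weights, source):
    # step 3: regridding is just a sparse matrix multiplication
    return weights @ source.ravel()
```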
Some additional features I’d like to have:
it should be possible to compute the weights even if the source grid, the target grid or both don’t fit into memory
it should support regridding using cell geometries, not just cell centers / cell bounds
In any case, I’d be happy to collaborate on figuring out how to get there (whether that be by extending an existing library or writing one from scratch), though I will say that I don’t have a lot of prior experience in this space.
the GDALWarp library function already handles a lot of this. I know it doesn’t suit the array-focused community so much, but there’s a lot there, and with some attention on the multidim model a lot could be done here. For my own interest I want to fix some longitude wrapping issues with geolocation arrays used with the warper; that would fix a whole lot of workflow problems we have.
I’ll be trying to get across the way xarray and the C libraries see this space, and work up examples.
(I know next to nothing about GDAL, so my views might be a misconception)
The reason we can’t really rely on esmf / xesmf right now is that they are still tricky to install, even from conda-forge (which made things better, but not perfect), and it is still not possible to compute the weights in parallel on a dask cluster: we’d need to use MPI and a command line program for that, which means we’re back to the old paradigm of transformations between files, i.e. no streaming of the data. It might be possible to resolve both of these issues, but I don’t know how easy that would be (but see also the reasoning given in Conservative Region Aggregation with Xarray, Geopandas and Sparse on why a lighter-weight library might be preferable).
For similar reasons I’m somewhat sceptical about using GDAL for this (though again, I just might not know GDAL well enough): my impression is that GDAL had similar installation issues in the past (not sure about now). More fundamentally, though, I was under the impression that GDAL’s warper works mostly on images, is that right? That would mean rectangular pixels arranged in 2D arrays, which would exclude the DGGS that I’d very much like to support as well, and which can typically only be represented as a 1D array of faces.
To be fully explicit, I think the requirements we have for a general regridding package are (and I might be missing some):
cross-platform
easy to install (conda-forge helps a lot here)
works with dask or other chunked arrays, both when computing the weights and when applying them (see the sketch after this list)
allows regridding between large, potentially larger than memory grids
close to arbitrary cell geometries (my personal motivation for this are DGGS)
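For the “applying them” part, the approach from the Conservative Region Aggregation with Xarray, Geopandas and Sparse post linked above already works: wrap the weights in a sparse-backed DataArray and contract over the source dimension. A rough sketch, with made-up sizes and dimension names:

```python
import numpy as np
import scipy.sparse
import sparse
import xarray as xr

n_source, n_target = 1000, 200
# stand-in for a precomputed sparse weights matrix
weights = scipy.sparse.random(n_target, n_source, density=0.01, format="csr")

weights_da = xr.DataArray(
    sparse.COO.from_scipy_sparse(weights),
    dims=["target_cell", "source_cell"],
)
source_da = xr.DataArray(np.random.rand(n_source), dims="source_cell")

# contracting over the source dimension applies the weights; this also
# works with dask-backed arrays (`dim` is spelled `dims` in older xarray)
regridded = xr.dot(source_da, weights_da, dim="source_cell")
```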
No, it works with geolocation arrays, and for any numeric type: swaths, model output, etc. They can be rectilinear or curvilinear (I expect offset grids aren’t entirely correctly dealt with here, but it goes a long way and does better than a lot of workflows I’ve seen). You set the output extent and dimensions (or resolution) and crs, set the resampling algorithm and configuration options, and it does the rest. This generalizes across image subsampling and warping for regular or curvilinear sources, files, databases, urls in a way I don’t see in any other free software, and with virtualization for various stages with VRT.
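Through the Python bindings that looks roughly like the following (paths and parameter values here are placeholders, not from an actual workflow):

```python
from osgeo import gdal

gdal.Warp(
    "regridded.tif",                    # output
    "source.nc",                        # e.g. a curvilinear source with geolocation arrays
    dstSRS="EPSG:4326",                 # output crs
    outputBounds=(-180, -90, 180, 90),  # output extent
    xRes=0.25,                          # output resolution
    yRes=0.25,
    resampleAlg="bilinear",             # resampling algorithm
)
```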
Installing complex libs is hard but support is good; I don’t have problems now on standard Linux with Python and R if I avoid conda et al. Admittedly I haven’t got esmf going, but I haven’t put effort in yet.
I see DGGS as sitting squarely in a rasterizing-of-polygons setting (with appropriate mesh efficiencies avoiding materializing them as “dumb polygons”), and indeed I would want to see that integrated - but the pieces all exist in core libs.
Coming back to this, I am still somewhat skeptical about pulling in GDAL if we don’t use it otherwise, as it is a very heavy dependency. However, for environments where it is installed anyway it would definitely make sense to use it (and from what I can tell, rioxarray and odc-geo interface with it through rasterio to do reprojection, and odc-geo even has a dask-aware version – I never tried either of these, though).
So for workflows that don’t already involve GDAL that still leaves us with the attempt to write a new library. I am also not sure if it is a good idea to start entirely from scratch, but I can’t really find any software that has the properties I’m looking for (I didn’t do an exhaustive search either, though).
Looking at the problem itself, I think the hardest part is the analysis of the source grid for the neighbor search (for distance-based interpolation algorithms like nearest-neighbor or IDW, a KDTree with an appropriate metric might already be sufficient). Once the neighbor search is done, the implementation of the algorithm to produce a sparse matrix of weights should be relatively straightforward, and the application of the weights already works very well.
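As a sketch of that simpler distance-based case (names are illustrative; this assumes planar coordinates, so on the sphere you’d want a tree with a haversine metric or 3D cartesian points instead):

```python
import numpy as np
import scipy.sparse
from scipy.spatial import KDTree


def idw_weights(source_xy, target_xy, k=4, eps=1e-12):
    # step 1: neighbour search over the source cell centers
    tree = KDTree(source_xy)
    dist, cols = tree.query(target_xy, k=k)
    # step 2: inverse-distance weights, normalized per target cell
    raw = 1.0 / (dist + eps)
    raw /= raw.sum(axis=1, keepdims=True)
    rows = np.repeat(np.arange(len(target_xy)), k)
    return scipy.sparse.coo_matrix(
        (raw.ravel(), (rows, cols.ravel())),
        shape=(len(target_xy), len(source_xy)),
    ).tocsr()
```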
I agree with all this and appreciate the feedback; there’s just a lot that can already be done with GDAL, so maybe starting again isn’t entirely the best way. rasterio is not GDAL, and its widespread use as a stand-in for that role acts as a bit of a mask. But where and how I can or even should say stuff like this is not entirely clear. It’s the same in R: proxy packages set the scene but mask the foundation (sometimes for good reason, downstream flow can take time, but sometimes it doesn’t get visibility when you’re well down the river).
Here’s what I’ve found out since last post (I still don’t feel like an expert on this, though):
regridding algorithms work either by mapping source points to the target / destination grid (s2d in xesmf / esmf), or by mapping destination points to the source grid (d2s). Either way we need to get both grids into the same coordinate system.
For any interpolation algorithm we need to be able to (efficiently) find neighbouring / overlapping grid cells in a given grid
we can’t just use standard tree structures (kdtree, balltree with an appropriate metric): those only allow for searching neighbouring cell centers, but really we’re interested in cell / pixel outlines (the closest cell centers may all be on a single line if the spacing of one axis is much smaller than that of the other axis)
the most general index structure would be a variation of an “RTree”: we approximate the cells as bounding boxes and compare those while descending through the tree, and only at the leaf nodes do we actually perform expensive polygon comparisons (see the sketch after this list)
creating an RTree from the millions of grid cells of the source grid to search for neighbours of the millions of grid cells of the target grid (for d2s) can become expensive / consume too much memory
in that case, we can make use of a “distributed RTree” (there’s a bunch of different ways to implement this in research papers, for example the “DD-RTree” or the “SD-RTree”): divide the source grid into chunks and create a tree for the chunk borders and one tree each for the data within the chunks. If we also divide the target grid into chunks we can find the overlapping source chunks for each target chunk, and use the source chunk’s RTree to find the neighbouring / overlapping cells of each target cell.
for specific grid types we can define a more optimized neighbour search. For example, searching an axis-aligned rectilinear grid is a matter of separately performing a lookup / linear interpolation of the grid coordinates for each axis (this becomes a bit more complicated if the poles / date line are involved):
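a minimal illustration for one axis (made-up values, ignoring the complications just mentioned):

```python
import numpy as np

# cell edges of one axis of the source grid (monotonically increasing)
source_edges = np.linspace(0, 10, 11)  # 10 cells
target_x = np.array([0.3, 4.2, 9.9])
# index of the source cell containing each target coordinate
cell_index = np.searchsorted(source_edges, target_x, side="right") - 1
```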
a lot of the complexity of RTrees / distributed RTrees can be avoided by combining a shuffle with the hierarchical nature of DGGS (an additional advantage is that there are no discontinuities, i.e. the poles / the dateline are already taken into account)
once we have the neighbouring / overlapping grid cells, the computation of the interpolation weights is relatively straightforward (but downscaling by a lot requires special care)
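Here’s a rough sketch of the RTree-based search mentioned above, using shapely’s STRtree (shapely >= 2.0) as the tree and area-overlap fractions as example weights. Cell polygons are assumed to be given; constructing them from the grid is the grid-specific part:

```python
import scipy.sparse
import shapely


def overlap_weights(source_cells, target_cells):
    # bounding-box index over the source cell polygons (the "RTree" part)
    tree = shapely.STRtree(source_cells)
    rows, cols, raw = [], [], []
    for i, cell in enumerate(target_cells):
        # cheap bbox query first, exact polygon overlap only on the candidates
        for j in tree.query(cell, predicate="intersects"):
            frac = shapely.intersection(cell, source_cells[j]).area / cell.area
            if frac > 0:
                rows.append(i)
                cols.append(int(j))
                raw.append(frac)
    return scipy.sparse.coo_matrix(
        (raw, (rows, cols)),
        shape=(len(target_cells), len(source_cells)),
    ).tocsr()
```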
As a summary, I believe efficiently finding the neighbouring / overlapping cells is the bottleneck (so it might have to be written in a fast language / with numba), and it is also useful beyond regridding, so it might make sense to have it as a separate library.
btw yesterday I found out GDAL can warp not only from geolocation-array grids (e.g. a curvilinear netcdf), but it can also warp to one. That cuts out a lot of jiggery pokery with point reprojection and lookup (but obviously not always appropriate given the metric of interest).