What's Next - Software - Regridding

Since there appears to be a sufficient amount of people interested, I’m opening a separate topic on regridding (Ryan’s topic is focused on conservative regridding, which is important but certainly not the only regridding method).

From What's Next - Software - #4 by jbednar by @TomNicholas:

The regridding bullet point was an attempt to nerd-snipe someone(s) into fleshing out the prototype that Ryan showed is possible (here Conservative Region Aggregation with Xarray, Geopandas and Sparse).

@keewis also said

(I have something like this working in https://github.com/IAOCEA/xarray-healpy, where the name is not that descriptive… it really is just general regridding using a tree and numba to do bilinear interpolation)

IMO this would be a great project for an institution to take ownership of - it requires sustained effort to flesh out, but would certainly be extremely widely used. But anyone interested in that should probably go to the thread linked above.

I will note that while does work in a few very specific cases, there’s a lot to fix / work on, and I haven’t gotten around to actually verifying that the interpolation does what it should (besides closely looking at plots of the original and the result).

The idea is that most (if not all) interpolation methods can be split into several steps:

  1. find the neighbors of a target cell in the source grid (this tends to be pretty expensive in the general case)
  2. compute the weights and put them into sparse matrix form
  3. apply the weights by performing a sparse matrix multiplication

Step 1 and 2 are usually combined, but I believe that exposing step 1 separately is useful beyond just interpolation / regridding.

Some additional features I’d like to have:

  • it should be possible to compute the weights even if the source grid, the target grid or both don’t fit into memory
  • it should support regridding using cell geometries, not just cell centers / cell bounds

In any case, I’d be happy to collaborate on figuring out how to get there (whether that be by extending a existing library or writing one from scratch), though I will say that I don’t have a lot of prior experience in this space.

1 Like

the GDALWarp library function already handles a lot of this, I know it doesn’t suit the array-focused community so much but there’s a lot there and with some attention on the multidim model a lot could be done here. For my interest I want to fix some longitude wrapping issues with geolocation arrays used with the warper, it would fix a whole lot of workflow problems we have.

I’ll be trying to get across the way xarray and c sees this space and work examples

(I know next to nothing about GDAL, so my views might be a misconception)

The reason why right now we can’t really rely on esmf / xesmf is that it is still tricky to install them, even from conda-forge (which made things better but not perfect), and it is still not possible to compute the weights in parallel on a dask cluster: we’d need to use MPI and a command line program for this, which means we’re back to the old paradigm of transformations between files – i.e. no streaming of the data. It might be possible to resolve both of these issues, but I don’t know how easy that would be (but see also the reasoning given in Conservative Region Aggregation with Xarray, Geopandas and Sparse on why a lighter-weight library might be preferable).

For similar reasons I’m somewhat sceptical about using GDAL for this (though again, I just might not know GDAL well enough): my conception is that GDAL had similar issues with installing in the past (not sure about now). More fundamentally, though, I was under the impression that GDAL warp worked mostly on images, is that right? This would mean rectangular pixels arranged in 2D arrays, which would exclude the DGGS that I’d very much like to support as well, and which can typically only be represented as a 1D array of faces.

To be fully explicit, I think the requirements we have for a general regridding packages are (and I might be missing some):

  • cross-platform
  • easy to install (conda-forge helps a lot here)
  • works with dask or other chunked arrays (both when computing weights as well as when applying them)
  • allows regridding between large, potentially larger than memory grids
  • close to arbitrary cell geometries (my personal motivation for this are DGGS)
  • (sufficiently) high performance
1 Like

no it works with geolocation arrays, and for any numeric type , swaths or model output etc. They can be rectilinear or curvilinear (I expect offset grids aren’t entirely correctly dealt with here but it goes a long way and better than a lot of workflows I’ve seen). You set the output extent and dimension (or resolution) and crs, and set the resampling algorithm and configuration options and it does the rest. This generalizes across image subsampling and warping for regular or curvilinear sources, files, databases , urls in a way I don’t see in any other free software, and with virtualization gor various stages with VRT

installing complex libs is hard but support is good, I don’t have problems now on standard linux with Python and R if I avoid conda et al. Admittedly I haven’t got esmf going but I haven’t put effort in yet.

I see DGGS as sitting squarely in a rasterizing of polygons setting (with appropriate mesh efficiencies avoiding materializing as “dumb polygons”), but indeed I would want to see that integrated - but the pieces all exist in core libs :pray:

Coming back to this, I am still somewhat skeptical about pulling in GDAL if we don’t use it otherwise, as it is a very heavy dependency. However, for environments where it is installed anyways it would definitely make sense to use it (and from what I can tell rioxarray and odc-geo interface with it through rasterio to do reprojection, and odc-geo even has a dask-aware version – I never tried either of these, though).

So for workflows that don’t already involve GDAL that still leaves us with the attempt to write a new library. I am also not sure if it is a good idea to start entirely from scratch, but I can’t really find any software that has the properties I’m looking for (I didn’t do an exhaustive search either, though).

Looking at the problem itself, I think the hardest part is the analysis of the source grid for the neighbor search (for distance-based interpolation algorithms like nearest-neighbor or IDW a KDTree with an appropriate metric might be sufficient already). Once the neighbor search is done the implementation of the algorithm to produce a sparse matrix of weights should be relatively straight-forward, and the application of weights works already very well.

1 Like

I agree with wll this and appreciate the feedback, there’s just a lot can already be done with GDAL and so maybe starting again isn’t entirely the best way. rasterio is not GDAL, and its widespread use and stand-in for the role plays as a bit of a mask. But, where and how I can or even should say stuff like this is not entirely clear. It’s the same in R, proxy packages set the scene but mask the foundation (sometimes for good reason, downstream flow can take time but sometimes it doesn’t get visibility when you’re well down the river). :pray: