Since there appears to be a sufficient number of people interested, I’m opening a separate topic on regridding (Ryan’s topic is focused on conservative regridding, which is important but certainly not the only regridding method).
(I have something like this working in https://github.com/IAOCEA/xarray-healpy, where the name is not that descriptive… it really is just general regridding using a tree and numba to do bilinear interpolation)
IMO this would be a great project for an institution to take ownership of - it requires sustained effort to flesh out, but would certainly be extremely widely used. But anyone interested in that should probably go to the thread linked above.
I will note that while it does work in a few very specific cases, there’s a lot to fix / work on, and I haven’t gotten around to actually verifying that the interpolation does what it should (besides closely looking at plots of the original and the result).
The idea is that most (if not all) interpolation methods can be split into several steps:
1. find the neighbors of a target cell in the source grid (this tends to be pretty expensive in the general case)
2. compute the weights and put them into sparse matrix form
3. apply the weights by performing a sparse matrix multiplication
Steps 1 and 2 are usually combined, but I believe that exposing step 1 separately is useful beyond just interpolation / regridding.
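To make the split concrete, here’s a minimal sketch of steps 2 and 3 using scipy.sparse (the names and the normalization are illustrative, not from any existing library; step 1 would be whatever neighbour search fits the grid):

```python
import numpy as np
import scipy.sparse


def build_weights(rows, cols, raw, n_target, n_source):
    # step 2: assemble the raw per-neighbour weights into a sparse matrix
    # (rows index target cells, cols index the source cells found in step 1)
    weights = scipy.sparse.coo_matrix((raw, (rows, cols)), shape=(n_target, n_source))
    # normalize so the weights for each target cell sum to 1
    norm = np.asarray(weights.sum(axis=1)).ravel()
    return (scipy.sparse.diags(1 / norm) @ weights).tocsr()


def apply_weights(weights, source):
    # step 3: regridding is just a sparse matrix multiplication
    return weights @ source.ravel()
```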
Some additional features I’d like to have:
it should be possible to compute the weights even if the source grid, the target grid or both don’t fit into memory
it should support regridding using cell geometries, not just cell centers / cell bounds
In any case, I’d be happy to collaborate on figuring out how to get there (whether that be by extending an existing library or writing one from scratch), though I will say that I don’t have a lot of prior experience in this space.
the GDALWarp library function already handles a lot of this. I know it doesn’t suit the array-focused community so much, but there’s a lot there, and with some attention on the multidim model a lot could be done here. For my own interest I want to fix some longitude wrapping issues with geolocation arrays used with the warper; that would fix a whole lot of workflow problems we have.
I’ll be trying to get across the way xarray and the C libraries see this space, and work up examples.
(I know next to nothing about GDAL, so my views might be a misconception)
The reason we can’t really rely on esmf / xesmf right now is that they are still tricky to install, even from conda-forge (which made things better, but not perfect), and it is still not possible to compute the weights in parallel on a dask cluster: we’d need to use MPI and a command line program for that, which means we’re back to the old paradigm of transformations between files, i.e. no streaming of the data. It might be possible to resolve both of these issues, but I don’t know how easy that would be (but see also the reasoning given in Conservative Region Aggregation with Xarray, Geopandas and Sparse on why a lighter-weight library might be preferable).
For similar reasons I’m somewhat sceptical about using GDAL for this (though again, I just might not know GDAL well enough): my impression is that GDAL had similar installation issues in the past (not sure about now). More fundamentally, though, I was under the impression that GDAL’s warper works mostly on images, is that right? That would mean rectangular pixels arranged in 2D arrays, which would exclude the DGGS that I’d very much like to support as well, and which can typically only be represented as a 1D array of faces.
To be fully explicit, I think the requirements we have for a general regridding package are (and I might be missing some):
cross-platform
easy to install (conda-forge helps a lot here)
works with dask or other chunked arrays, both when computing the weights and when applying them (see the sketch after this list)
allows regridding between large, potentially larger than memory grids
close to arbitrary cell geometries (my personal motivation for this are DGGS)
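For the “applying them” part, the approach from the Conservative Region Aggregation with Xarray, Geopandas and Sparse post linked above already works: wrap the weights in a sparse-backed DataArray and contract over the source dimension. A rough sketch, with made-up sizes and dimension names:

```python
import numpy as np
import scipy.sparse
import sparse
import xarray as xr

n_source, n_target = 1000, 200
# stand-in for a precomputed sparse weights matrix
weights = scipy.sparse.random(n_target, n_source, density=0.01, format="csr")

weights_da = xr.DataArray(
    sparse.COO.from_scipy_sparse(weights),
    dims=["target_cell", "source_cell"],
)
source_da = xr.DataArray(np.random.rand(n_source), dims="source_cell")

# contracting over the source dimension applies the weights; this also
# works with dask-backed arrays (`dim` is spelled `dims` in older xarray)
regridded = xr.dot(source_da, weights_da, dim="source_cell")
```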
No, it works with geolocation arrays, and for any numeric type: swaths, model output, etc. They can be rectilinear or curvilinear (I expect offset grids aren’t entirely correctly dealt with here, but it goes a long way and does better than a lot of workflows I’ve seen). You set the output extent and dimensions (or resolution) and crs, set the resampling algorithm and configuration options, and it does the rest. This generalizes across image subsampling and warping for regular or curvilinear sources, files, databases, urls in a way I don’t see in any other free software, and with virtualization for various stages with VRT.
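Through the Python bindings that looks roughly like the following (paths and parameter values here are placeholders, not from an actual workflow):

```python
from osgeo import gdal

gdal.Warp(
    "regridded.tif",                    # output
    "source.nc",                        # e.g. a curvilinear source with geolocation arrays
    dstSRS="EPSG:4326",                 # output crs
    outputBounds=(-180, -90, 180, 90),  # output extent
    xRes=0.25,                          # output resolution
    yRes=0.25,
    resampleAlg="bilinear",             # resampling algorithm
)
```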
Installing complex libs is hard but support is good; I don’t have problems now on standard Linux with Python and R if I avoid conda et al. Admittedly I haven’t got esmf going, but I haven’t put effort in yet.
I see DGGS as sitting squarely in a rasterizing-of-polygons setting (with appropriate mesh efficiencies avoiding materializing them as “dumb polygons”), and indeed I would want to see that integrated - but the pieces all exist in core libs.
Coming back to this, I am still somewhat skeptical about pulling in GDAL if we don’t use it otherwise, as it is a very heavy dependency. However, for environments where it is installed anyway it would definitely make sense to use it (and from what I can tell, rioxarray and odc-geo interface with it through rasterio to do reprojection, and odc-geo even has a dask-aware version – I never tried either of these, though).
So for workflows that don’t already involve GDAL that still leaves us with the attempt to write a new library. I am also not sure if it is a good idea to start entirely from scratch, but I can’t really find any software that has the properties I’m looking for (I didn’t do an exhaustive search either, though).
Looking at the problem itself, I think the hardest part is the analysis of the source grid for the neighbor search (for distance-based interpolation algorithms like nearest-neighbor or IDW, a KDTree with an appropriate metric might already be sufficient). Once the neighbor search is done, the implementation of the algorithm to produce a sparse matrix of weights should be relatively straightforward, and the application of the weights already works very well.
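As a sketch of that simpler distance-based case (names are illustrative; this assumes planar coordinates, so on the sphere you’d want a tree with a haversine metric or 3D cartesian points instead):

```python
import numpy as np
import scipy.sparse
from scipy.spatial import KDTree


def idw_weights(source_xy, target_xy, k=4, eps=1e-12):
    # step 1: neighbour search over the source cell centers
    tree = KDTree(source_xy)
    dist, cols = tree.query(target_xy, k=k)
    # step 2: inverse-distance weights, normalized per target cell
    raw = 1.0 / (dist + eps)
    raw /= raw.sum(axis=1, keepdims=True)
    rows = np.repeat(np.arange(len(target_xy)), k)
    return scipy.sparse.coo_matrix(
        (raw.ravel(), (rows, cols.ravel())),
        shape=(len(target_xy), len(source_xy)),
    ).tocsr()
```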
I agree with all this and appreciate the feedback; there’s just a lot that can already be done with GDAL, so maybe starting again isn’t entirely the best way. rasterio is not GDAL, and its widespread use as a stand-in for that role acts as a bit of a mask. But where and how I can or even should say stuff like this is not entirely clear. It’s the same in R: proxy packages set the scene but mask the foundation (sometimes for good reason, downstream flow can take time, but sometimes it doesn’t get visibility when you’re well down the river).
Here’s what I’ve found out since last post (I still don’t feel like an expert on this, though):
regridding algorithms work either by mapping source points to the target / destination grid (s2d in xesmf / esmf), or by mapping destination points to the source grid (d2s). Either way we need to get both grids into the same coordinate system.
For any interpolation algorithm we need to be able to (efficiently) find neighbouring / overlapping grid cells in a given grid
we can’t just use standard tree structures (kdtree, balltree with an appropriate metric): those only allow for searching neighbouring cell centers, but really we’re interested in cell / pixel outlines (the closest cell centers may all be on a single line if the spacing of one axis is much smaller than that of the other axis)
the most general index structure would be a variation of an “RTree”: we approximate the cells as bounding boxes and compare those while descending through the tree, and only at the leaf nodes do we actually perform expensive polygon comparisons (see the sketch after this list)
creating an RTree from the millions of grid cells of the source grid to search for neighbours of the millions of grid cells of the target grid (for d2s) can become expensive / consume too much memory
in that case, we can make use of a “distributed RTree” (there’s a bunch of different ways to implement this in research papers, for example the “DD-RTree” or the “SD-RTree”): divide the source grid into chunks and create a tree for the chunk borders and one tree each for the data within the chunks. If we also divide the target grid into chunks we can find the overlapping source chunks for each target chunk, and use the source chunk’s RTree to find the neighbouring / overlapping cells of each target cell.
for specific grid types we can define a more optimized neighbour search. For example, searching an axis-aligned rectilinear grid is a matter of separately performing a lookup / linear interpolation of the grid coordinates for each axis (this becomes a bit more complicated if the poles / date line are involved):
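a minimal illustration for one axis (made-up values, ignoring the complications just mentioned):

```python
import numpy as np

# cell edges of one axis of the source grid (monotonically increasing)
source_edges = np.linspace(0, 10, 11)  # 10 cells
target_x = np.array([0.3, 4.2, 9.9])
# index of the source cell containing each target coordinate
cell_index = np.searchsorted(source_edges, target_x, side="right") - 1
```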
a lot of the complexity of RTrees / distributed RTrees can be avoided by combining a shuffle with the hierarchical nature of DGGS (an additional advantage is that there are no discontinuities, i.e. the poles / the dateline are already taken into account)
once we have the neighbouring / overlapping grid cells, the computation of the interpolation weights is relatively straightforward (but downscaling by a lot requires special care)
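Here’s a rough sketch of the RTree-based search mentioned above, using shapely’s STRtree (shapely >= 2.0) as the tree and area-overlap fractions as example weights. Cell polygons are assumed to be given; constructing them from the grid is the grid-specific part:

```python
import scipy.sparse
import shapely


def overlap_weights(source_cells, target_cells):
    # bounding-box index over the source cell polygons (the "RTree" part)
    tree = shapely.STRtree(source_cells)
    rows, cols, raw = [], [], []
    for i, cell in enumerate(target_cells):
        # cheap bbox query first, exact polygon overlap only on the candidates
        for j in tree.query(cell, predicate="intersects"):
            frac = shapely.intersection(cell, source_cells[j]).area / cell.area
            if frac > 0:
                rows.append(i)
                cols.append(int(j))
                raw.append(frac)
    return scipy.sparse.coo_matrix(
        (raw, (rows, cols)),
        shape=(len(target_cells), len(source_cells)),
    ).tocsr()
```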
As a summary, I believe efficiently finding the neighbouring / overlapping cells is the bottleneck (so it might have to be written in a fast language / with numba), and it is also useful beyond regridding, so it might make sense to have it as a separate library.
btw yesterday I found out GDAL can warp not only from geolocation-array grids (e.g. a curvilinear netcdf), but it can also warp to one. That cuts out a lot of jiggery pokery with point reprojection and lookup (but obviously not always appropriate given the metric of interest).