Distributed computation of regridding weights

In the past few months I’ve been trying to write a set of libraries (grid-indexing, grid-weights) that compute conservative regridding weights in a way that is independent of the actual grids, using an approach similar to the one described by @rabernat in Conservative Region Aggregation with Xarray, Geopandas and Sparse:

  1. construct cell boundary polygons
  2. compute weights by computing the fractional area overlap (decomposed in two steps):
    1. eliminate combinations of source / target cells as soon as possible using an R*Tree (the result is a sparse boolean matrix)
    2. compute the area overlap and arrange the weights as a sparse floating-point matrix
  3. perform the regridding operation as a sparse matrix multiplication (using einsum; see the sketch below)
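
To make the last step concrete, here is a minimal, runnable sketch of applying a sparse weights matrix to source data using the pydata/sparse package. The sizes and the random weights matrix are made up for illustration, and the plain matrix product stands in for the einsum-based implementation; this is not the libraries’ actual code.

import numpy as np
import sparse

# toy sizes: 6 source cells regridded onto 4 target cells
n_source, n_target = 6, 4
rng = np.random.default_rng(0)

# stand-in for the sparse weights matrix produced by the fractional-overlap
# computation (rows: target cells, columns: source cells)
weights = sparse.random((n_target, n_source), density=0.5, random_state=42)

# data on the source grid, with an extra leading dimension (e.g. time)
source_data = rng.normal(size=(3, n_source))

# the regridding itself is a sparse matrix multiplication; the real pipeline
# expresses this with einsum so that extra dimensions are handled generically
regridded = (weights @ source_data.T).T  # shape (3, n_target)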

The main difference from the earlier post is that instead of using geopandas I’m relying on the packages in the georust ecosystem and the geoarrow-rs library. Additionally, grid-indexing provides a dask wrapper of the main R*Tree, which in theory allows computing the regridding weights on a dask cluster and thus scaling to large grids. I say in theory because there are still a lot of unsolved problems. The result is two packages: grid-indexing and grid-weights.

This still needs a proper API (see keewis/grid-indexing#35), but for now you’d use it like this:

from grid_indexing.distributed import DistributedRTree
from grid_weights.conservative import conservative_weights

input_ds = ...
chunked_source_geoms = ... # construct source geometries as a dask array
chunked_target_geoms = ... # construct target geometries as a dask array

index = DistributedRTree(chunked_source_geoms)
overlapping_cells = index.query_overlap(chunked_target_geoms)
# write `overlapping_cells` to disk if desired
weights = conservative_weights(
    chunked_source_geoms,
    chunked_target_geoms,
    overlapping_cells,
)
# write `weights` to disk if desired
target_grid = ...
weights_ = ... # wrap weights in an xarray object
regridded = regrid(input_ds, weights_)

For the full example, see this gist.

how this works

indexing

The in-memory index is a variant of an RTree (specifically, an R*Tree), based on the rstar Rust library. It works by computing bounding boxes of the source geometries, which are then used to construct a tree of bounding boxes where each parent node’s box fully contains its child boxes. When querying for the overlap with a polygon, we first compute its bounding box and use that to traverse the tree until we find the overlapping leaf boxes. This gives us a list of candidate polygons, which we then have to manually intersect with the given polygon to eliminate the false positives. The result of this operation is a sparse boolean matrix.
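
The grid-indexing tree itself is implemented in Rust on top of rstar, but the two-stage query (cheap bounding-box filter, then exact intersection) can be sketched in pure Python using shapely’s STRtree; the toy geometries below are made up, and this only illustrates the idea rather than the library’s implementation.

import numpy as np
import sparse
from shapely import STRtree, box

# toy source cells: ten 1°×1° boxes along a line, plus two target cells
source_cells = [box(i, 0.0, i + 1.0, 1.0) for i in range(10)]
target_cells = [box(2.5, 0.2, 4.5, 0.8), box(7.1, 0.1, 8.9, 0.9)]

tree = STRtree(source_cells)

rows, cols = [], []
for target_index, target in enumerate(target_cells):
    # stage 1: bounding-box query against the tree -> candidate source cells
    candidates = tree.query(target)
    # stage 2: exact polygon intersection to eliminate the false positives
    for source_index in candidates:
        if source_cells[source_index].intersects(target):
            rows.append(target_index)
            cols.append(int(source_index))

# arrange the result as a sparse boolean matrix (target cells × source cells)
overlapping_cells = sparse.COO(
    np.array([rows, cols]),
    np.ones(len(rows), dtype=bool),
    shape=(len(target_cells), len(source_cells)),
)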

If we have chunked arrays of polygons as input, the approach becomes slightly different: we take the first array of polygons (the source cell polygons when regridding) and construct an in-memory tree of the chunk boundary polygons, obtained by applying a set union to each chunk and loading the result into memory. Then we also compute the chunk boundary polygons of the other array (the target cell polygons) and query the in-memory tree to figure out which chunk of source cell polygons “interacts” with which chunk of target cell polygons. Once that is done, we can create one tree per source chunk and have each “interacting” target chunk query it for overlaps. The result is a chunked sparse boolean matrix.
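
Here is a small, self-contained sketch of that chunk-level pre-filtering, again using shapely instead of the Rust / dask machinery; the grid_cells helper and the chunk layout are invented for the example.

import numpy as np
from shapely import STRtree, box, unary_union

def grid_cells(x0, x1, y0, y1, nx, ny):
    """Toy helper: rectangular cell polygons covering [x0, x1] × [y0, y1]."""
    xs = np.linspace(x0, x1, nx + 1)
    ys = np.linspace(y0, y1, ny + 1)
    return [
        box(xs[i], ys[j], xs[i + 1], ys[j + 1])
        for i in range(nx)
        for j in range(ny)
    ]

# each list element stands in for one chunk of a dask array of cell polygons
source_chunks = [grid_cells(0, 10, 0, 10, 5, 5), grid_cells(10, 20, 0, 10, 5, 5)]
target_chunks = [grid_cells(4, 8, 2, 6, 4, 4), grid_cells(14, 18, 2, 6, 4, 4)]

# one boundary polygon per chunk, obtained by a set union of its cells
source_bounds = [unary_union(chunk) for chunk in source_chunks]
target_bounds = [unary_union(chunk) for chunk in target_chunks]

# a small in-memory tree over the source chunk boundaries tells us which
# source chunk “interacts” with which target chunk
chunk_tree = STRtree(source_bounds)
interacting = [
    (int(source_index), target_index)
    for target_index, bounds in enumerate(target_bounds)
    for source_index in chunk_tree.query(bounds, predicate="intersects")
]
print(interacting)  # only these chunk pairs need a per-chunk tree and query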

computing the weights

To compute the weights, we take the source and target polygons as well as the sparse boolean matrix returned by the indexing step, project the polygons to a local area-preserving projection (for accuracy, but also to avoid discontinuities), and compute the area of overlap, which we can immediately divide by the area of the target polygon to get the weights.
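
As an illustration of that per-pair computation (a sketch, not grid-weights’ actual Rust implementation), a local Lambert azimuthal equal-area projection centred on the target cell can be built with pyproj and applied with shapely; the function name and the projection choice here are assumptions.

import pyproj
from shapely import box
from shapely.ops import transform

def fractional_overlap(source_polygon, target_polygon):
    """Area of the source / target intersection divided by the target area,
    computed in a local Lambert azimuthal equal-area projection centred on
    the target cell (to limit distortion and avoid discontinuities)."""
    lon, lat = target_polygon.centroid.x, target_polygon.centroid.y
    local_laea = pyproj.CRS.from_proj4(f"+proj=laea +lat_0={lat} +lon_0={lon}")
    project = pyproj.Transformer.from_crs(
        "EPSG:4326", local_laea, always_xy=True
    ).transform

    source_local = transform(project, source_polygon)
    target_local = transform(project, target_polygon)
    return source_local.intersection(target_local).area / target_local.area

# two toy 1°×1° cells that overlap by roughly a quarter of the target cell
print(fractional_overlap(box(10.0, 0.0, 11.0, 1.0), box(10.5, 0.5, 11.5, 1.5)))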

This approach will, however, not give the desired result for cells on the boundary of a domain or along discontinuities (at least until the 3D index works properly), so we have to normalize the weights by dividing by the sum of the weights for a particular target cell (we’ll have to use some tricks, though, to keep the sparse weights matrix sparse).
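
One possible trick (a sketch, not necessarily what grid-weights will end up doing) is to normalize the stored values of the COO matrix directly, so that the sparsity pattern never changes:

import sparse

def normalize_rows(weights: sparse.COO) -> sparse.COO:
    """Divide every stored weight by the sum of its row (i.e. its target
    cell) without ever materializing a dense matrix."""
    # per-row sums; dense, but only of length n_target
    # (target cells without any overlapping source cell would need extra care)
    row_sums = weights.sum(axis=1).todense()
    # the row index of every stored value
    row_of_value = weights.coords[0]
    normalized = weights.data / row_sums[row_of_value]
    return sparse.COO(weights.coords, normalized, shape=weights.shape)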

Future tasks

The libraries are still experimental, and my tests show that at some point the task graph of the regridding pipeline becomes too big for the scheduler. Thus, there’s still a lot to be done, especially:

  • figuring out how to deal with the spherical / ellipsoidal nature of geographic coordinates (both when searching cells and when computing weights)
  • supporting writing sparse matrices to disk (zarr, in particular)
  • optimizing the task graph to actually make it possible to scale out
  • finding a good public API (see this issue)

However, with all that said I think the demo clearly shows the potential of these libraries.

dealing with the curved nature of geographic coordinates

This is not an issue when dealing with grids that don’t include the coordinate space’s discontinuities (e.g. the dateline / poles for geographic coordinates), but for global data this becomes tricky.

In particular, if we search for neighbours in the original coordinate space, e.g. a standard rectilinear lon / lat grid, cells that are adjacent to the dateline won’t have the same number of neighbours as cells that are far away from the dateline (and this becomes even worse at the poles). pyinterp resolves this by converting the 2D geographic coordinates to 3D cartesian coordinates, after which the search does not have the same geometric problems (at the cost of making polygon intersections more expensive). You could also imagine using a DGGS like H3 to define bins and search that way (but I haven’t fully thought through how you would actually implement that in a robust way, in particular how to choose the bins such that we don’t have to do too many polygon intersections).
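
For reference, the conversion that pyinterp-style approaches rely on is simple: map each lon / lat pair onto the unit sphere, so that neighbour searches no longer see the dateline or the poles as special. The helper below only illustrates that idea and is not code from any of the libraries mentioned.

import numpy as np

def lonlat_to_xyz(lon, lat):
    """Convert geographic coordinates (in degrees) to points on the unit
    sphere; neighbour searches in this 3D cartesian space don’t suffer from
    the dateline / pole discontinuities of the lon / lat plane."""
    lam = np.deg2rad(np.asarray(lon))
    phi = np.deg2rad(np.asarray(lat))
    return np.stack(
        [np.cos(phi) * np.cos(lam), np.cos(phi) * np.sin(lam), np.sin(phi)],
        axis=-1,
    )

# the dateline is no longer special: these two nearby cell centres stay close
print(np.linalg.norm(lonlat_to_xyz(179.5, 0.0) - lonlat_to_xyz(-179.5, 0.0)))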

Also, when computing the weights we have to reproject the polygons to a (local) area-preserving projection to avoid area distortion. Choosing a local projection avoids discontinuities, but once again increases the computational effort. (grid-weights doesn’t do this at the moment, so the weights it calculates will be off; see keewis/grid-weights#5)

writing sparse matrices to disk

If I understand correctly, we still don’t have a good way to write sparse matrices to disk and seamlessly open them back into memory. binsparse-python allows writing sparse arrays by storing the different underlying arrays of the sparse matrix as separate zarr arrays within a special zarr group, but you’d have to use binsparse-python to load these back into memory, and this isn’t as convenient as calling xr.open_dataset(store, engine="zarr") to get a dataset containing sparse matrices.
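
To make the current situation concrete, the round-trip below writes the constituent arrays of a COO matrix to a zarr group by hand and reassembles them on read. This is only a sketch of the kind of manual plumbing that is needed today; it is not the binsparse layout, and the function names are made up.

import numpy as np
import sparse
import zarr

def write_coo(matrix: sparse.COO, store: str) -> None:
    # store each constituent array of the COO matrix as its own zarr array
    zarr.save_group(
        store,
        coords=matrix.coords,
        data=matrix.data,
        shape=np.asarray(matrix.shape),
    )

def read_coo(store: str) -> sparse.COO:
    # reassemble the sparse matrix from the individual zarr arrays
    group = zarr.open_group(store, mode="r")
    return sparse.COO(
        group["coords"][:],
        group["data"][:],
        shape=tuple(int(size) for size in group["shape"][:]),
    )

weights = sparse.random((4, 6), density=0.5)
write_coo(weights, "weights.zarr")
roundtripped = read_coo("weights.zarr")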

cc @TomNicholas, @tomwhite, @maxrjones, @norlandrhagen, @weiji14, @alxmrs, @rsignell

This is incredible work, and there are a lot of details to comment on. Pangeo showcase talk when??

Pangeo showcase talk when??

Perhaps a lightning talk? :wink: Otherwise I’ll start scheduling fall talks in mid-summer, and I think this would be super cool to learn about if there’s interest in presenting.

I definitely want to present this! However, I don’t think I’ll be able to fit this into a lightning talk while going into sufficient detail (if I had an API that allowed me to just demonstrate the library, that would be different), so I’ll have to postpone until the fall schedule (or at least until the next opportunity, whenever that will be).
