Conservative Region Aggregation with Xarray, Geopandas and Sparse

If you just want the full notebook, it’s here: https://notebooksharing.space/view/c6c1f3a7d0c260724115eaa2bf78f3738b275f7f633c1558639e7bbd75b31456

Context

Conservative regridding is an important and expensive computational operation in climate science. As opposed to nearest-neighbor interpolation or reprojecting from one CRS to another, conservative regridding needs to account for the full geometry of the source and target grids. Currently our main tool for conservative regridding is xESMF.

Although it is usually discussed in a different context, conservative regridding is fundamentally similar to regional aggregation, as performed for example by the xagg package; both methods require consideration of the cell / region geometry.

To use GIS terminology, regridding is a many-to-many spatial join operation.

Goals

We generally rely on packages for regridding, but none of the existing packages has the performance and features that users really need. (For example, xESMF still cannot run with Dask distributed.) Moreover, xESMF is a heavy dependency; it won’t run on Windows, for example. There is an appetite for alternatives.

Meanwhile, GeoPandas has recently gained a lot of great functionality and performance enhancements, meaning that it could be suitable for doing this join. I wanted to explore coding up this sort of regridding from scratch, without using a regridding package, and see how fast it could go.

Goal: Regrid a global precipitation dataset into countries conservatively, i.e. by exactly partitioning each grid cell into the precise region boundaries.

Meta Goal: Demonstrate that we don’t necessarily need a package for this workflow and showcase some of the new capabilities of GeoPandas and Xarray along the way.

My source dataset is the NASA GPCP dataset, via Pangeo Forge in Zarr format. It’s a global, daily precipitation dataset with 1x1 degree spatial resolution. The target regions are the Natural Earth 50m admin boundaries, loaded via a shapefile.

Approach

I take a three-step approach:

  • Represent both the original grid and target grid as GeoSeries with Polygon geometry
  • Compute their area overlay and turn it into a sparse matrix
  • Perform matrix multiplication on the full Xarray dataset (with a time dimension)

Some Highlights

Check out the full notebook for details. Here are some highlights.

The original dataset


import numpy as np
import xarray as xr

store = 'https://ncsa.osn.xsede.org/Pangeo/pangeo-forge/gpcp-feedstock/gpcp.zarr'
ds = xr.open_dataset(store, engine='zarr', chunks={})

Creating shapely geometries from grid bounds
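The helper bounds_to_poly is defined in the full notebook; below is a minimal sketch of what it could look like, assuming the dataset carries lat_bounds / lon_bounds variables (each with an nv vertex dimension) and ignoring any antimeridian handling the real notebook may need.

from shapely.geometry import box

# keep only the cell-bound variables of the source grid (sketch; see notebook)
grid = ds[["lat_bounds", "lon_bounds"]]

def bounds_to_poly(lon_bounds, lat_bounds):
    # build a rectangular polygon from the min/max of one cell's bounds
    return box(lon_bounds.min(), lat_bounds.min(), lon_bounds.max(), lat_bounds.max())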

points = grid.stack(point=("latitude", "longitude"))
boxes = xr.apply_ufunc(
    bounds_to_poly,
    points.lon_bounds,
    points.lat_bounds,
    input_core_dims=[("nv",),  ("nv",)],
    output_dtypes=[np.dtype('O')],
    vectorize=True
)
boxes

Converting that to a GeoDataframe

import geopandas as gp

grid_df = gp.GeoDataFrame(
    data={"geometry": boxes.values, "latitude": boxes.latitude, "longitude": boxes.longitude},
    index=boxes.indexes["point"],
    crs=crs_orig  # the source grid's CRS; for a lat/lon grid this would be e.g. EPSG:4326
)
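The regions_df used below is a GeoDataFrame of the Natural Earth 50m admin-0 country boundaries. A rough sketch of how it might be loaded, and how both layers might be put into a common equal-area CRS before computing areas (the file path and CRS choice here are illustrative, not necessarily what the notebook uses):

# Natural Earth 50m admin-0 countries (path is illustrative)
regions_df = gp.read_file("ne_50m_admin_0_countries.shp")

# project both layers to an equal-area CRS before computing areas
crs_area = "ESRI:54009"  # World Mollweide (equal-area); an illustrative choice
regions_df = regions_df.to_crs(crs_area)
grid_df = grid_df.to_crs(crs_area)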

Overlaying this with the region geometries and computing area weights

overlay = grid_df.overlay(regions_df)
grid_cell_fraction = (
    overlay.geometry.area.groupby(overlay.SOVEREIGNT)
    .transform(lambda x: x / x.sum())
)
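As a quick sanity check (not in the original highlights), the fractions should sum to roughly one within each region:

# every region's cell fractions should sum to ~1
assert np.allclose(grid_cell_fraction.groupby(overlay.SOVEREIGNT).sum(), 1.0)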

Turning this into a sparse Xarray dataset

import pandas as pd

multi_index = overlay.set_index(["latitude", "longitude", "SOVEREIGNT"]).index
df_weights = pd.DataFrame({"weights": grid_cell_fraction.values}, index=multi_index)
ds_weights = xr.Dataset(df_weights)
weights_sparse = ds_weights.unstack(sparse=True, fill_value=0.).weights
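The result is a DataArray with dims (latitude, longitude, SOVEREIGNT) wrapping a sparse.COO array, so only the nonzero weights are stored. You can verify that, for instance, with:

weights_sparse.shape         # (n_latitude, n_longitude, n_regions)
weights_sparse.data.nnz      # number of stored (nonzero) weights
weights_sparse.data.density  # nnz divided by the full dense size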

Applying the matrix multiplication

Note that we can’t just use xr.dot here, because sparse does not implement einsum (see “einsum implementation”, Issue #31, pydata/sparse on GitHub).

import sparse

def apply_weights_matmul_sparse(weights, data):
    assert isinstance(weights, sparse.SparseArray)
    assert isinstance(data, np.ndarray)
    data = sparse.COO.from_numpy(data)
    data_shape = data.shape
    # k = nlat * nlon
    n, k = data_shape[0], data_shape[1] * data_shape[2]
    data = data.reshape((n, k))
    weights_shape = weights.shape
    k_, m = weights_shape[0] * weights_shape[1], weights_shape[2]
    assert k == k_
    weights_data = weights.reshape((k, m))

    regridded = sparse.matmul(data, weights_data)
    assert regridded.shape == (n, m)
    return regridded.todense()

# precip_in_mem is the precip DataArray loaded into memory (see the full notebook)
precip_regridded = xr.apply_ufunc(
    apply_weights_matmul_sparse,
    weights_sparse,
    precip_in_mem,
    join="left",
    input_core_dims=[["latitude", "longitude", "SOVEREIGNT"], ["latitude", "longitude"]],
    output_core_dims=[["SOVEREIGNT"]],
    dask="parallelized",
    meta=[np.ndarray((0,))]
)

# takes < 10s to regrid over 9000 timesteps
precip_regridded.load()

Plot some data

precip_regridded.sel(SOVEREIGNT="Italy").resample(time="MS").mean().plot()

[Figure: time series of monthly-mean regridded precipitation for Italy]

Thoughts

I think this is a promising way forward for regridding. It removes the xESMF dependency. The challenge will be scaling it up to ultra-high-resolution global grids with millions of points. I will explore that in a follow-up post.


Hi Ryan, timely topic as I am doing a deep dive on gridding w/r/t xarray.

Any downsides to this compared to xESMF (besides more lines of code written)?

Wrestling with the no-package vs. package pros and cons. Thanks for sharing this!


I’m sure there are many downsides compared to xESMF. I have not attempted to enumerate the pros and cons yet. I’m still at the exploratory stage. If you’re interested in this topic, I’d love to hear your thoughts!

For the case of interpolation between rectilinear grids (even on the sphere), you can factorize regridding along each axis. This is less general but makes the entire calculation much simpler, because it’s feasible to store interpolation weights as dense matrices and to use dense matrix multiplication.
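As a rough illustration of that separable idea (not the actual Xarray-Beam code), per-axis dense weight matrices can be contracted one axis at a time, e.g. with einsum; all names and shapes here are hypothetical:

import numpy as np

# data:  (time, nlat_in, nlon_in)
# w_lat: (nlat_in, nlat_out)   1-D interpolation weights for latitude
# w_lon: (nlon_in, nlon_out)   1-D interpolation weights for longitude
regridded = np.einsum("tij,ik,jl->tkl", data, w_lat, w_lon)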

I have some code doing this at scale with Xarray-Beam that we’ve been running on 0.25x0.25 degree inputs. We should be able to open source it soon if there is interest.


This work looks really great @rabernat! We also have been working on a Python package for what I’d call grid-to-polygon area-weighted intersections or aggregations, based on the methods presented here: R-tree Spatial Indexing with Python – Geoff Boeing. The repo is available here: Water Mission Area / nhgf / ToolsTeam / gdptools · GitLab (usgs.gov). Still in development. There are 2 workflows available: 1) using Mike Johnson’s openDAP catalog as a source for gridded data (see: https://mikejohnson51.github.io/opendap.catalog/cat_params.json) and 2) based on a user-defined gridded source. Hope to have more use cases available soon in the “preliminary” documentation here: Simple openDAP Catalog example. — gdptools


Thanks! @rabernat is the zarr URL meant to be generally accessible? I can’t tell if my tooling is off, or if I need to be on a particular system?


Welcome Michael!

The data should be 100% public. They are stored on Open Storage Network and should be accessible anywhere on the internet. To verify you can access them, take python out of the loop and try running

$ curl https://ncsa.osn.xsede.org/Pangeo/pangeo-forge/gpcp-feedstock/gpcp.zarr/.zgroup

from the command line. You should see

{
    "zarr_format": 2
}

Great, thanks a lot! I’m totally flailing until I get my tooling practice in - this is a whole example where I can see my way through, to compare some stuff in R. I’m inspired to finally get my Python env set up.


Getting a new python environment set up unfortunately can be tricky. If you want something that just works out of the box, check out Pangeo Docker Images (you can use them locally or in the cloud)
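For example, a hypothetical invocation (the image tag, port, and command here are just an illustration) might look like:

$ docker run -it --rm -p 8888:8888 pangeo/pangeo-notebook:latest jupyter lab --ip 0.0.0.0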


Thanks for sharing this @rabernat. That’s the kind of use case I’d like to consider for vectorized Python bindings of s2geometry and/or s2geography, to be eventually (hopefully) integrated with Geopandas (and Xarray too, via a geo-extension, why not?!).

This general approach is a nice complement to @shoyer’s special case for rectilinear grids. However, a solution based on shapely (GEOS) planar geometries is still sub-optimal for lat / lon grids where the data projection step may be expensive. Also, naive question: do the polygon shapes distorted by the equal-area projection, combined with their limited number of vertices (grid and/or region geometries) and/or their extent, have any significant effect on the accuracy of the computed area weights?

s2geometry has a lot of features that we could potentially leverage in Python (vectorized) bindings. For example, it provides a GetOverlapFractions function that seems well fitted for this purpose.


If it is acceptable to perform conservative aggregation of arbitrary polygons at a given finite resolution, I think that something based on s2geometry (or H3) cells may offer great speed-ups. Both libraries support large to very fine spatial resolutions (cf. @darothen’s comment).

Like xESMF, a solution based on s2geometry or h3 would also introduce some heavy dependencies, but not heavier than GEOS used by shapely.

This is a great question. I don’t know enough about geography and projections to answer it. In the notebook I did a check that the country areas are preserved. But I suppose the distortions in geometry could affect the weights in other ways.

So for this use case, you are suggesting we first go from lat-lon grid to an arbitrary super high-resolution S3 or H3 grid, then to regions? The intermediate step would introduce some resampling error, and would require a two-stage regridding. Why are you so sure this would be faster? I’d love to see an example of that.

Yeah I wrote that a bit too quickly and I’m not sure of anything actually :slight_smile:

With S2 you can approximate any arbitrary region as a union of cells, i.e., S2CellUnion via S2RegionCoverer, with more or less precision (e.g., examples here). S2CellUnion provides the API to compute the area of a cell union and compute the intersection between two cell unions, which I thought would allow some optimization vs. the same computation based on the exact polygon geometries (at the expense of some loss in precision, which can be controlled by parameters). I haven’t looked more into it, though, so I might be completely wrong!

Thank you for sharing! The low-dependency calculation of grid weights for arbitrary geometries is very useful.

A comment on the final weight application (which I know isn’t the key part of this demo): I haven’t been able to see much of a performance benefit from using sparse matrices (on my setup, it takes ~20s to compute the dot product with your custom ufunc). I have found that using dask.array.tensordot on dense arrays (with a bit more massaging) gives a faster solution (~3x faster in my case).

# at the top of your "Perform Matrix Multiplication" section
import dask.array
import numpy as np

weight_da = ds_weights.unstack(fill_value=0).weights  # no sparse

good_lats = np.isin(precip.latitude.values, weight_da.latitude.values)
precip = precip.isel(latitude=good_lats)  # to match lat/lon dims of weights and precip data
precip_in_mem = precip.compute().chunk({"time": "10MB"})
agged_td = dask.array.tensordot(precip_in_mem.data, weight_da, axes=2).compute()
precip_regridded_td = xr.DataArray(data=agged_td, coords=[precip_in_mem['time'], weight_da['SOVEREIGNT']])

Might just be my particular setup, or a quirk of dask better optimizing tensordot vs. a generic ufunc.


This would be a bigger problem if you reprojected the grid rather than the country geometries. Large grid cells defined by only 4 planar points tend to get messed up when reprojected, and you should densify them before doing that. You can check pyproj’s solution for reprojecting bounding boxes.

Also, I should note that a lot of work on polygon-to-polygon interpolation has been done in the Tobler package, and some of its internals may give a bit of speedup over overlay (which comes with some overhead) if you are interested only in the weights matrix.
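For example, if you only need the aggregated values rather than the weights, tobler’s area-weighted interpolation could be sketched roughly as below; treat the exact signature as an assumption, and note this presumes grid_df carries a precip column for a single time step:

from tobler.area_weighted import area_interpolate

# area-weighted interpolation of an intensive variable from grid cells to regions;
# both GeoDataFrames should share an equal-area CRS
result = area_interpolate(
    source_df=grid_df,
    target_df=regions_df,
    intensive_variables=["precip"],
)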


I slightly extended the notebook from @rabernat to show a map of mean precip by region (with a little help from @chegint and @ocefpaf): https://nbviewer.org/gist/b02042191c94b9882af771f13d95693f


Rich that is awesome! :star_struck:

@benbovy @rabernat Hey, just tagging here. With @annefou and @tinaok we would be keen to explore this more systematically from the DGGS <-> unstructured grids side (and, for myself, region aggregation over “raster” or “vector” data with DGGS in general). Just pasting the link to my thread here, where we hope to pick up the slack a bit: Discrete Global Grid Systems (DGGS) use with Pangeo - #25 by allixender


@rabernat, in your notebook you mention that it would be nice if this just worked:

regridded = xr.dot(ds[var], weights_sparse)

but pointed out that unfortunately einsum had not been implemented in sparse.

There is a PR here that looks promising: einsum by jcmgray · Pull Request #564 · pydata/sparse · GitHub, and the author is looking for testers.

I tried doing:

regridded = sparse.einsum('ij,jk->ik', ds[var], weights_sparse)

but got back:

AttributeError: 'DataArray' object has no attribute 'reshape'

I don’t really know what I’m doing here, so I stopped and posted. :slight_smile:

@rsignell did you try doing

regridded = xr.dot(ds[var], weights_sparse)

with that particular sparse PR installed? Then, if your dataset is backed by a sparse array, xr.dot should dispatch to the sparse version of einsum…

If it’s still not working at that point we should raise an xarray issue.