If you just want the full notebook, it’s here: https://notebooksharing.space/view/c6c1f3a7d0c260724115eaa2bf78f3738b275f7f633c1558639e7bbd75b31456

## Context

Conservative regridding is an important and expensive computational operation in climate science. Unlike nearest-neighbor interpolation or reprojection from one CRS to another, conservative regridding must account for the full geometry of both the source and target grids. Currently our main tool for conservative regridding is xESMF.

Although it is usually discussed in a different context, conservative regridding is fundamentally similar to regional aggregation, as performed for example by the xagg package; both methods require consideration of the cell / region geometry.

To use GIS terminology, regridding is a many-to-many spatial join operation.
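The geometric primitive behind that join is polygon intersection: each weight is derived from the overlap area between a grid cell and a region. A minimal sketch with shapely (the boxes here are hypothetical):

```python
from shapely.geometry import box

cell = box(0, 0, 1, 1)       # a 1x1 degree grid cell
region = box(0.5, 0, 2, 1)   # a region overlapping half of it

# the overlap area is the raw ingredient for a conservative weight
overlap = cell.intersection(region)
print(overlap.area)  # 0.5
```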

## Goals

We generally rely on packages for regridding, but none of them has the performance and features that users really need. (For example, xESMF still cannot run with Dask Distributed.) Moreover, xESMF is a heavy dependency; it won’t run on Windows, for example. There is an appetite for alternatives.

Meanwhile, GeoPandas has recently gained a lot of great functionality and performance enhancements, meaning it could be suitable for performing this join. I wanted to explore coding up this sort of regridding from scratch, without using a regridding package, and see how fast it could go.

**Goal:** Regrid a global precipitation dataset into countries *conservatively*, i.e. by exactly partitioning each grid cell into the precise region boundaries.

**Meta Goal:** Demonstrate that we don’t necessarily need a package for this workflow and showcase some of the new capabilities of GeoPandas and Xarray along the way.

My source dataset is the NASA GPCP dataset, via Pangeo Forge in Zarr format. It’s a global, daily precipitation dataset with 1x1 degree spatial resolution. The target regions are the Natural Earth 50m admin boundaries, loaded via a shapefile.

## Approach

I take a three step approach:

- Represent both the original grid and target grid as GeoSeries with Polygon geometry
- Compute their area overlay and turn it into a sparse matrix
- Perform matrix multiplication on the full Xarray dataset (with a time dimension)
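Together, these steps express regridding as a weight-matrix multiplication: with data of shape `(time, ncells)` and weights of shape `(ncells, nregions)` whose columns each sum to 1, the product is an area-weighted mean per region. A dense toy sketch (the numbers are made up):

```python
import numpy as np

# 2 timesteps x 3 grid cells of precipitation
data = np.array([[1.0, 2.0, 3.0],
                 [4.0, 4.0, 4.0]])

# 3 cells x 2 regions; each column holds one region's
# area fractions and sums to 1
weights = np.array([[0.5, 0.00],
                    [0.5, 0.25],
                    [0.0, 0.75]])

regridded = data @ weights  # shape (2 timesteps, 2 regions)
print(regridded)
```

Note how the second timestep, which is spatially uniform, regrids to the same value in every region; that is the conservative property at work.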

## Some Highlights

Check out the full notebook for details. Here are some highlights.

### The original dataset

```
import xarray as xr

store = 'https://ncsa.osn.xsede.org/Pangeo/pangeo-forge/gpcp-feedstock/gpcp.zarr'
# open lazily with Dask; chunks={} keeps the on-disk chunking
ds = xr.open_dataset(store, engine='zarr', chunks={})
```

### Creating shapely geometries from grid bounds

```
points = grid.stack(point=("latitude", "longitude"))
# map bounds_to_poly over every cell to get one Polygon per point
boxes = xr.apply_ufunc(
    bounds_to_poly,
    points.lon_bounds,
    points.lat_bounds,
    input_core_dims=[("nv",), ("nv",)],
    output_dtypes=[np.dtype('O')],
    vectorize=True,
)
boxes
```
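The `bounds_to_poly` helper is defined in the full notebook; a minimal version, assuming CF-style two-element `[lower, upper]` bounds for each cell, might look like:

```python
from shapely.geometry import Polygon

def bounds_to_poly(lon_bounds, lat_bounds):
    # build a rectangular cell polygon from [min, max] bounds
    lon0, lon1 = lon_bounds
    lat0, lat1 = lat_bounds
    return Polygon([(lon0, lat0), (lon1, lat0), (lon1, lat1), (lon0, lat1)])
```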

### Converting that to a GeoDataFrame

```
grid_df = gp.GeoDataFrame(
    data={"geometry": boxes.values, "latitude": boxes.latitude, "longitude": boxes.longitude},
    index=boxes.indexes["point"],
    crs=crs_orig,
)
```

### Overlaying this with the region geometries and computing area weights

```
overlay = grid_df.overlay(regions_df)
# fraction of each region's total overlap area contributed by each piece
grid_cell_fraction = (
    overlay.geometry.area.groupby(overlay.SOVEREIGNT)
    .transform(lambda x: x / x.sum())
)
```
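The `groupby`/`transform` normalizes each overlap area by the total overlap area of its region, so each region's weights sum to 1. In miniature, with made-up areas:

```python
import pandas as pd

# hypothetical overlap areas: two pieces fall in region A, one in B
overlay = pd.DataFrame({
    "SOVEREIGNT": ["A", "A", "B"],
    "area": [2.0, 6.0, 5.0],
})

# divide each piece's area by its region's total overlap area
frac = overlay.groupby("SOVEREIGNT")["area"].transform(lambda x: x / x.sum())
print(frac.tolist())  # [0.25, 0.75, 1.0]
```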

### Turning this into a sparse Xarray dataset

```
multi_index = overlay.set_index(["latitude", "longitude", "SOVEREIGNT"]).index
df_weights = pd.DataFrame({"weights": grid_cell_fraction.values}, index=multi_index)
ds_weights = xr.Dataset(df_weights)
weights_sparse = ds_weights.unstack(sparse=True, fill_value=0.).weights
```
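The unstack step pivots the long-format weights table into a `(latitude, longitude, SOVEREIGNT)` array, with zeros wherever a cell and a region don't overlap. A dense sketch of the same pivot, using `Dataset.from_dataframe` and hypothetical weights:

```python
import pandas as pd
import xarray as xr

# toy long-format weights: one row per (cell, region) overlap
idx = pd.MultiIndex.from_tuples(
    [(0.5, 0.5, "A"), (0.5, 1.5, "A"), (0.5, 1.5, "B")],
    names=["latitude", "longitude", "SOVEREIGNT"],
)
df = pd.DataFrame({"weights": [1.0, 0.25, 0.75]}, index=idx)

# from_dataframe pivots each MultiIndex level into a dimension;
# combinations with no overlap come back as NaN, so fill with 0
weights = xr.Dataset.from_dataframe(df).weights.fillna(0.0)
print(weights.dims)  # ('latitude', 'longitude', 'SOVEREIGNT')
```

The sparse route in the post avoids materializing all those zeros, which matters once the grid and the number of regions grow.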

### Applying the matrix multiplication

Note that we can’t just use `xr.dot` here, because sparse does not implement `einsum` (pydata/sparse#31).

```
def apply_weights_matmul_sparse(weights, data):
    assert isinstance(weights, sparse.SparseArray)
    assert isinstance(data, np.ndarray)
    data = sparse.COO.from_numpy(data)
    data_shape = data.shape
    # k = nlat * nlon
    n, k = data_shape[0], data_shape[1] * data_shape[2]
    data = data.reshape((n, k))
    weights_shape = weights.shape
    k_, m = weights_shape[0] * weights_shape[1], weights_shape[2]
    assert k == k_
    weights_data = weights.reshape((k, m))
    regridded = sparse.matmul(data, weights_data)
    assert regridded.shape == (n, m)
    return regridded.todense()


precip_regridded = xr.apply_ufunc(
    apply_weights_matmul_sparse,
    weights_sparse,
    precip_in_mem,
    join="left",
    input_core_dims=[["latitude", "longitude", "SOVEREIGNT"], ["latitude", "longitude"]],
    output_core_dims=[["SOVEREIGNT"]],
    dask="parallelized",
    meta=[np.ndarray((0,))],
)

# takes < 10s to regrid over 9000 timesteps
precip_regridded.load()
```
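The flatten-then-matmul trick above is equivalent to contracting over both spatial dimensions at once, which is what `xr.dot` would have done if sparse supported it. A dense numpy check of that equivalence (shapes chosen arbitrarily):

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.random((5, 3, 4))      # (time, lat, lon)
weights = rng.random((3, 4, 2))   # (lat, lon, region)

# contract over lat and lon in a single einsum...
expected = np.einsum("tij,ijr->tr", data, weights)

# ...or flatten the spatial dims and use an ordinary matmul,
# as apply_weights_matmul_sparse does
result = data.reshape(5, 12) @ weights.reshape(12, 2)

print(np.allclose(expected, result))  # True
```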

### Plot some data

```
precip_regridded.sel(SOVEREIGNT="Italy").resample(time="MS").mean().plot()
```

## Thoughts

I think this is a promising way forward for regridding. It removes the xESMF dependency. The challenge will be scaling it up to ultra-high-resolution global grids with millions of points. I will explore that in a follow-up post.