If you just want the full notebook, it’s here: https://notebooksharing.space/view/c6c1f3a7d0c260724115eaa2bf78f3738b275f7f633c1558639e7bbd75b31456
Context
Conservative regridding is an important and expensive computational operation in climate science. Unlike nearest-neighbor interpolation or reprojecting from one CRS to another, conservative regridding has to account for the full geometry of the source and target grids. Currently our main tool for conservative regridding is xESMF.
Although it is usually discussed in a different context, conservative regridding is fundamentally similar to regional aggregation, as performed for example by the xagg package: both methods require consideration of the cell / region geometry.
So to use GIS terminology, regridding is a many-to-many spatial join operation.
Goals
We generally rely on packages for regridding, but none of them has the performance and features that users really need. (For example, xESMF still cannot run with Dask distributed.) Moreover, xESMF is a heavy dependency; it won’t run on Windows, for example. There is an appetite for alternatives.
Meanwhile, GeoPandas has recently gained a lot of great functionality and performance enhancements, meaning it could be suitable for doing this join. I wanted to explore coding up this sort of regridding from scratch, without using a regridding package, and see how fast it could go.
Goal: Regrid a global precipitation dataset into countries conservatively, i.e. by exactly partitioning each grid cell into the precise region boundaries.
Meta Goal: Demonstrate that we don’t necessarily need a package for this workflow and showcase some of the new capabilities of GeoPandas and Xarray along the way.
My source dataset is the NASA GPCP dataset, via Pangeo Forge in Zarr format. It’s a global, daily precipitation dataset with 1x1 degree spatial resolution. The target regions are the Natural Earth 50m admin boundaries, loaded via a shapefile.
Approach
I take a three-step approach:
- Represent both the original grid and target grid as GeoSeries with Polygon geometry
- Compute their area overlay and turn it into a sparse matrix
- Perform matrix multiplication on the full Xarray dataset (with a time dimension)
Some Highlights
Check out the full notebook for details. Here are some highlights.
The original dataset
import numpy as np
import xarray as xr

# Open the GPCP Zarr store lazily; chunks={} gives dask arrays using the on-disk Zarr chunks
store = 'https://ncsa.osn.xsede.org/Pangeo/pangeo-forge/gpcp-feedstock/gpcp.zarr'
ds = xr.open_dataset(store, engine='zarr', chunks={})
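The grid variable in the next snippet just carries the static grid information; something along these lines builds it (a sketch, reusing the lat_bounds / lon_bounds variables referenced below):

# Keep only the cell-bounds variables (plus their latitude / longitude
# coordinates) and load them into memory; the data variables stay behind.
grid = ds[["lat_bounds", "lon_bounds"]].load()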
Creating shapely geometries from grid bounds
points = grid.stack(point=("latitude", "longitude"))
boxes = xr.apply_ufunc(
    bounds_to_poly,
    points.lon_bounds,
    points.lat_bounds,
    input_core_dims=[("nv",), ("nv",)],
    output_dtypes=[np.dtype('O')],
    vectorize=True
)
boxes
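For reference, bounds_to_poly is a small helper along these lines (a minimal sketch, assuming each bounds array holds the [lower, upper] cell edges in degrees):

from shapely.geometry import Polygon

def bounds_to_poly(lon_bounds, lat_bounds):
    # Build the rectangular polygon for one grid cell from its lon/lat edges
    return Polygon([
        (lon_bounds[0], lat_bounds[0]),  # lower-left
        (lon_bounds[1], lat_bounds[0]),  # lower-right
        (lon_bounds[1], lat_bounds[1]),  # upper-right
        (lon_bounds[0], lat_bounds[1]),  # upper-left
    ])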
Converting that to a GeoDataFrame
import geopandas as gp

grid_df = gp.GeoDataFrame(
    data={"geometry": boxes.values, "latitude": boxes.latitude, "longitude": boxes.longitude},
    index=boxes.indexes["point"],
    crs=crs_orig
)
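The regions_df and crs_orig used here come from the setup cells; roughly (the file path and column subset are illustrative, assuming the Natural Earth 50m admin-0 shapefile mentioned earlier):

crs_orig = "EPSG:4326"  # the GPCP grid is plain latitude / longitude

# Natural Earth 50m admin-0 boundaries; keep just the country name and geometry
regions_df = gp.read_file("ne_50m_admin_0_countries.shp").to_crs(crs_orig)
regions_df = regions_df[["SOVEREIGNT", "geometry"]]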
Overlaying this with the region geometries and computing area weights
# Intersect every grid-cell polygon with the country polygons
overlay = grid_df.overlay(regions_df)

# For each resulting piece: what fraction of its country's total area does it cover?
grid_cell_fraction = (
    overlay.geometry.area.groupby(overlay.SOVEREIGNT)
    .transform(lambda x: x / x.sum())
)
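A quick sanity check: thanks to the transform, the fractions within each country sum to one.

# Every country's pieces should add up to 100% of its area
assert abs(grid_cell_fraction.groupby(overlay.SOVEREIGNT).sum() - 1.0).max() < 1e-10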
Turning this into a sparse Xarray dataset
import pandas as pd

# Index the weights by (latitude, longitude, SOVEREIGNT)...
multi_index = overlay.set_index(["latitude", "longitude", "SOVEREIGNT"]).index
df_weights = pd.DataFrame({"weights": grid_cell_fraction.values}, index=multi_index)

# ...then unstack that MultiIndex into a sparse 3-D array of weights
ds_weights = xr.Dataset(df_weights)
weights_sparse = ds_weights.unstack(sparse=True, fill_value=0.).weights
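weights_sparse is now a 3-D DataArray backed by a sparse.COO array; since each grid cell overlaps only a handful of countries (and most cells are entirely ocean), it is very sparse. You can confirm with:

# Inspect the dimensions and how sparse the weight array really is
print(weights_sparse.dims, weights_sparse.shape)
print("density:", weights_sparse.data.density)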
Applying the matrix multiplication
Note that we can’t just use xr.dot here: sparse does not implement the einsum machinery that xr.dot relies on (see pydata/sparse issue #31, "einsum implementation"). Instead, we reshape everything to 2-D and do an explicit sparse matrix multiplication inside apply_ufunc.
import sparse

def apply_weights_matmul_sparse(weights, data):
    # weights: sparse (nlat, nlon, n_regions); data: dense (ntime, nlat, nlon)
    assert isinstance(weights, sparse.SparseArray)
    assert isinstance(data, np.ndarray)
    data = sparse.COO.from_numpy(data)
    data_shape = data.shape
    # k = nlat * nlon
    n, k = data_shape[0], data_shape[1] * data_shape[2]
    data = data.reshape((n, k))

    weights_shape = weights.shape
    k_, m = weights_shape[0] * weights_shape[1], weights_shape[2]
    assert k == k_
    weights_data = weights.reshape((k, m))

    # (n, k) @ (k, m) -> (n, m): a weighted sum over grid cells for each region
    regridded = sparse.matmul(data, weights_data)
    assert regridded.shape == (n, m)
    return regridded.todense()
precip_regridded = xr.apply_ufunc(
    apply_weights_matmul_sparse,
    weights_sparse,
    precip_in_mem,
    join="left",
    input_core_dims=[["latitude", "longitude", "SOVEREIGNT"], ["latitude", "longitude"]],
    output_core_dims=[["SOVEREIGNT"]],
    dask="parallelized",
    meta=[np.ndarray((0,))]
)
# takes < 10s to regrid over 9000 timesteps
precip_regridded.load()
Plot some data
precip_regridded.sel(SOVEREIGNT="Italy").resample(time="MS").mean().plot()
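Since the regridded result is just a 2-D (time × country) DataArray, it also converts straight to a pandas DataFrame for further analysis:

# One column per country, indexed by time
df_countries = precip_regridded.to_pandas()
df_countries.head()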
Thoughts
I think this is a promising way forward for regridding. It removes the xESMF dependency. The challenge will be scaling it up to ultra-high-resolution global grids with millions of points. I will explore that in a follow-up post.