Conservative Region Aggregation with Xarray, Geopandas and Sparse

@TomNicholas, from this comment (einsum by jcmgray · Pull Request #564 · pydata/sparse · GitHub), I thought dot might not be working yet.

FWIW, I just tried:

regridded = xr.dot(ds[var], weights_sparse)

and it returns:

TypeError: einsum() got an unexpected keyword argument 'dtype'

You’ll have to pass the underlying arrays using .data
sparse.einsum('ij,jk->ik', ds[var].data, weights_sparse.data)

@dcherian, Ooh, closer?

regridded = sparse.einsum('ij,jk->ik', ds[var].data, weights_sparse.data)

returns:

ValueError: total size of new array must be unchanged

so probably just need to study a bit more?

Seems like the low-tech (ds[var] * weights_sparse).sum(weights_sparse.dims) does what you want?
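For concreteness, here is an untested sketch of that pattern with made-up dimension names and sizes (sparse weights mapping a source cell dimension onto a region dimension):

    import numpy as np
    import sparse
    import xarray as xr

    # made-up example: 100 source cells aggregated into 5 regions
    weights_sparse = xr.DataArray(
        sparse.random((100, 5), density=0.1),   # sparse.COO weight matrix
        dims=("cell", "region"),
    )
    field = xr.DataArray(np.random.rand(3, 100), dims=("time", "cell"))

    # broadcast-multiply by the sparse weights, then sum over the source dimension
    regridded = (field * weights_sparse).sum("cell")   # dims: (time, region)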

If there is a way to multiply a dask Array by a sparse matrix, then xESMF can take advantage of this and could, I believe, run using dask distributed. The weight multiplication logic for xESMF lives here.

4 Likes

Hey everyone, I just found this thread as it intersects with work I’ve been doing that involves a significant amount of remapping between various grids. I have been a user of xESMF for a while and have also made ample use of pyresample and various other approaches that use tree-based query structures for resampling.

I’ve recently had a use case that requires conservatively remapping fields from ocean data products on their native grids. I also found that building detailed grid polygons for the source and destination grids within GeoPandas (making use of pygeos, in particular, to quickly define the geometries) is very effective.

In addition to supporting areal overlap (intersection overlays), it is also incredibly useful to perform edge-based intersections between the source and destination grids. These make it possible, for each destination cell edge, to determine how much of that edge lies within each source cell. This opens the possibility of weighted-average resampling of transport quantities (i.e., the transect-based corollary to area-based fields), where each source cell contributes with a weight equal to the length of the segment falling inside it. This is not something I’ve seen highlighted, but it is, I think, a really interesting advantage of moving towards GeoPandas/vector-based grid descriptions, especially for anyone who wants to aggregate transports across line segments (e.g., ocean transports across a strait defined by some LineString). A rough sketch of the idea is below.
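(Untested sketch using plain GeoPandas/Shapely; the cells, the "transport" column, and the transect coordinates are all made up.)

    import geopandas as gpd
    from shapely.geometry import LineString, box

    # made-up source cells (two 1-degree boxes) with a "transport" value on each
    src_cells = gpd.GeoDataFrame(
        {"transport": [10.0, 5.0]},
        geometry=[box(-30, 60, -29, 61), box(-29, 60, -28, 61)],
    )
    # the transect across which we want a length-weighted aggregate
    transect = LineString([(-29.5, 60.5), (-28.5, 60.5)])

    # keep only the cells the transect actually crosses
    hits = src_cells[src_cells.intersects(transect)].copy()

    # clip the transect to each cell; the segment length inside the cell is the weight
    hits["segment"] = hits.geometry.intersection(transect)
    hits["weight"] = hits["segment"].length

    # length-weighted mean of the field along the transect
    transect_mean = (hits["transport"] * hits["weight"]).sum() / hits["weight"].sum()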

Some comments that I have, however, regarding the current discussion:

  1. Dealing with the antimeridian. It just so happens that the source grid cells from the model in my use case are defined at points along the antimeridian, so their bounds cross over it. This becomes problematic for spatial overlays/intersections: when moving to the destination grid (which did not cross the antimeridian), half of each grid box on that edge was not being found during the intersection. The GeoJSON recommendation appears to be to split the offending polygons along the antimeridian. I think any discussion here regarding resampling should consider defining conventions for grid/cell polygons to ensure robustness of the spatial operations:
    The 180th Meridian - macwright.com

  2. It is not clear to me whether there is a huge performance boost for sparse matrix multiplication vs. numpy-groupies (or, generally, group-by) operations for the aggregations. At its core, area-conservative resampling is simply a weighted average: the sum of the weighted values divided by the sum of the weights. However, xESMF and the underlying RegridWeightGen also allow masking of values and/or weights, to better handle grids where masks are involved (e.g., ocean/land masks in particular). I have opted to simply use the GeoPandas spatial operations to compute all the intersections (i.e., the area overlay) and then store those "links" (source index and destination index pairs) together with the overlap area itself. The remapping is then accomplished by a 'sum' aggregation of the weighted data and of the weights (both of which can optionally be masked at resampling time); a sketch of this aggregation is given after this list. The heavy lifting (the intersection) is still done only once, with all links/areas stored in a file. The penalty is that the weights are not precomputed as normalized fractions, so the sum of the weights has to be recomputed each time, but that is an acceptable tradeoff for the freedom to decide later how I want to handle masking.
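Roughly, that aggregation step looks like the following untested sketch (the indices, areas, values, and sizes are made up; numpy-groupies does the group-by sums):

    import numpy as np
    import numpy_groupies as npg

    # hypothetical stored "links": one row per source/destination intersection
    src_idx = np.array([0, 0, 1, 2, 2])            # flat source-cell index
    dst_idx = np.array([0, 1, 1, 1, 2])            # flat destination-cell index
    area    = np.array([0.6, 0.4, 1.0, 0.3, 0.7])  # overlap area of each pair
    n_dst = 3

    values = np.array([10.0, 5.0, 2.0])            # field on the source grid
    mask   = np.isfinite(values)                   # e.g. an ocean/land mask, applied at remap time

    # sum the weighted values and the weights into each destination cell
    valid = mask[src_idx]
    num = npg.aggregate(dst_idx[valid], (values[src_idx] * area)[valid], func="sum", size=n_dst)
    den = npg.aggregate(dst_idx[valid], area[valid], func="sum", size=n_dst)

    remapped = num / den   # area-weighted mean on the destination grid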

The question, then, is whether it is more optimal to go with sparse matrix multiplies vs. optimized forms of group-by aggregation, and whether either of these extends better to dask.

I hope these comments add to the really great work/thinking you all are doing.

Note: Moving to well-defined grids using GeoPandas polygons also opens the door to K-nearest-neighbor and distance-based resampling. However, you really need to be able to use the spatial index for querying; I was able to get the single nearest point, but it wasn’t clear that pygeos supports finding the K nearest elements (maybe in the newest version?) if you wanted to average points within some neighborhood of the desired target grid cell.

1 Like

The question, then, is whether it is more optimal to go with sparse matrix multiplies vs. optimized forms of group-by aggregation, and whether either of these extends better to dask.

I’m not sure a groupby is the right computational match for this problem even though it is a conceptual match.

For one, you need weights, and those weights are sparse and are also a major computational bottleneck (unless reuse is possible, which is true in very many cases).

Second, multiply-add is a pretty optimized operation across the stack (tensordot, einsum, and friends). Whether they play nicely together is another question; the einsum errors earlier in this thread are an example.

For one, you need weights, and those weights are sparse and are also a major computational bottleneck (unless reuse is possible, which is true in very many cases).

Yes, you need weights for all of these cases; I consider those a one-time generation. Note, however, that if you want to apply some form of masking, the stored fractional weights will differ whenever the masks are dynamic, which incurs the weight-generation penalty multiple times. Thus, I see a benefit in simply storing the overlap-area information itself and masking during the aggregation step.

Regarding the optimization: I compared running a conservative remapping directly with xESMF against the numpy-groupies-based aggregation (summing weighted data and weights) and found the timings to be quite comparable. For dask, I think the group-by aggregation operations are also well optimized in general and allow easy aggregation across chunks.

I also saw flox, where it looks like others are really thinking about how to speed up group-by in dask. A rough sketch of how the same weighted aggregation might look with flox is below.
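(Untested, with made-up link data; the variable and dimension names are just illustrative.)

    import flox.xarray
    import numpy as np
    import xarray as xr

    # the stored "links" as 1-D arrays along a link dimension (made-up data)
    links = xr.Dataset(
        {
            "weighted_value": ("link", np.array([6.0, 4.0, 5.0, 0.6, 1.4])),  # value * area
            "weight": ("link", np.array([0.6, 0.4, 1.0, 0.3, 0.7])),          # overlap area
        },
        coords={"dst_cell": ("link", np.array([0, 1, 1, 1, 2]))},
    )

    # group-by sum over the destination cell, then divide to get the weighted mean
    sums = flox.xarray.xarray_reduce(links, "dst_cell", func="sum")
    remapped = sums["weighted_value"] / sums["weight"]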

Anyhow, if others are really in favor of einsum/tensordot/etc. that is fine too … I was just curious as to why aggregations with group-by would not be considered (or perhaps others have already done in-depth comparisons). I personally haven’t worked with sparse matrices much and am much more comfortable with the aggregation-based approach. It wasn’t clear to me how to quickly create the appropriate sparse matrices from the set of source/destination grid index pairs, though obviously others have been doing this so it may be straightforward (I haven’t had time to thoroughly review the notebooks shared here).

1 Like

I wrote flox to adapt numpy-groupies to dask.

I encourage you to write up an experiment comparing xESMF, manual weight generation (like the first post here) + xr.dot, and flox, for both the numpy and dask use cases. It sounds like you’re halfway there.
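On the question above about building the sparse matrix from source/destination index pairs, a minimal untested sketch might look like this (the indices, sizes, and dimension names are made up):

    import numpy as np
    import sparse
    import xarray as xr

    # hypothetical "links" from the overlay: (source index, destination index, overlap area)
    src_idx = np.array([0, 0, 1, 2, 2])
    dst_idx = np.array([0, 1, 1, 1, 2])
    area    = np.array([0.6, 0.4, 1.0, 0.3, 0.7])
    n_src, n_dst = 3, 3

    # normalize the overlap areas so the weights for each destination cell sum to 1
    dst_area = np.zeros(n_dst)
    np.add.at(dst_area, dst_idx, area)
    frac = area / dst_area[dst_idx]

    # build the sparse weight matrix and wrap it so xr.dot can contract over it
    weights = xr.DataArray(
        sparse.COO(np.stack([src_idx, dst_idx]), frac, shape=(n_src, n_dst)),
        dims=("cell_src", "cell_dst"),
    )
    # regridded = xr.dot(field, weights, dims="cell_src")  # field needs a "cell_src" dim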

It would be a very cool post on the Pangeo blog or here.

2 Likes

Here’s how to do the sparse matrix multiply relatively easily using opt_einsum.

This is the core piece / monkey patch. It works with dask + sparse weights:

    import numpy as np
    import opt_einsum
    import xarray as xr

    def _apply(da):
        # 🐵 🔧 : swap in opt_einsum.contract, which handles the dask + sparse combination
        xr.core.duck_array_ops.einsum = opt_einsum.contract

        ans = xr.dot(
            da,
            weights,  # sparse-backed weights DataArray, defined elsewhere
            # This dimension will be "contracted"
            # or summed over after multiplying by the weights
            dims="ncol",
        )

        # 🐵 🔧 : restore back to the original implementation
        xr.core.duck_array_ops.einsum = np.einsum

        return ans
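Usage is then just something like:

regridded = _apply(ds[var])

where ds[var] stands in for whatever dask-backed field you want to regrid and weights is a sparse-backed DataArray sharing the ncol dimension (names as in the snippet above).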
3 Likes

Hi, I was curious to see if there has been any progress on this?

I found this repo (xarray-regrid) that seems to have implemented a solution for rectilinear grids; their benchmark results seem promising.
Are there any other implementations (with dask support) of conservative regridding that do not rely on (heavy) packages, or is xESMF/CDO the way to go?