Intersecting shapefiles and raster data at scale - good design patterns?

Simple high level requirement: create a tabular dataset for machine learning by intersecting EO raster data with ground truth in shapefile geometries. An in-core dataset can usually be created this way (it’s not necessarily a massive dataset once you pull the data you want).

I’ve got something working. I’m okay creating an xarray dataset of stacked EO arrays (e.g. each one named _). I can also load the shapefile as a geoDataFrame in geopandas and then pass it to make_geocube to create a rasterized xarray dataset of the features from the shapefile (which would be ground truth labels). I can then add this rasterized ground truth dataset to the EO dataset using xarray merge. Then I cast it all to a dask dataframe, dropna (on the output label), and save to parquet. This creates a nice tabular dataset with each feature and ground truth label as a column and the pixels as rows, with only the valid data (where there was a ground truth polygon).
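For concreteness, here's that whole flow in miniature, with synthetic arrays standing in for the EO bands and for the make_geocube output, and plain pandas in place of the dask dataframe step (all names here are made up):

```python
import numpy as np
import xarray as xr

# Synthetic stand-ins: two "EO band" arrays plus a sparse rasterized
# label layer (in the real workflow the bands come from EO rasters and
# the labels from make_geocube rasterizing the shapefile features)
y, x = np.arange(4), np.arange(5)
ds = xr.Dataset(
    {
        "band02": (("y", "x"), np.random.rand(4, 5)),
        "band04": (("y", "x"), np.random.rand(4, 5)),
    },
    coords={"y": y, "x": x},
)
labels = np.full((4, 5), np.nan)
labels[1, 1] = 1.0  # only two pixels fall inside a "polygon"
labels[2, 3] = 2.0
ds["class_label"] = (("y", "x"), labels)

# One row per pixel, one column per variable; keep only labelled pixels
df = ds.to_dataframe().dropna(subset=["class_label"]).reset_index()
# df.to_parquet("training.parquet")  # final step in the real pipeline
```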

But when you have a large area with sparse polygons (widely separated on the ground), you waste a lot of time processing chunks that you know are empty of useful data. It feels like the best idea would be to just iterate over the polygons (rows in the shapefile), but this is where I could use some advice. What would be a good design pattern here? It feels like I could create a Python list of xarray datasets of each rasterized polygon, merge each with the EO rasters (keeping just the extent of the rasterized polygon), and then iterate over that list of small datasets (each focused on one area of interest) and output them to parquet (or whatever). And then I come up against how to do that efficiently with dask. I first, naively, thought I could create a function that casts a dataset to a dask dataframe, drops NAs, and outputs to parquet, and dispatch this to workers using dask. That gave me an error about serializing async objects (or somesuch, sorry for the hand-wavy error summary). I can simply iterate over each of my small xarray datasets in the list and pass them to my helper function. I've done this, and each dataset goes like greased lightning once it's serialized and sent to the workers. And there's the problem. The serialization overhead for each small dataset is horrendous, and the whole process took twice as long as it did before (when crunching through and filtering out vast swaths of NA).

So what's a suggested design pattern here? Some sort of dataset whereby I can optimally process just the bounding boxes for the known polygons. Put them all in a single dataset and add a dimension that's an ID for each one? I'm fairly new to all of geospatial, xarray, and dask, so there seem to be a million moving pieces at the moment. Thanks for reading and I look forward to getting to know this community more!



Welcome to the Pangeo community, @Guy_Maskall! I’ll caveat my answer by saying that I just started actually using dask in my own workflow yesterday (and think it may have contributed to the GPU Panic kernel crash I’m dealing with now…), though I’ve been thinking about it for much longer. I’d defer to @rsignell, @scottyhq, and others if they have a recommended design pattern, but (as someone who’s been doing geospatial work for over a decade, much of that in Python) I frequently encounter situations similar to the one you outlined so I’ll share how I recently solved a similar problem in my own work.

Overall, I think the recommended approach will come down to the relative size of your data and what your end goals are (so the most stock, unhelpful answer ever). In my case, I had few enough polygons (on the order of 100s) that, similar to yours, are relatively widely separated and leave large portions of the xarray dataset as NaN, so rasterizing them with make_geocube and then extracting the pixel values within each polygon was quite slow, especially since I don't anticipate needing the extracted pixel values again. Since I established my workflow with a companion xarray Dataset and geopandas GeoDataFrame for corresponding raster/vector data, I ended up finding it was much faster (~3 times) to just use iterrows on the dataframe geometry and use the function below to extract the pixels, later doing some math and adding the result back to the geodataframe:

import numpy as np
from rioxarray.exceptions import NoDataInBounds

def get_px_vals(datarow, geom_name, raster, crs=None):
    """Extract pixel values where the input geometry overlaps a raster.

    Speeds the process up by first subsetting the raster to the geometry
    bounds before extracting each overlapping pixel.
    Currently, it's assumed that the CRS of all inputs match.
    The input geometry is assumed to be a row of a GeoDataFrame, with
    geometry in column `geom_name`.

    Parameters
    ----------
    geom_name : string
        Name of the geometry column (cannot use the built-in 'geometry'
        column, because `apply()` turns it into a non-geo dataseries -
        currently noted as a bug with rioxarray (2021-01-11))
    raster : rioDataArray
    """
    # Bug? (rioxarray): rio.clip and rio.clip_box work, but just using
    # rio.clip_box with a specified crs (and bounds as the polygon)
    # gives a crs error. Huh?
    try:
        subset_raster = raster.rio.clip_box(*datarow[geom_name].bounds)
        # subset_raster = raster.rio.clip_box(*datarow[geom_name].bounds, crs=crs)
        vals = subset_raster.rio.clip([datarow[geom_name]], crs=crs).values.flatten()
        # rioxarray introduces a fill value, regardless of the input nodata setup
        vals[vals == -9999] = np.nan
    except NoDataInBounds:
        print('no data')
        vals = np.nan
    return vals

You’re welcome to use this or other code from my current workflow (the non-main branch usually has the most current implementations, since it’s an in-progress project). I look forward to hearing how you go about implementing this and what the thoughts of the other experts in the room are!


@Guy_Maskall do you have a (relatively) simple example demonstrating the issue? In particular: what’s the desired output? A dataframe like

label | value
    0 | 1.5

where the label comes from the vector data indicating the region that matches that pixel, and value comes from the raster data (maybe one column per band, so value_1, value_2, …)?

If you're able to provide an example I'd be happy to dig into the scaling side. Perhaps one of Jessica's notebooks in icebath/notebooks at master · JessicaS11/icebath · GitHub is a good example?


Not entirely sure whether I understand your problem correctly.

Try regionmask. Open the shapefile via geopandas and aggregate all values of an xarray dataset within a given region. Results in xarray can be converted to a dataframe. It also works with dask.
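In miniature, the pattern regionmask enables: an integer region mask you can group the dataset by. The mask below is hand-made for illustration; in real use, `regionmask.mask_geopandas(gdf, da.x, da.y)` would build it from the geodataframe:

```python
import numpy as np
import xarray as xr

da = xr.DataArray(np.arange(12.0).reshape(3, 4), dims=("y", "x"))
# Hand-made stand-in for regionmask.mask_geopandas(gdf, da.x, da.y):
# a region id per pixel, NaN outside all polygons
mask = xr.DataArray(
    [[0.0, 0.0, np.nan, np.nan],
     [0.0, np.nan, 1.0, 1.0],
     [np.nan, np.nan, 1.0, np.nan]],
    dims=("y", "x"), name="region",
)
# Pixels outside every polygon (NaN in the mask) drop out of the groupby
means = da.groupby(mask).mean()
df = means.to_dataframe(name="mean_value")  # xarray result -> dataframe
```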


Guy, this is one of the things that Geoscience Australia's Uncover-ML addresses - but for point targets, as opposed to all polygons at once.

It works quite nicely - it's an MPI + rasterio workflow. A few Windows bugs, given it was basically designed around HPC work.

You have a shapefile of targets and it takes the raster data from your feature input list/directory.

Thanks Jessica!

I think crs handling gets tinkered with over time/versions?

Hi Guy,

I’ve been building a python package that might offer what you’re looking for: GitHub - ks905383/xagg: Aggregating gridded data (xarray) to polygons

If you have raster data in an xarray dataset, and polygon data in a geopandas geodataframe, xagg returns the geographically-weighted average of the raster data over the polygons. In other words, for each polygon, it returns the average value of all the raster pixels that touch/overlap it, weighted by how much each pixel overlaps the polygon.

It can also deal with extra dimensions (so lon x lat x time data, for example), and it can currently output data to csv, xarray, shapefiles… I'm also always looking for what peoples' actual use cases are, especially in terms of output data format.

It’s generally pretty fast - I primarily work with climate data + administrative polygons, and it’s usually on the order of seconds to generate the weights.

I hope this is useful!


PS: in my mind, the benefit of xagg over regionmask (beyond the UI, which is made specifically for this aggregating task) is the greater accuracy - instead of just masking rasters, you discount pixels that barely overlap with the shapefile, which could be important in areas with complex topography or if your pixels are large. (There's also the possibility to weight using additional raster grids, but that's more of a thing that economists are looking for, through pop-weighting.)


Thanks for sharing Kevin!

Our general recommendation is that packages like xagg shouldn’t do I/O at all. Consume and produce standard data containers and leave the I/O to xarray / pandas etc.

Oh, I’m used to Windows bugs! WinError XXX, for which the only search results that seem to come up are along the lines of “Seems to be some weird edge case on Windows VMs to do with latency”. Great. ;-/

Hi @TomAugspurger ,

An ultimate end goal (note “an”, not “the”) is a dataframe like
Band02_20200119 | Band04_20200119 | ... | Band12_20181012 | ... | Label_1 | Label_2 | ... | Label_N
where BandXX_XXXXXXX come from EO data, and Label_k comes from the shapefile. In practice, N won’t be very large but it’s typically > 1 because I like to keep the polygon (e.g. field) ID in order to use it as a grouping variable in model cross-validation later on. So Label_1 might be “field_id” and Label_2 “class_label”.

I say "an ultimate end goal" because the dataframe structure (saved as parquet) isn't necessarily the only end goal. The fundamental capability I'd like is to be able to easily process spatially separated polygons (in this case, each row of a shapefile) on a dask worker, which rasterizes the features associated with that polygon and adds them to an EO dataset. I'm still developing the mental models for xarray (multidimensional datasets with dimensions and coordinates start to make my head spin).

I’m just now wondering if I could take my list of N datasets (corresponding to bounding boxes of M variables around N polygons from N rows of a shapefile) and merge them into a single dataset. You’d still end up with a bounding box over all coordinate values, right? Could I take the list of N datasets and create a single dataset but add a new dimension for each one? Then the x and y coordinates for each dataset would just be the contiguous values for that single (small) bounding box. Then I’d have a single dataset I could chunk along the new dimension. I’m pretty sure you can do that, I’m just not sure it’s the best approach. Hence pondering whether there are any good design patterns for this sort of thing (whilst trying to wrap my head around concatenate vs merge vs combine…)
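For example (a toy sketch, with made-up names and synthetic data): since the end product is tabular anyway, maybe I could flatten each small per-polygon dataset to pixel rows and just concatenate the tables, so the differing bounding-box shapes never need to share a grid:

```python
import numpy as np
import pandas as pd
import xarray as xr

def bbox_table(n, field_id):
    # Stand-in for one merged EO + rasterized-label dataset covering a
    # single polygon's n x n bounding box
    ds = xr.Dataset(
        {
            "band02": (("y", "x"), np.random.rand(n, n)),
            "class_label": (("y", "x"), np.full((n, n), float(field_id))),
        },
        coords={"y": np.arange(n), "x": np.arange(n)},
    )
    df = ds.to_dataframe().reset_index()
    df["field_id"] = field_id  # keep the polygon ID for grouped CV later
    return df

# Bounding boxes of different sizes concatenate fine as tables
tables = [bbox_table(n, i) for i, n in enumerate([2, 3, 4])]
df = pd.concat(tables, ignore_index=True)
# df.to_parquet("training.parquet")
```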

@RichardScottOZ Most definitely! If I don’t note it, I’ll spend time figuring it out a second time at a later date. Plus, it means I have a good spot to start when I have a few minutes to make some OSS contributions. : )

Yeah, the prediction part of some things in uncover-ml fails under multiprocessing - not ideal, but I was only running some test cases. Modified a few things running my own branch. Still a very nice framework though.

GitHub - stevenpawley/Pyspatialml: Machine learning modelling for spatial data also does some predicting on raster stacks with a BYO model - Steven Pawley is an R guy, so he probably hasn't done much tinkering with Dask; will have to ask.

Yes, good plan. That has bitten me before.