Advice for scalable raster-vector extraction?

I’m trying to use rioxarray+dask to compute the mean value inside each polygon given a raster layer (‘zonal statistics’) following the advice linked in the rioxarray docs (Example - Zonal Statistics — geocube 0.5.1 documentation)

I’ve tried to modify the example to use xarray and on-disk reads. Unfortunately, the resulting operation still takes over 30 GB of RAM, crashing out my session. How do I adjust this to stream this processing steps outside of RAM?

import geopandas
from geocube.api.core import make_geocube
import multiprocessing.popen_spawn_posix
from dask.distributed import Client, LocalCluster, Lock
import rioxarray

parquet = "https://huggingface.co/datasets/boettiger-lab/pad-us-3/resolve/main/pad-mobi.parquet"
# apparently we cannot read this directly from the URL with geopandas.  download first.
geo = geopandas.read_parquet("pad-mobi.parquet") # ~ 4.8 GB RAM

band_name = "human_impact"
cog = "https://data.source.coop/vizzuality/hfp-100/hfp_2021_100m_v1-2_cog.tif"

raster = (rioxarray.
              open_rasterio('/vsicurl/'+cog, masked=True,chunks=True, lock=False).
              rio.clip(geo.geometry.values, geo.crs, from_disk=True).
              sel(band=1).drop_vars("band")
         )

This already crashes on my 30GB instance. Ideally I’d like to continue with the raster extraction of the means like so:

out_grid = make_geocube(
    vector_data=geo,
    measurements=['row_n'],
    like=raster, # ensure the data are on the same grid
)
out_grid["values"] = (raster.dims, raster.values, raster.attrs, raster.encoding)
grouped_raster = out_grid.drop_vars("spatial_ref").groupby(out_grid.row_n)

grid_mean = grouped_raster.mean().rename({"values": band_name})
zonal_stats = xarray.merge([grid_mean]).to_dataframe()
geo = geo.merge(zonal_stats, how="left", on='row_n')
geo.to_parquet("test.parquet")

as written this seems to use even more RAM, (and in both cases seems also to be executing on only a single thread).

EDITS:

Because the raster extent is global while the polygons are in the US, this particular use case can benefit from: rio.clip_box(*geo.total_bounds, crs=geo.crs) before doing rio.clip() (a trick mentioned in the rioxarray docs). I’m still not getting disk-based chunking though and still crashing on the raster extraction.

I can lean in further to the box-subsetting by trying to draw small numbers of polygons out of the parquet at a time and loop over. This seems like a natural solution for dask, but so this does not appear to running on dask at all?

1 Like

For comparison, this is the approach I use in R, where I am a bit more familiar. We use a one-line command to read each of the raster and vector objects, and one more line to extract the zonal statistics in parallel.

library(sf)
library(gdalcubes)
gdalcubes_options(parallel=TRUE)

## read raster
cog = "https://data.source.coop/vizzuality/hfp-100/hfp_2021_100m_v1-2_cog.tif"
cog = basename(cog) # using downloaded copy for speed
raster = stack_cube(cog, "2022-01-01", bands = "human_impact")

# read vector (subset)
parquet = "https://huggingface.co/datasets/boettiger-lab/pad-us-3/resolve/main/pad-mobi.parquet"
parquet = basename(parquet) # using downloaded copy for speed
vct = st_read(parquet)

# zonal stats
result = extract_geom(raster, vct[1:50000,], FUN=mean)

The R version works outside of RAM for the raster operations by streaming the relevant ranges from disk. It does require the vector object is in memory, so I’m operating on the first 50,000 (of ~ 410K) polygons above, but this runs in about a minute on my machine using only about 4 or 5 GB of RAM. I imagine we can get similar or better performance on the python side but I can’t seem to figure it out.

I would rasterize it as a pre step (centre point of pixel falls inside polygon, approximation at a chosen resolution).

But, do you need exact metrics? Because exactextract now has Python bindings (ee does geometry overlay to get proportion for boundary-crossed pixels as polygons, and flood fills the rest).

thanks – no, not too worried about exact methods, so rasterizing makes sense. Not sure about the internals of extract_geom on that, but it looks to me like that is what the documented approach I’m using in from geocube does? (i.e. the “out_grid” creation? I’m not all that familiar with the geocube package data model).

exactextract seems to be based on in-memory rasters?

I think this is a relatively common calculation to do and one that conceptually at least seems like it should be ‘embarrassingly parallel’, there must be a nice pythonic way to do this that can leverage multiple threads and chunking without unreasonable RAM requirements? All the data used in the example above are public sources if anyone wants to test this out.

If you rasterize your polygons, and don’t care about exactness too much, see flox: Faster GroupBy reductions with Xarray for a totally alternative approach to “zonal” statistics.

create masks of geospatial regions — regionmask 0.12.1 documentation might help with rasterizing the polygons

3 Likes

There is this, too:- Support for rasterisation to dask arrays · Issue #41 · corteva/geocube · GitHub

1 Like

I’m certainly interested to try! It’s fairly easy to set up block-read of tiles, and with newish GDAL you can use “vrt://{dsn}?srcwin=offsetx,offsety,size_x,size_y” which would template trivially into a dsn for every tile. I’ve got helpers in R and will try doing same in Python.

But, would it be efficient to extract from the entire polygon layer without batching polygons to tiles, or vice versa? (maybe it would be)

Something to find out, Michael!

Well it definitely isn’t in terra (it might be if we use GDAL more closely), but we can do a windowed-read on extent which is a standard GDAL capability, exposed in sf and in terra (we have to project the extent to match the vector with proper buffering-out), and then it is pretty efficient, although it will over-read the vector for each tile. So it does parallelize nicely from a table of srcwin reads and tile extents, the logic to create the table of tile offsets is pretty strightforward but I don’t know if Python helpers exist for that.

The GDAL arguments I’m using are “-srcwin” ‘xoff yoff xsize ysize’ for raster Translate which is 0,0,512,512 for the first top left tile and increment from there (I’m using 4x4 tiles read in my tests), and “-spat” ‘xmin ymin xmax ymax’ for vector Translate.

I’m pretty confident that this worked, I ran it in parallel across 64 cpus and it took about 10min ~30min~ (when I took out the redundant vect() open in every iteration).

needs some translation into python world, and to cull out obviously non-target tiles (could cull to the Mollweide boundary also, but cull to the projected vector extent is enough)…

I didn’t realize there was so little overlap of the vector to raster until I’d finished (raster is global Mollweide, polygons are conus equal area Albers), there’s an overhead to generating the projected extent and empty windowed read of the vector layer, but not much. (I used downloaded copies of the 13Gb raster and 1Gb vector not the vsicurls).

Apologies this is a bit hijack-y from the OP, but I’m keen to follow up and relate it to that closely.

1 Like

Hi Carl, so the issue is dask does not operate on polygons. Ultimately the polygons must be converted to a integer mask which has the same dimensions as the raster. As far as I can tell rioxarray/geocube keeps this polygon mask in memory, so that is where all the memory issues come from. I cannot find a way to easily write the mask to disk, where it could be read in chunks with rioxarray.

I made some progress by using a clip by bounds instead of clip by mask in the 1st step. But I still run into memory issues on my little laptop.

import numpy as np
import geopandas as gpd
import xarray
import rioxarray
from geocube.api.core import make_geocube
import rasterio

tif_file = './hfp_2021_100m_v1-2_cog.tif'
pad_file = './pad-mobi.parquet'

band_name = "human_impact"


with rasterio.open(tif_file) as src:
    raster_profile = src.profile

pad = gpd.read_parquet(pad_file)

pad = pad.to_crs(raster_profile['crs'])

# Some small adjacent polygons to test with
#pad_subset_ids = [98293, 98308, 98290]
#pad = pad.query("row_n in @pad_subset_ids")


raster = (
    rioxarray
    .open_rasterio(tif_file, chunks=True)
    #.rio.clip(pad.geometry.values, pad.crs, from_disk=True, drop=True)
    .rio.clip_box(*pad.total_bounds)
    .rename('human_impact')
)

out_grid = make_geocube(
    vector_data=pad,
    measurements=['row_n'],
    like=raster, # ensure the data are on the same grid
)

grouped_raster = xarray.merge([out_grid, raster])

zonal_stats = (
    grouped_raster
    .drop_vars("spatial_ref")
    .groupby(out_grid.row_n)
    .mean()
    .to_dataframe()
    .reset_index()
    .merge(pad, how='right', on='row_n')
    )

Alternatively if you just want to get the zonal stats and move on, the rasterstats package works well and is pretty memory safe. Here’s a joblib wrapper for it to run in parallel. This does not use dask/xarray at all and ran for me in 30 min.

import rasterio
from rasterstats import zonal_stats
import geopandas as gpd
import pandas as pd

from joblib import Parallel, delayed

tif_file = './hfp_2021_100m_v1-2_cog.tif'
pad_file = './pad-mobi.parquet'

with rasterio.open(tif_file) as src:
    raster_profile = src.profile

pad = gpd.read_parquet(pad_file).to_crs(raster_profile['crs'])

def get_stats(geom_info):
    stats = zonal_stats(    
        geom_info.geometry,
        tif_file,
        stats = ['mean'],
        )
    stats[0]['row_n'] = geom_info.row_n
    return stats[0]

jobs = []
for r in pad.itertuples():
    jobs.append(
        delayed(get_stats)(r)
        )

output = Parallel(n_jobs=5, verbose=10)(jobs)

pad_zonal_stats = (
    pd.DataFrame(output)
    .rename(columns={'mean':'human_impact'})
    .merge(pad, how='right', on = 'row_n')
    )
1 Like

It seems like a massive gap in the Python geospatial ecosystem that we don’t have a good analog for this very common workflow that can scale beyond RAM.

1 Like

Maybe this could help?
https://pangeo-xesmf.readthedocs.io/en/latest/notebooks/Spatial_Averaging.html

Hi!
We actually faced this issue in the openEO implementation of the aggregate_spatial (I can’t attach more than 2 links, I format it as code): https://processes.openeo.org/#aggregate_spatial) process, which performs zonal statistics.

We integrated the functionality to be used with Dask in the xvec package with this PR:
https://github.com/xarray-contrib/xvec/pull/52

Here are the docs for zonal stats using xvec:
https://xvec.readthedocs.io/en/stable/zonal_stats.html

And here is the discussion about scaling it over large datasets and the issue we encountered using Dask:

I hope this helps! Let me know if you have further specific questions about it.

2 Likes

@cboettig I wonder if xarray-spatial may be suitable in your case: GitHub - makepath/xarray-spatial: Raster-based Spatial Analytics for Python

I am actually looking too for a solution on how to do zonal statistics, given a reference mask ( the idea would be using the mask either: 1- to remove data, given the mask flag; 2- to assign weights, given reference values in the mask). Any advice on this welcome too :slight_smile:

Thanks everyone for all these great suggestions and additional python libraries to explore. I’ve taken a go at trying most of them and just wanted to share my experience, though that probably reflects more on my limitations than limitations of any of these great packages.

While it doesn’t use xarray or dask, @sdtaylor’s suggestion using rasterstats + joblib worked very nicely for me. Attempting to use rasterstats out-of-the-box on the whole vector + whole raster file crashes immediately, but breaking it up feature-by-feature as Shawn shows looks simple and scaled easily to use many threads with little RAM. I was able to process the data shown in under three minutes using 24 cores and about 6 GB RAM.

I have had less luck attempting to apply the other python libraries suggested, though some of this is no doubt my lack of understanding.

  • Trying to use xvec as documented at that link I run into the same RAM issues that I faced with official rioxarray + geocube strategy.

  • I was unable to test the xesmf , which seems to have some burly external library dependencies I couldn’t resolve.

  • xarray-spatial looks cool (some methods have dask GPU support! but not zonal methods). But I struggled to find docs, webpage appears to have lapsed in domain registration https://xarray-spatial.org/, the readme on zonal stats points me to the source code.

It’s exciting to see all these packages and I look forward to experimenting more and watching this space evolve more as well. Really appreciate all the discussion and ideas.

4 Likes

I wanted to follow up here and provide a Python way to generate the (approximate-case) mask raster from the polygons, and I hope to do that. But also wanted to point out that a rasterized polygon layer can be immensely compact by storing each start and end column of polygon ID by row in the grid. This is how R’s {fasterize} works but it actually materializes the entire output grid in memory still, currently. This repo is my step towards that efficiency and compaction, you can store immensely large grids this way, but it doesn’t match any format that I know.

In longer term dreaming, I wish that exactextract and this compact-fasterize approach would be folded down into something compatible, and cross-lang lib for use in R and Python and elsewhere. I think I can see my way into folding fasterize into GDAL at least, and maybe we can garner some consolidation across these exact and approximate methods.

1 Like

Thanks @Michael_Sumner , I need to try more with masking approaches.

Just another follow-up here that I’m still very much interested in working out some more robust python solutions here. while the rasterstats with brute-force parallelization was working very nicely with the ~ 400,000 small continental us polygons there, I’ve found it doesn’t handle larger/more polar polygons (looking at you, protected areas in Alaska) very well at all, and I seem to be crashing out of RAM with even a single polygon.

I’ll keep poking at this for a robust workflow. If anyone wants to explore, just take a look at this copy of the polygons from Source Cooperative, which is just a geopaquet version of the non-continuous US polygons from the USGS gdb file Protected Areas Database of the United States (PAD-US) 3.0 (ver. 2.0, March 2023) - ScienceBase-Catalog )

Happy Earth Day! Let’s play with some protected area data!

1 Like

Hi @cboettig

I tried running this through exactextract in Python on my laptop. The following takes about 10 minutes for the ~438,000 features (not clipped to CONUS) with 1 core and 3-4g RAM. Interestingly much of this time is consumed by just a couple of the polygons. I’ll look into these and see if I can find opportunities for improvement.

I’m expressing the input as a GDAL VRT so that we can reproject the input polygons without loading them all into memory.

from exactextract import exact_extract
from osgeo import ogr, gdal

gdal.UseExceptions()
ogr.UseExceptions()

rast = gdal.Open("/home/dan/data/hfp_2021_100m_v1-2_cog.tif")

padus_fname = "/home/dan/data/PADUS3_0Geopackage.gpkg"

vrt = f"""
<OGRVRTDataSource>
    <OGRVRTWarpedLayer>
        <OGRVRTLayer name="combined">
            <SrcDataSource>{padus_fname}</SrcDataSource>
            <SrcLayer>PADUS3_0Combined_DOD_TRIB_Fee_Designation_Easement</SrcLayer>
        </OGRVRTLayer>
        <TargetSRS>{rast.GetSpatialRef().ExportToWkt()}</TargetSRS>
    </OGRVRTWarpedLayer>
</OGRVRTDataSource>
"""

polys = ogr.Open(vrt)

exact_extract(rast, polys, 'mean', output='gdal', output_options = {'filename' : '/tmp/out.csv' })
1 Like

Nice one, good to see such tight and crafty GDAL usage, and I didn’t realize that the exactextract style would be like that :ok_hand: