I’m trying to use rioxarray+dask to compute the mean value inside each polygon given a raster layer (‘zonal statistics’) following the advice linked in the rioxarray docs (Example - Zonal Statistics — geocube 0.5.1 documentation)
I’ve tried to modify the example to use xarray and on-disk reads. Unfortunately, the resulting operation still takes over 30 GB of RAM, crashing out my session. How do I adjust this to stream this processing steps outside of RAM?
import geopandas
from geocube.api.core import make_geocube
import multiprocessing.popen_spawn_posix
from dask.distributed import Client, LocalCluster, Lock
import rioxarray
parquet = "https://huggingface.co/datasets/boettiger-lab/pad-us-3/resolve/main/pad-mobi.parquet"
# apparently we cannot read this directly from the URL with geopandas. download first.
geo = geopandas.read_parquet("pad-mobi.parquet") # ~ 4.8 GB RAM
band_name = "human_impact"
cog = "https://data.source.coop/vizzuality/hfp-100/hfp_2021_100m_v1-2_cog.tif"
raster = (rioxarray.
open_rasterio('/vsicurl/'+cog, masked=True,chunks=True, lock=False).
rio.clip(geo.geometry.values, geo.crs, from_disk=True).
sel(band=1).drop_vars("band")
)
This already crashes on my 30GB instance. Ideally I’d like to continue with the raster extraction of the means like so:
out_grid = make_geocube(
vector_data=geo,
measurements=['row_n'],
like=raster, # ensure the data are on the same grid
)
out_grid["values"] = (raster.dims, raster.values, raster.attrs, raster.encoding)
grouped_raster = out_grid.drop_vars("spatial_ref").groupby(out_grid.row_n)
grid_mean = grouped_raster.mean().rename({"values": band_name})
zonal_stats = xarray.merge([grid_mean]).to_dataframe()
geo = geo.merge(zonal_stats, how="left", on='row_n')
geo.to_parquet("test.parquet")
as written this seems to use even more RAM, (and in both cases seems also to be executing on only a single thread).
EDITS:
Because the raster extent is global while the polygons are in the US, this particular use case can benefit from: rio.clip_box(*geo.total_bounds, crs=geo.crs)
before doing rio.clip()
(a trick mentioned in the rioxarray docs). I’m still not getting disk-based chunking though and still crashing on the raster extraction.
I can lean in further to the box-subsetting by trying to draw small numbers of polygons out of the parquet at a time and loop over. This seems like a natural solution for dask, but so this does not appear to running on dask at all?