Hello everyone, I’m back after a long hiatus, but I need to pick up where I left off on this question!
To start off, I owe you all thanks for getting me a lot farther with your replies. I wrote two or three functions that were really useful in my workflow and do the job nicely… One relies on the coarsen() method that @rsignell threw out there.
import dask
import rioxarray as rxr
import xarray as xr

def coarsen_dask_arr(arr: xr.DataArray, scaling_value: int, resample_alg='average', mask=None):
    """
    Big help with the implementation from Rich Signell.
    :param arr: dask-backed DataArray opened as: rxr.open_rasterio(path).squeeze('band', drop=True)
    :param scaling_value: a whole integer. In the SSEBop context 5, 25 or 100.
    :param resample_alg: 'average', 'mean' or 'sum' accepted ('average' and 'mean' do the same thing).
    :param mask: currently unused.
    :return: DataArray that has been resampled based on the scaling factor.
    """
    # Setting split_large_chunks to False allows the large chunks that coarsen/reshape
    # can create and silences dask's PerformanceWarning about them.
    with dask.config.set(**{'array.slicing.split_large_chunks': False}):
        # https://docs.xarray.dev/en/v2022.06.0/generated/xarray.core.rolling.DataArrayCoarsen.construct.html
        if resample_alg in ('average', 'mean'):
            dsub = arr.coarsen(x=scaling_value, y=scaling_value, boundary='pad').mean()
        elif resample_alg == 'sum':
            # boundary='pad' fills the edge cells with NaN, which .sum() skips by default for float data
            dsub = arr.coarsen(x=scaling_value, y=scaling_value, boundary='pad').sum()
        else:
            raise ValueError(
                f"{resample_alg!r} is not supported, options are: 'average', 'mean' or 'sum'")
    return dsub
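For anyone who wants a quick picture of how this gets called, here is a minimal usage sketch; the file name, chunk sizes and scaling factor are placeholders, and the commented zonal-stats line assumes a hypothetical integer `zones` DataArray already aligned to the coarse grid:
import rioxarray as rxr

# hypothetical 1 km input raster, opened lazily with dask chunks
fine = rxr.open_rasterio('daily_etf.tif', chunks={'x': 1000, 'y': 1000}).squeeze('band', drop=True)
coarse = coarsen_dask_arr(fine, scaling_value=25, resample_alg='average')

# e.g. a simple zonal mean, grouping by an integer zone raster on the same coarse grid
# zonal_means = coarse.groupby(zones).mean()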
This allowed us to coarsen our arrays and do zonal statistics really nicely. Another thing that was huge: I discovered that you can open VRTs (GDAL virtual rasters) as dask-backed arrays, which let me open all the files more easily… Here is a function that resamples files to a common extent and resolution and returns a list of dask-backed arrays…
import sys

import dask
import rasterio
import rioxarray as riox
from rasterio.enums import Resampling
from rasterio.vrt import WarpedVRT

def normalize_to_std_grid_dask(inputs, nodatas=[], sample_file=None,
                               resamplemethod='nearest', outdtype='float64'):
    """
    Uses a rasterio virtual raster (WarpedVRT) to standardize grids of different
    crs, resolution and extent onto the grid of a sample file.
    :param inputs: a list of (daily) raster input files for SSEBop.
    :param nodatas: nodata value for each input file, in the same order.
    :param sample_file: raster whose grid (crs, transform, shape) the inputs are warped to.
    :return: list of dask-backed DataArrays on the sample grid.
    """
    outputs = []
    with rasterio.open(sample_file) as src:
        out_meta = src.meta
        crs = out_meta['crs']
        transform = out_meta['transform']
        cols = out_meta['width']
        rows = out_meta['height']

    if resamplemethod == 'nearest':
        rs = Resampling.nearest
    elif resamplemethod == 'average':
        rs = Resampling.average
    else:
        print('only nearest-neighbor and average resampling are supported at this time')
        sys.exit(1)

    for i, warpfile in enumerate(inputs):
        print(f'warping {warpfile}\n with nodata value: {nodatas[i]}')
        with rasterio.open(warpfile, 'r') as src:
            # create the virtual raster based on the grid attributes of the sample file
            nodata_val = nodatas[i]
            # (setting src.nodata requires 'r+' mode; passing src_nodata to WarpedVRT
            # is the read-only alternative)
            # src.nodata = nodata_val
            with WarpedVRT(src, resampling=rs,
                           crs=crs,
                           transform=transform,
                           height=rows,
                           width=cols,
                           dtype=outdtype) as vrt:
                # split_large_chunks=True asks dask to keep chunks small when slicing/reshaping
                with dask.config.set(**{'array.slicing.split_large_chunks': True}):
                    data = riox.open_rasterio(vrt, chunks=(1000, 1000)).squeeze('band', drop=True)
                    outputs.append(data)
    return outputs
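And a quick sketch of how I call it; the file names and nodata values below are just placeholders:
# hypothetical daily inputs to put on the grid of a sample 1 km file
inputs = ['etf_2020001.tif', 'ndvi_2020001.tif']
nodatas = [-9999, 0]
etf, ndvi = normalize_to_std_grid_dask(inputs, nodatas=nodatas,
                                       sample_file='tmax_sample.tif',
                                       resamplemethod='nearest')
# once everything is on one grid, lazy elementwise math across files just works
product = (etf * ndvi).compute()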
I never figured out how to make my earlier approaches with pyresample or rioxarray resampling work.
I did also need to bilinearly resample raster files, and this xarray .interp() approach worked for me but was memory-intensive:
# cfactor and tmax_ds are DataArrays on different grids; load cfactor into memory first
cfactor.load()
# interpolate cfactor onto the (finer) tmax grid
cfactor_bilinear_ds = cfactor.interp(y=tmax_ds['y'], x=tmax_ds['x'], method='linear')
This executes a linear interpolation in space, which was neat! I am putting it here because people may want to know about it. With files of the size I’m working with, ~2 GB rasters (i.e., the Earth at 1 km resolution), I was only able to get the linear interpolation above to work with 64 GB of RAM. My workaround when I have a lower-memory machine is to use rasterio’s WarpedVRT and virtually warp the file with bilinear interpolation, which works OK… So thanks, everybody, you really did set me down the right track.
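In case it helps anyone, here is a rough sketch of that lower-memory workaround; the file names are placeholders, and the target grid comes from a sample raster the same way it does in normalize_to_std_grid_dask above:
import rasterio
import rioxarray as riox
from rasterio.enums import Resampling
from rasterio.vrt import WarpedVRT

# hypothetical files: warp 'cfactor.tif' onto the grid of 'tmax_sample.tif'
with rasterio.open('tmax_sample.tif') as grid_src:
    dst_crs = grid_src.crs
    dst_transform = grid_src.transform
    dst_height, dst_width = grid_src.height, grid_src.width

with rasterio.open('cfactor.tif') as src:
    # the bilinear resampling happens lazily inside the virtual raster,
    # so the full-resolution result never has to sit in memory all at once
    with WarpedVRT(src, resampling=Resampling.bilinear,
                   crs=dst_crs, transform=dst_transform,
                   height=dst_height, width=dst_width) as vrt:
        cfactor_bilinear = riox.open_rasterio(vrt, chunks=(1000, 1000)).squeeze('band', drop=True)
        # compute/write while the VRT is still open
        cfactor_bilinear.rio.to_raster('cfactor_bilinear.tif')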
There are still issues with unmanaged memory and warnings that I get, but I’ll put more detail in a separate question and link it here, because I think it is related but separate…