At CNES, we are currently working on providing an OpenEO interface for querying and processing our geospatial datasets. When implementing OpenEO process graphs that involve a reprojection (typically from UTM to EPSG:4326 in the case of Sentinel-2 images), we hit a case where the reprojection is triggered eagerly at the time of the Python call, breaking the laziness of the underlying Dask graph.
The OpenEO part is just for context; I think this is a general problem, and I'm not sure whether a software solution already exists. I'll even go further: if this kind of reprojection problem could be solved or implemented nicely on Xarray datasets in some library, it would make (rio)xarray really powerful for working with georeferenced rasters.
So, perhaps create the VRT with rasterio and then open that into xarray via rioxarray? But what is the final result? Many disconnected scenes in EPSG:4326, or one big scene (perhaps at different time steps)? If the latter, you'll need each warp step to be in alignment with the rest. I'm not familiar enough with the tooling in xarray/dask here, so I'll try this out at some point. But, as above: do you get one big xarray or many disconnected ones from this workflow, and in either case, do they need to share the same overall (global) grid specification?
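Concretely, something like this is what I have in mind (a minimal sketch, assuming a local GeoTIFF named scene.tif; the chunk sizes and resampling choice are illustrative). rioxarray can open a rasterio WarpedVRT directly and keep the result Dask-lazy:

```python
import rasterio
import rioxarray
from rasterio.enums import Resampling
from rasterio.vrt import WarpedVRT

with rasterio.open("scene.tif") as src:
    with WarpedVRT(src, crs="EPSG:4326", resampling=Resampling.bilinear) as vrt:
        # chunks= gives a Dask-backed array; no pixels are warped yet
        da = rioxarray.open_rasterio(vrt, chunks={"x": 2048, "y": 2048})
        # Warping only happens per chunk, when something is actually computed
        subset = da.isel(x=slice(0, 2048), y=slice(0, 2048)).compute()
```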
On the odc-geo side, I’ve had good results using xr_reproject, so you could do something like:
from odc.geo.xr import xr_reproject
reprojected = xr_reproject(ds, how="EPSG:4326", resampling="bilinear")  # pick whatever resampling fits
(where ds would be your lazy Dataset or DataArray)
It will attempt to auto-guess the GSD if you only give it the CRS. Another option (probably the safest) would be to create a GeoBox (see the odc.geo.geobox module) defining your target grid and call xr_reproject(ds, how=<your geobox>, ...).
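A minimal sketch of the GeoBox route (the bounding box and resolution here are just illustrative):

```python
from odc.geo.geobox import GeoBox
from odc.geo.xr import xr_reproject

# Explicit target grid: 0.001-degree pixels over an example bbox
gbox = GeoBox.from_bbox((-10, 40, 10, 55), crs="EPSG:4326", resolution=0.001)
reprojected = xr_reproject(ds, how=gbox, resampling="bilinear")
```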
Just to add to the mix, there is yet another package on the scene for regridding with xarray: xarray-regrid.
This is aimed more at climate model grids and less at geospatial raster imagery. However, it would be great to have a unified approach that works with all types of grids.
xarray-regrid dev here. Currently xarray-regrid does not perform any reprojections, just regridding between rectilinear grids (within the same coordinate system). To implement reprojection, a routine would be needed that can generate a weights matrix to convert between one grid and another. If you're interested in this, just open an issue at the xarray-contrib/xarray-regrid repository on GitHub.
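For illustration, rectilinear regridding looks roughly like this (following the pattern in the README; the grid bounds and resolution are illustrative, and ds is a rectilinear lat/lon dataset):

```python
import xarray_regrid  # registers the .regrid accessor on xarray objects

# Define a 1-degree global target grid and regrid onto it
grid = xarray_regrid.Grid(
    north=90, south=-90, west=-180, east=180,
    resolution_lat=1.0, resolution_lon=1.0,
)
target = xarray_regrid.create_regridding_dataset(grid)
regridded = ds.regrid.linear(target)  # same CRS on both sides
```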
I don't have any results to share yet, but I wanted to let people know that I'm working on comparing the heap memory usage and the time required to warp/resample a high-resolution dataset (MUR-SST), motivated by memory usage exceeding AWS Lambda memory limits during tile generation. I wrote down the scope of this exercise in this readme, if anyone feels like commenting on the plan.
We hit this limit in Lambda too, and we requested that it be increased for the service. (We're creating COGs, not warping, in this instance, though I did experiment with warping at first; our code now is effectively a GDALTranslate() call, but in a very odc-specific context.)
You may also have a look at xcube, which may provide what you are looking for. It uses pyproj, supports Dask, and does not rely on GDAL. We are happy to get feedback of any kind! See the example for changing the CRS and the example for rectification.
Also see the documentation, which may, however, be incomplete…
Is there any example yet? Like, what tile would you write (bbox, shape, crs)? We've been working heavily with MURSST to convert it to COG, and it works fine to open a variable via “/vsicurl/https:…nc” (with Earthdata creds) and stream it to whatever tile is required, using any of the many downstream GDAL packages. A few of your approaches would be pretty much exactly the same, I expect, since the infrastructure is common across osgeo.gdal, rasterio, raster_tools, geoutils, and odc.geo.
Hi @gunbra32, thanks for chiming in. I've been aware of xcube for a long time but never had the chance (or the time) to take a deeper look at it. I was under the impression that, to benefit from xcube, there was first an ingestion phase for the dataset; is this not the case?
In the end, what are the differences between odc-geo, pyresample, xcube, and raster-tools?
Do they rely on the same approach for reprojection?
Do they use different sets of low-level libraries (e.g. GDAL, rasterio, others)?
There is no ingestion of data required to use xcube; it works with xarray datasets. We created a straightforward dataset convention, which follows CF in most parts: a dataset must have lon/lat or x/y as spatial dimensions and coordinates, and must encode the CRS in a variable named crs.
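As an illustration of that convention, a minimal conforming dataset could be built like this (variable names and grid are made up; pyproj's to_cf() produces the CF grid-mapping attributes):

```python
import numpy as np
import xarray as xr
from pyproj import CRS

ds = xr.Dataset(
    {"sst": (("lat", "lon"), np.zeros((180, 360), dtype="float32"))},
    coords={
        "lat": np.linspace(-89.5, 89.5, 180),
        "lon": np.linspace(-179.5, 179.5, 360),
    },
)
# Encode the CRS in a `crs` variable using CF grid-mapping attributes
ds["crs"] = xr.DataArray(0, attrs=CRS.from_epsg(4326).to_cf())
ds["sst"].attrs["grid_mapping"] = "crs"
```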
For resample_in_space(), xcube uses pyproj and dask-image, and we generally avoid any high-level dependencies like odc-geo or GDAL. I don't have a comprehensive overview of the other tools you mentioned, but I'd guess there is considerable overlap. It would indeed be interesting to see the differences and similarities between them.
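A minimal sketch of a typical call (the target grid size, origin, and resolution are just illustrative; check the xcube docs for your version):

```python
from xcube.core.gridmapping import GridMapping
from xcube.core.resampling import resample_in_space

# Target: a regular 0.05-degree global grid in EPSG:4326
target_gm = GridMapping.regular(
    size=(7200, 3600), xy_min=(-180, -90), xy_res=0.05, crs="EPSG:4326"
)
resampled = resample_in_space(ds, target_gm=target_gm)  # stays Dask-lazy
```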
For your specific questions, the struggle is with Web Mercator tiles at zoom levels 0-2. You can see the motivations for my ongoing exploration at the end of this notebook: “Tile Generation Benchmarks using CMR” (Tile Benchmarking). What infrastructure are you using for COG generation from MURSST?
What infrastructure are you using for COG generation from MURSST?
I prototyped on normal VMs and some HPC (Slurm) with R and Python, but we've baked it into a Lambda task now that has been creating them for several months as new data comes in. I haven't run the backlog yet, as I was waiting for any major issues to appear (I have one niggle with the latitude scaling for average resampling in overviews …).
For your specific questions, the struggle is with Web Mercator tiles at zoom levels 0-2.
So, just normal Web Mercator with 256x256 tiles for the 3 zooms. I am looking at the notebooks you listed, but what I'm looking for is the grid specification for the output, which is seemingly embedded in packages I can't see the source for (where is ‘cmr_tile_test’?). So, to be explicit with pseudocode:
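Something like this is what I mean, a rough sketch of the output grid spec for one tile (zoom z, column x, row y; the function shape is mine, just to make the question concrete):

```python
def tile_spec(z: int, x: int, y: int, tile_size: int = 256):
    half = 20037508.342789244                    # Web Mercator half-extent, metres
    res = (2 * half) / ((2 ** z) * tile_size)    # metres per pixel at zoom z
    xmin = -half + x * tile_size * res
    ymax = half - y * tile_size * res
    return {
        "crs": "EPSG:3857",
        "bbox": (xmin, ymax - tile_size * res, xmin + tile_size * res, ymax),
        "shape": (tile_size, tile_size),
    }

print(tile_spec(0, 0, 0))  # the single zoom-0 tile covering the whole world
```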
But maybe you use 512-pixel blocks? I appreciate you answering my questions! These notebooks have idioms I can understand, but I'm not sure how to navigate them to find details. I expect there are probably some problems with the weird extent of MURSST that might mess with the wrap (but I'll check; the warper is usually really good with these issues).
Oh! I just realized the crux of the struggle, as you put it. I will explore that from the GDAL side. Yes, it's a very big dataset to be generating 1, 4, or 16 tiles from at native resolution. It works within the COG processing, though; it is memory-intensive, but it fits within a Lambda task that writes whole objects to S3. This is really why I'm creating COGs: so I can use large areas from MURSST without decimating from native resolution. I don't think there's really a great answer; it will just take time and memory to visit every tile for the few pixels needed. But then maybe there are some tweaks available for tilers (for example, what about generating the zooms > 2 first and then using zoom 3 to generate 0, 1, and 2?).
And of course, we could push through the backlog and publish the whole dataset on an open S3 bucket, and then tiling would be utterly trivial by visiting the tiles in the COGs. But I'm worried about the calculated average values at lower zooms per latitude; I would appreciate some conversation about that in particular.
Your use cases seem to have in common the transformation to COG, which seems to be motivated by visualisation. In xcube, we decided to create multi-resolution Zarrs, and we are very satisfied with them, also for visualisation purposes. Since the future format for the Sentinels will be Zarr anyway, operating expensive format-conversion workflows does not seem to be the most efficient approach.
To this end, we created a multi-resolution Zarr format, which is basically a pyramid of datasets persisted as Zarrs, plus some metadata. A similar approach is now also being discussed for the upcoming GeoZarr spec. Our main use case for this is also visualisation, e.g. here.
The viewer is also available as a Jupyter plugin; see the minimalistic example below for opening a large (~1 TiB per band) multi-level dataset from S3 and visualising it interactively with low memory requirements, typically < 2 GiB.
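A minimal sketch of what that looks like (the bucket and dataset names are placeholders; check the xcube docs for the exact store and viewer options):

```python
from xcube.core.store import new_data_store
from xcube.webapi.viewer import Viewer

# Open a multi-resolution (".levels") dataset straight from S3
store = new_data_store("s3", root="my-bucket")
ml_ds = store.open_data("my-cube.levels")

# Interactive map in the notebook; only the tiles in view are loaded
viewer = Viewer()
viewer.add_dataset(ml_ds)
viewer.show()
```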
You could likely do the same thing, but with a GDAL warped VRT for the reprojection.
I regularly use VRTs inside rioxarray.open_rasterio, and it supports lazy evaluation pretty well. I have even used gdalwarp to reproject many disjoint tiles into a common CRS, then gdalbuildvrt to combine them, with the resampling all done in a lazy fashion with Dask, and finally opened the result with rioxarray with chunks.
My most common use case is going from COG to Zarr, though. I don't actively reproject/resample Zarr this way.
For the case of many, many COGs to Zarr, I have found these GDAL env vars are important to tune:
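For example, options along these lines (an illustrative set with illustrative values, not a definitive list; tune per workload). In Python you can scope them with rasterio's Env instead of exporting shell variables:

```python
import rioxarray
from rasterio.env import Env

with Env(
    GDAL_DISABLE_READDIR_ON_OPEN="EMPTY_DIR",  # skip directory listings on open
    GDAL_CACHEMAX=512,                          # raster block cache, in MB
    CPL_VSIL_CURL_CACHE_SIZE=200_000_000,       # HTTP read cache, in bytes
    GDAL_HTTP_MAX_RETRY=5,                      # retry transient HTTP errors
    GDAL_HTTP_RETRY_DELAY=1,                    # seconds between retries
    VSI_CACHE=True,                             # cache /vsi* file reads
):
    da = rioxarray.open_rasterio("mosaic.vrt", chunks=True)
```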
I added a gist here to show how I first build a mosaic of all the TIFFs in a given CRS (think one UTM zone) using gdalbuildvrt, then reproject/resample with gdalwarp for each CRS, and finally merge everything into one large mosaic with gdalbuildvrt and open it with rioxarray.open_rasterio using Dask.
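In outline, the workflow is something like this sketch (file names are placeholders; gdal.BuildVRT and gdal.Warp are the Python equivalents of the CLI tools — see the gist for the real thing):

```python
from osgeo import gdal
import rioxarray

gdal.UseExceptions()

# 1. Mosaic all tifs that share a CRS (e.g. one UTM zone) into a VRT
gdal.BuildVRT("utm32.vrt", ["a_32632.tif", "b_32632.tif"])

# 2. Reproject/resample each per-CRS mosaic to the common CRS, still as a VRT
gdal.Warp("utm32_4326.vrt", "utm32.vrt", format="VRT",
          dstSRS="EPSG:4326", resampleAlg="bilinear")

# 3. Merge the per-CRS results into one large mosaic
gdal.BuildVRT("mosaic_4326.vrt", ["utm32_4326.vrt", "utm33_4326.vrt"])

# 4. Open lazily with Dask-backed chunks; nothing is warped until compute
da = rioxarray.open_rasterio("mosaic_4326.vrt", chunks={"x": 4096, "y": 4096})
```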