At CNES, we are currently working on providing an OpenEO interface for querying and processing our geospatial datasets. When implementing OpenEO process graphs that involve a reprojection (typically from UTM to EPSG:4326 in the case of Sentinel-2 images), we hit a case where the reprojection is triggered eagerly at the time of the Python call, breaking the laziness of the underlying Dask graph.
The OpenEO part is just for context; I think this is a general problem, and I'm not sure whether a software solution already exists. I'll even go further: if this kind of reprojection problem could be solved or implemented nicely on Xarray datasets in some library, it would make (rio)xarray really powerful for working with georeferenced rasters.
So, perhaps create the VRT with rasterio and then open that into xarray via rioxarray? But what is the final result? Many disconnected scenes in EPSG:4326, or one big scene (perhaps at different time steps)? If the latter, you'll need each warp step to be in alignment with the rest. I'm not familiar enough with the tooling in xarray/dask here, so I'll try this out at some point. But, as above: do you get one big xarray or many disconnected ones from this workflow, and in either case, do they need to share the same overall (global) grid specification?
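Concretely, something like this is what I have in mind (a minimal sketch, assuming a local GeoTIFF named scene.tif; the chunk sizes and resampling choice are illustrative). rioxarray can open a rasterio WarpedVRT directly and keep the result Dask-lazy:

```python
import rasterio
import rioxarray
from rasterio.enums import Resampling
from rasterio.vrt import WarpedVRT

with rasterio.open("scene.tif") as src:
    with WarpedVRT(src, crs="EPSG:4326", resampling=Resampling.bilinear) as vrt:
        # chunks= gives a Dask-backed array; no pixels are warped yet
        da = rioxarray.open_rasterio(vrt, chunks={"x": 2048, "y": 2048})
        # Warping only happens per chunk, when something is actually computed
        subset = da.isel(x=slice(0, 2048), y=slice(0, 2048)).compute()
```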
On the odc-geo side, I’ve had good results using xr_reproject, so you could do something like:
from odc.geo.xr import xr_reproject
reprojected = xr_reproject(ds, how="EPSG:4326", resampling="bilinear")  # pick whatever resampling fits
(where ds would be your lazy Dataset or DataArray)
It will attempt to auto-guess the GSD if you only give it the CRS. Another option (probably the safest) would be to create a GeoBox (see the odc.geo.geobox module) defining your target grid and call xr_reproject(ds, how=<your geobox>, ...).
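A minimal sketch of the GeoBox route (the bounding box and resolution here are just illustrative):

```python
from odc.geo.geobox import GeoBox
from odc.geo.xr import xr_reproject

# Explicit target grid: 0.001-degree pixels over an example bbox
gbox = GeoBox.from_bbox((-10, 40, 10, 55), crs="EPSG:4326", resolution=0.001)
reprojected = xr_reproject(ds, how=gbox, resampling="bilinear")
```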
Just to add to the mix, there is yet another package on the scene for regridding with xarray: xarray-regrid.
This is aimed more at climate model grids and less at geospatial raster imagery. However, it would be great to have a unified approach that works with all types of grids.
xarray-regrid dev here. Currently xarray-regrid does not perform any reprojections, just regridding between rectilinear grids (within the same coordinate system). To implement reprojection, a routine would be needed that can generate a weights matrix to convert between one grid and another. If you're interested in this, just open an issue at the xarray-contrib/xarray-regrid repository on GitHub.
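For illustration, rectilinear regridding looks roughly like this (following the pattern in the README; the grid bounds and resolution are illustrative, and ds is a rectilinear lat/lon dataset):

```python
import xarray_regrid  # registers the .regrid accessor on xarray objects

# Define a 1-degree global target grid and regrid onto it
grid = xarray_regrid.Grid(
    north=90, south=-90, west=-180, east=180,
    resolution_lat=1.0, resolution_lon=1.0,
)
target = xarray_regrid.create_regridding_dataset(grid)
regridded = ds.regrid.linear(target)  # same CRS on both sides
```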
I don't have any results to share yet, but I wanted to let people know that I'm working on comparing the heap memory usage and the time required to warp/resample a high-resolution dataset (MUR-SST), motivated by memory usage exceeding AWS Lambda memory limits during tile generation. I wrote down the scope of this exercise in this readme, if anyone feels like commenting on the plan.
We hit this limit in Lambda too, and we requested that it be increased for the service. (We're creating COGs, not warping, in this instance, though I did experiment with warping at first; our code now is effectively a GDALTranslate() call, but in a very odc-specific context.)
You may also have a look at xcube, which may provide what you are looking for. It uses pyproj, supports Dask, and does not rely on GDAL. We are happy to get feedback of any kind! See the example for changing the CRS and the example for rectification.
Also see the documentation, which may, however, be incomplete…
Is there any example yet? Like, what tile would you write (bbox, shape, crs)? We've been working heavily with MURSST to convert it to COG, and it works fine to open a variable via “/vsicurl/https:…nc” (with Earthdata creds) and stream it to whatever tile is required, using any of the many downstream GDAL packages. A few of your approaches would be pretty much exactly the same, I expect, since the infrastructure is common across osgeo.gdal, rasterio, raster_tools, geoutils, and odc.geo.
Hi @gunbra32, thanks for chiming in. I've been aware of xcube for a long time but never had the chance (or the time) to take a deeper look at it. I was under the impression that, to benefit from xcube, there was first an ingestion phase for the dataset; is this not the case?
In the end, what are the differences between odc-geo, pyresample, xcube, and raster-tools?
Do they rely on the same approach for reprojection?
Do they use different sets of low-level libraries (e.g. GDAL, rasterio, others)?
There is no ingestion of data required to use xcube; it works with xarray datasets. We created a straightforward dataset convention, which follows CF in most parts: a dataset must have lon/lat or x/y as spatial dimensions and coordinates, and must encode the CRS in a variable named crs.
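As an illustration of that convention, a minimal conforming dataset could be built like this (variable names and grid are made up; pyproj's to_cf() produces the CF grid-mapping attributes):

```python
import numpy as np
import xarray as xr
from pyproj import CRS

ds = xr.Dataset(
    {"sst": (("lat", "lon"), np.zeros((180, 360), dtype="float32"))},
    coords={
        "lat": np.linspace(-89.5, 89.5, 180),
        "lon": np.linspace(-179.5, 179.5, 360),
    },
)
# Encode the CRS in a `crs` variable using CF grid-mapping attributes
ds["crs"] = xr.DataArray(0, attrs=CRS.from_epsg(4326).to_cf())
ds["sst"].attrs["grid_mapping"] = "crs"
```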
For resample_in_space(), xcube uses pyproj and dask-image, and we generally avoid any high-level dependencies like odc-geo or GDAL. I don't have a comprehensive overview of the other tools you mentioned, but I'd guess there is considerable overlap. It would indeed be interesting to see the differences and similarities between them.
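A minimal sketch of a typical call (the target grid size, origin, and resolution are just illustrative; check the xcube docs for your version):

```python
from xcube.core.gridmapping import GridMapping
from xcube.core.resampling import resample_in_space

# Target: a regular 0.05-degree global grid in EPSG:4326
target_gm = GridMapping.regular(
    size=(7200, 3600), xy_min=(-180, -90), xy_res=0.05, crs="EPSG:4326"
)
resampled = resample_in_space(ds, target_gm=target_gm)  # stays Dask-lazy
```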
For your specific questions, the struggle is with Web Mercator tiles at zoom levels 0-2. You can see the motivations for my ongoing exploration at the end of this notebook: “Tile Generation Benchmarks using CMR” (Tile Benchmarking). What infrastructure are you using for COG generation from MURSST?
What infrastructure are you using for COG generation from MURSST?
I prototyped on normal VMs and some HPC (Slurm) with R and Python, but we've baked it into a Lambda task now that has been creating them for several months as new data comes in. I haven't run the backlog yet, as I was waiting for any major issues to appear (I have one niggle with the latitude scaling for average resampling in overviews …).
For your specific questions, the struggle is with Web Mercator tiles at zoom levels 0-2.
So, just normal Web Mercator with 256x256 tiles for the 3 zooms. I am looking at the notebooks you listed, but what I'm looking for is the grid specification for the output, which is seemingly embedded in packages I can't see the source for (where is ‘cmr_tile_test’?). So, to be explicit with pseudocode:
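Something like this is what I mean, a rough sketch of the output grid spec for one tile (zoom z, column x, row y; the function shape is mine, just to make the question concrete):

```python
def tile_spec(z: int, x: int, y: int, tile_size: int = 256):
    half = 20037508.342789244                    # Web Mercator half-extent, metres
    res = (2 * half) / ((2 ** z) * tile_size)    # metres per pixel at zoom z
    xmin = -half + x * tile_size * res
    ymax = half - y * tile_size * res
    return {
        "crs": "EPSG:3857",
        "bbox": (xmin, ymax - tile_size * res, xmin + tile_size * res, ymax),
        "shape": (tile_size, tile_size),
    }

print(tile_spec(0, 0, 0))  # the single zoom-0 tile covering the whole world
```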
But maybe you use 512-pixel blocks? I appreciate you answering my questions! These notebooks have idioms I can understand, but I'm not sure how to navigate them to find details. I expect there are probably some problems with the weird extent of MURSST that might mess with the wrap (but I'll check; the warper is usually really good with these issues).
Oh! I just realized the crux of the struggle, as you put it. I will explore that from the GDAL side. Yes, it's a very big dataset to be generating 1, 4, or 16 tiles from at native resolution. It works within the COG processing, though; it is memory-intensive, but it fits within a Lambda task that writes whole objects to S3. This is really why I'm creating COGs: so I can use large areas from MURSST without decimating from native resolution. I don't think there's really a great answer; it will just take time and memory to visit every tile for the few pixels needed. But then maybe there are some tweaks available for tilers (for example, what about generating the zooms > 2 first and then using zoom 3 to generate 0, 1, and 2?).
And of course, we could push through the backlog and publish the whole dataset on an open S3 bucket, and then tiling would be utterly trivial by visiting the tiles in the COGs. But I'm worried about the calculated average values at lower zooms per latitude; I would appreciate some conversation about that in particular.
Your use cases seem to have in common the transformation to COG, which seems to be motivated by visualisation. In xcube, we decided to create multi-resolution Zarrs, and we are very satisfied with them, also for visualisation purposes. Since the future format for the Sentinels will be Zarr anyway, operating expensive format-conversion workflows does not seem to be the most efficient approach.
To this end, we created a multi-resolution Zarr format, which is basically a pyramid of datasets persisted as Zarrs, plus some metadata. A similar approach is now also being discussed for the upcoming GeoZarr spec. Our main use case for this is also visualisation, e.g. here.
The viewer is also available as a Jupyter plugin; see the minimalistic example below for opening a large (~1 TiB per band) multi-level dataset from S3 and visualising it interactively with low memory requirements, typically < 2 GiB.
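A minimal sketch of what that looks like (the bucket and dataset names are placeholders; check the xcube docs for the exact store and viewer options):

```python
from xcube.core.store import new_data_store
from xcube.webapi.viewer import Viewer

# Open a multi-resolution (".levels") dataset straight from S3
store = new_data_store("s3", root="my-bucket")
ml_ds = store.open_data("my-cube.levels")

# Interactive map in the notebook; only the tiles in view are loaded
viewer = Viewer()
viewer.add_dataset(ml_ds)
viewer.show()
```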
You could likely do the same thing, but with a GDAL warped VRT for the reprojection.
I regularly use VRTs inside rioxarray.open_rasterio, and it supports lazy evaluation pretty well. I have even used gdalwarp to reproject many disjoint tiles into a common CRS, then gdalbuildvrt to combine them, with the resampling all done in a lazy fashion with Dask, and finally opened the result with rioxarray with chunks.
My most common use case is going from COG to Zarr, though. I don't actively reproject/resample Zarr this way.
For the case of many, many COGs to Zarr, I have found these GDAL env vars are important to tune:
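For example, options along these lines (an illustrative set with illustrative values, not a definitive list; tune per workload). In Python you can scope them with rasterio's Env instead of exporting shell variables:

```python
import rioxarray
from rasterio.env import Env

with Env(
    GDAL_DISABLE_READDIR_ON_OPEN="EMPTY_DIR",  # skip directory listings on open
    GDAL_CACHEMAX=512,                          # raster block cache, in MB
    CPL_VSIL_CURL_CACHE_SIZE=200_000_000,       # HTTP read cache, in bytes
    GDAL_HTTP_MAX_RETRY=5,                      # retry transient HTTP errors
    GDAL_HTTP_RETRY_DELAY=1,                    # seconds between retries
    VSI_CACHE=True,                             # cache /vsi* file reads
):
    da = rioxarray.open_rasterio("mosaic.vrt", chunks=True)
```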
I added a gist here to show how I first build a mosaic of all the TIFFs in a given CRS (think one UTM zone) using gdalbuildvrt, then reproject/resample with gdalwarp for each CRS, and finally merge everything into one large mosaic with gdalbuildvrt and open it with rioxarray.open_rasterio using Dask.
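In outline, the workflow is something like this sketch (file names are placeholders; gdal.BuildVRT and gdal.Warp are the Python equivalents of the CLI tools — see the gist for the real thing):

```python
from osgeo import gdal
import rioxarray

gdal.UseExceptions()

# 1. Mosaic all tifs that share a CRS (e.g. one UTM zone) into a VRT
gdal.BuildVRT("utm32.vrt", ["a_32632.tif", "b_32632.tif"])

# 2. Reproject/resample each per-CRS mosaic to the common CRS, still as a VRT
gdal.Warp("utm32_4326.vrt", "utm32.vrt", format="VRT",
          dstSRS="EPSG:4326", resampleAlg="bilinear")

# 3. Merge the per-CRS results into one large mosaic
gdal.BuildVRT("mosaic_4326.vrt", ["utm32_4326.vrt", "utm33_4326.vrt"])

# 4. Open lazily with Dask-backed chunks; nothing is warped until compute
da = rioxarray.open_rasterio("mosaic_4326.vrt", chunks={"x": 4096, "y": 4096})
```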