Generate a raster from non-gridded lat-longs in rasterio / python?

Hi folks, I’m looking for advice on how best to work with un-gridded lat/long data as rasters in a python workflow. I’m a bit more familiar with the R stack, and while I’m usually fine in cases where both languages simply bind GDAL drivers, apparently this is something of a GDAL edge case (the GDAL ‘XYZ’ driver requires a gridded lat-long structure).

In R, this coercion from data.frame to raster is supported by any of the major libraries, e.g. like this:

df <- readr::read_csv("https://minio.carlboettiger.info/shared-data/gbif.csv")
r <- terra::rast(df, crs="epsg:4326")
terra::plot(r)

which creates a raster plot of the point counts.

I imagine there’s an analogue utility for this in python but don’t really know where to look. Any pointers on how to do this?

4 Likes

Thanks for the useful question!

I believe that the basic way people do this is with rasterio.features.rasterize: Vector Features — rasterio 1.4dev documentation

Geocube provides a higher-level API: Example - Rasterizing Point Data — geocube 0.4.0 documentation

1 Like

This is a situation where to be honest I’d probably use GDAL command line to create the grid (GDAL Grid Tutorial — GDAL documentation), then use rioxarray to work with the raster output.
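For reference, a hedged sketch of that command-line route. gdal_grid can’t read a bare CSV of points directly, so the usual pattern is to wrap it in a small OGR VRT file naming the x/y/z columns. The filenames, column names, grid extent, and the inverse-distance algorithm below are all assumptions based on the csv in the question, not something gdal_grid mandates.

```shell
# gbif.vrt -- OGR VRT wrapper telling GDAL which CSV columns are x/y/z
cat > gbif.vrt <<'EOF'
<OGRVRTDataSource>
  <OGRVRTLayer name="gbif">
    <SrcDataSource>gbif.csv</SrcDataSource>
    <GeometryType>wkbPoint</GeometryType>
    <LayerSRS>EPSG:4326</LayerSRS>
    <GeometryField encoding="PointFromColumns"
                   x="longitude" y="latitude" z="n"/>
  </OGRVRTLayer>
</OGRVRTDataSource>
EOF

# Inverse-distance interpolation onto a 0.25-degree global grid
gdal_grid -a invdist:power=2 -txe -180 180 -tye 90 -90 \
    -outsize 1440 720 -zfield n gbif.vrt gbif_grid.tif
```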

For better or worse there are almost certainly lots of packages that do this in Python each with somewhat different syntax and approaches. Here is a code example based on @rabernat’s suggestion that I’ve used in the past. It’s a lot more verbose than the R example, but I think explicitly specifying things like CRS and interpolation algorithm is good practice.

import geopandas as gpd
from geocube.api.core import make_geocube
from geocube.rasterize import rasterize_points_griddata # Different choices here

# Ordinary Dataframe
df = gpd.pd.read_csv('https://minio.carlboettiger.info/shared-data/gbif.csv')

# GeoDataFrame
gf = gpd.GeoDataFrame(df, 
                      geometry=gpd.points_from_xy(df.longitude, df.latitude), 
                      crs=4326)

# Rasterize Points to Gridded Xarray Dataset
geo_grid = make_geocube(
    vector_data=gf,
    measurements=['n'],
    resolution=0.25, # degrees
    rasterize_function=rasterize_points_griddata,
)

geo_grid.n.plot(cmap='gray_r');

1 Like

Wonderful, this makes a lot of sense!

I was just reading the gdal_grid docs and thinking the same thing – I agree the R methods would be better if they were more explicit about the interpolation and grid choice (there are other R packages that bind GDAL methods more directly to do just this, e.g. futility/gdal_grid.R at main · hypertidy/futility · GitHub). The gdal_grid syntax is still confusing me a bit, though; I’ll try to get that working for comparison.

The geocube module is new to me and looks like it’s about to become my best friend! The make_geocube method above does look more explicitly able to map to gdal_grid – out of curiosity, does this call out to gdal_grid internally or is this a pure-python rasterization?

2 Likes

GDAL is definitely behind the scenes for I/O and some other operations (via rasterio), but I don’t believe it uses GDALGridCreate(): geocube’s rasterize.py imports Rbf and griddata from scipy.interpolate.
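Since geocube ultimately leans on scipy here, a minimal pure-scipy sketch of the same idea may help (synthetic data; the variable names and 0.25-degree resolution are illustrative, not geocube’s internals verbatim):

```python
import numpy as np
from scipy.interpolate import griddata

# Synthetic scattered lon/lat points with a count value at each
rng = np.random.default_rng(42)
lon = rng.uniform(-180, 180, 500)
lat = rng.uniform(-90, 90, 500)
n = rng.poisson(50, 500).astype(float)

# Regular 0.25-degree target grid
grid_lon, grid_lat = np.meshgrid(
    np.arange(-180, 180, 0.25), np.arange(-90, 90, 0.25)
)

# Linear interpolation inside the convex hull of the points; NaN outside
grid = griddata((lon, lat), n, (grid_lon, grid_lat), method="linear")
print(grid.shape)  # (720, 1440)
```

Swapping method="linear" for "nearest" or "cubic" (or using Rbf) is where the different gridding assumptions show up.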

Hope it works well for you! Definitely open up discussions or issues on the geocube github as you explore it, the feedback from R/GDAL users is super valuable.

2 Likes

I am aware that you asked for a rasterio solution, but when I have to grid/interpolate irregularly geo-distributed data, I tend to use the verde library because it is really simple, straightforward, and accurate.

This example here describes how to go from un-gridded lat/long data to a (naturally gridded) xarray object.

There is also a cross-validation technique (in this example) to assess the quality of the interpolation through sklearn.

3 Likes

Verde won’t work with something large scale though.

1 Like

Just another option using PyGMT which integrates with pandas and xarray. There’s a suite of options to do gridding like nearneighbor, sphinterpolate, surface, triangulate and xyz2grd (depending on whether you want to grid things in Cartesian or Geographic space). I’ll just show the Delaunay Triangulation one as an example here.

  1. Turn longitude/latitude/n data into an xarray.DataArray grid
import pandas as pd
import pygmt
import xarray as xr

# Ordinary Dataframe
df = pd.read_csv("https://minio.carlboettiger.info/shared-data/gbif.csv")
print(df.head())
#    longitude  latitude    n
# 0       12.3      50.6  711
# 1      176.5     -38.3  159
# 2     -102.1      19.9  168
# 3      -88.1      21.2   77
# 4       43.1      53.7   11

# Turn XYZ data into 2D grid using Delaunay triangulation
dataarray: xr.DataArray = pygmt.triangulate.regular_grid(
    x=df.longitude,
    y=df.latitude,
    z=df.n,
    region="d",  # global region from longitude -180 to +180
    spacing="1d",  # 1 arc degree grid spacing
)
# <xarray.DataArray 'z' (lat: 181, lon: 361)>
# array([[        nan,         nan,         nan, ...,         nan,
#                 nan,         nan],
#        [  2.9224806,   2.5833333,   2.6      , ...,         nan,
#                 nan,   2.9224806],
#        [  2.8449612,   2.366853 ,   2.1666667, ...,         nan,
#                 nan,   2.8449612],
#        ...,
#        [        nan, 130.36208  , 113.53846  , ...,  21.555555 ,
#          17.713467 ,         nan],
#        [        nan, 160.12338  , 149.14935  , ...,   1.5172414,
#          10.296875 ,         nan],
#        [        nan,         nan,         nan, ...,         nan,
#                 nan,         nan]], dtype=float32)
# Coordinates:
#   * lon      (lon) float64 -180.0 -179.0 -178.0 -177.0 ... 178.0 179.0 180.0
#   * lat      (lat) float64 -90.0 -89.0 -88.0 -87.0 -86.0 ... 87.0 88.0 89.0 90.0
# Attributes:
#     long_name:     z
#     actual_range:  [1.000e+00 4.052e+03]
  2. Plot the gridded data with coastlines for extra context.
fig = pygmt.Figure()
fig.grdimage(grid=dataarray, frame=True, cmap="SCM/lajolla")
fig.coast(shorelines="thinnest")
fig.colorbar(frame=["x+ln", "y+lunits"])
fig.show()

4 Likes

Thanks all, I love seeing these different solutions and exploring the consequences of the different assumptions made by the different gridding algorithms. Really nice.

I’d still be keen to work out an example that calls out to gdal_grid for this, which seems to have good support for parallelization and on-disk computation for larger use cases.

Also just in the spirit of Pangeo open data, I wanted to share more detail on the original data source being used here, which comes from parquet snapshots of the GBIF database (available on the AWS open data registry). For ease-of-use, I shared the above csv file extract since the original database is a bit larger.

The notebook shows the approach I’ve taken to aggregate these ‘occurrence records’ of species – i.e. each row in the data is an observation of where a species occurred, along with lat/long coordinates. I’ve rather crudely just rounded and aggregated these in SQL to produce the example data above; I welcome feedback on that step as well if anyone has thoughts.
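For anyone who wants to poke at that aggregation step, here is a hedged pandas sketch of the same round-and-count idea (the raw GBIF column names and the 0.1-degree rounding below are assumptions for illustration; the notebook does this in SQL):

```python
import pandas as pd

# Toy stand-in for raw GBIF occurrence rows (column names assumed)
occurrences = pd.DataFrame({
    "decimallongitude": [12.34, 12.31, -102.14, 12.29],
    "decimallatitude": [50.61, 50.58, 19.94, 50.62],
})

# Round each occurrence to ~0.1 degree and count records per cell
counts = (
    occurrences
    .assign(longitude=occurrences.decimallongitude.round(1),
            latitude=occurrences.decimallatitude.round(1))
    .groupby(["longitude", "latitude"])
    .size()
    .rename("n")
    .reset_index()
)
print(counts)
```

The resulting longitude/latitude/n table has the same shape as the shared csv; the rounding resolution is the main knob worth scrutinizing.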

(for some reason Discourse is blocking me from posting links to GitHub, so here’s the example notebook link as plain text:)

https://github.com/cboettig/example-notebooks/blob/main/gbif-duckdb-aws.ipynb
2 Likes

Thanks, hadn’t seen the PyGMT xarray trick.

The gdalwarp library is the way to go; it may even auto-detect your geolocation arrays, or it can be configured to use them.

It’s exposed as gdal.Warp in the osgeo Python bindings, and also in rasterio.