Hi all,
I’ve been playing around with some ideas for working with geospatial raster data, and I’d be curious to hear any feedback you have.
The core question: what’s the best data model for raster data in Python? Unsurprisingly, I think the answer is “it depends”. Let’s work through a concrete task and evaluate the various options. Suppose we wanted to compute NDVI for all the scenes captured by Landsat 8 over a couple of hours. (Full notebook at Jupyter Notebook Viewer)
We’ll use the Planetary Computer’s STAC API to find the scenes, and geopandas to plot the bounding boxes of each scene on a map.
import geopandas
import planetary_computer
import pystac
import pystac_client

# Search the Planetary Computer STAC API for Landsat 8 scenes
# captured in a two-hour window, and sign the results for access.
catalog = pystac_client.Client.open(
    "https://planetarycomputer.microsoft.com/api/stac/v1"
)
items = catalog.search(
    collections=["landsat-8-c2-l2"],
    datetime="2021-07-01T08:00:00Z/2021-07-01T10:00:00Z",
).get_all_items()
items = [planetary_computer.sign(item) for item in items]
items = pystac.ItemCollection(items, clone_items=False)

# Put the scene footprints in a GeoDataFrame and plot them on an interactive map.
df = geopandas.GeoDataFrame.from_features(items.to_dict(), crs="epsg:4326")
# https://github.com/geopandas/geopandas/issues/1208
df["id"] = [x.id for x in items]
m = df[["geometry", "id", "datetime"]].explore()
m
This type of data can be represented as an xarray DataArray. But it’s not the most efficient way to store the data:
To build this (time, band, y, x) DataArray, we end up with many missing values. If you think about the datacube literally, with some “volume” of observed pixels, we have a lot of empty space. In this case, the DataArray takes 426 TiB to store.
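For a rough sense of where that number comes from, here’s a minimal sketch that lazily assembles the full cube with stackstac (my assumption for how to build it; the notebook may do it differently) and inspects its theoretical size without reading any pixels:

import stackstac

# Lazily assemble the (time, band, y, x) cube on a single common grid;
# nothing is downloaded, so we can just ask how big it *would* be.
cube = stackstac.stack(
    items,
    assets=["SR_B2", "SR_B3", "SR_B4", "SR_B5"],
    epsg=32636,     # hypothetical shared CRS; any single grid forces padding
    resolution=30,
)
print(cube.dims)                           # ('time', 'band', 'y', 'x')
print(f"{cube.nbytes / 2 ** 40:.0f} TiB")  # on the order of the 426 TiB above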
Even if we collapse the time dimension, which probably makes sense for this dataset, we still have empty space in the “corners”.
This helps a lot, getting us down to 3.6 TiB (the curse of dimensionality works in reverse too!). But it’s still not as efficient as possible, because of that empty space in the corners. Actually loading all these rasters into, say, a list would take much less memory: 218.57 GiB.
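The list figure is just the sum of each scene’s own grid, with no padding out to a shared extent. A back-of-the-envelope version (assuming each item exposes proj:shape in its properties, and that we’d load four float64 bands per scene):

import numpy as np

# Sum each scene's own pixel grid rather than padding to one common grid.
nbytes = sum(
    4 * np.prod(item.properties["proj:shape"]) * 8  # 4 bands × rows × cols × 8 bytes
    for item in items
)
print(f"{nbytes / 2 ** 30:.2f} GiB")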
So for this dataset (I cannot emphasize that enough; this example was deliberately designed to look bad for a data cube) it doesn’t make sense to model the data as a DataArray.
| data model | memory (TiB) |
| --- | --- |
| xarray (time, band, y, x) | 426 |
| xarray (band, y, x) | 3.6 |
| list | 0.2 |
In the Python data science space, we’re fortunate to have both xarray and pandas (and geopandas and dask.dataframe). So we have choices! pandas provides an extension array interface to store non-NumPy arrays inside a pandas DataFrame. What would it look like to store STAC items (and more interestingly, rasters stored as DataArrays) inside a pandas DataFrame? Here’s a prototype:
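(The plumbing underneath is pandas’ ExtensionDtype / ExtensionArray interface. The real implementation is in the repo linked at the end; the skeleton below is only a simplified sketch of the kind of boilerplate involved.)

import numpy as np
from pandas.api.extensions import ExtensionArray, ExtensionDtype, register_extension_dtype


@register_extension_dtype
class StacDtype(ExtensionDtype):
    name = "stac"   # what shows up as the Series dtype
    type = object   # the scalar type held by the array

    @classmethod
    def construct_array_type(cls):
        return ItemArray


class ItemArray(ExtensionArray):
    """Store pystac.Item objects in an object-dtype NumPy array."""

    def __init__(self, items):
        self._data = np.asarray(list(items), dtype=object)

    @classmethod
    def _from_sequence(cls, scalars, *, dtype=None, copy=False):
        return cls(scalars)

    @classmethod
    def _concat_same_type(cls, to_concat):
        return cls(np.concatenate([arr._data for arr in to_concat]))

    def __getitem__(self, key):
        if isinstance(key, (int, np.integer)):
            return self._data[key]          # scalar: a single pystac.Item
        return type(self)(self._data[key])  # slice / mask: a new ItemArray

    def __len__(self):
        return len(self._data)

    @property
    def dtype(self):
        return StacDtype()

    @property
    def nbytes(self):
        return self._data.nbytes

    def isna(self):
        return np.array([item is None for item in self._data], dtype=bool)

    def take(self, indices, *, allow_fill=False, fill_value=None):
        from pandas.api.extensions import take

        data = take(self._data, indices, allow_fill=allow_fill, fill_value=fill_value)
        return type(self)(data)

    def copy(self):
        return type(self)(self._data.copy())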
Let’s load those STAC items into an “ItemArray”, which can be put in a pandas.Series:
>>> import pandas as pd
>>> import rasterpandas
>>> sa = rasterpandas.ItemArray(items)
>>> series = pd.Series(sa, name="stac_items")
>>> series
0 <Item id=LC08_L2SP_191047_20210701_02_T1>
1 <Item id=LC08_L2SP_191046_20210701_02_T1>
2 <Item id=LC08_L2SP_191045_20210701_02_T1>
3 <Item id=LC08_L2SP_191044_20210701_02_T1>
4 <Item id=LC08_L2SP_191043_20210701_02_T1>
...
112 <Item id=LC08_L2SP_175010_20210701_02_T1>
113 <Item id=LC08_L2SP_175006_20210701_02_T1>
114 <Item id=LC08_L2SP_175005_20210701_02_T2>
115 <Item id=LC08_L2SP_175001_20210701_02_T1>
116 <Item id=LC08_L2SR_159248_20210701_02_T2>
Name: stac_items, Length: 117, dtype: stac
Notice the stac dtype. pandas lets you register accessors. For example, we could have a stac accessor that knows how to do stuff with STAC metadata, like adding a column for each asset in the collection.
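Registering such an accessor is just a decorator. Here’s a rough sketch of what it could look like, with rioxarray as my assumed way of opening each asset; the real with_rasters in the repo differs (signing, lazy loading, wrapping the columns in an extension array, and so on):

import pandas as pd
import rioxarray


@pd.api.extensions.register_series_accessor("stac")
class StacAccessor:
    def __init__(self, series):
        self._series = series

    def with_rasters(self, assets):
        """Return a DataFrame with one column of rasters (DataArrays) per asset."""
        columns = {"stac_items": self._series}
        for asset in assets:
            columns[asset] = [
                rioxarray.open_rasterio(item.assets[asset].href)
                for item in self._series
            ]
        return pd.DataFrame(columns, index=self._series.index)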
rdf = series[:10].stac.with_rasters(assets=["SR_B2", "SR_B3", "SR_B4", "SR_B5"])
rdf
Now things are getting more interesting! The repr is a bit messy, but this new DataFrame has a column for each of the blue, green, red, and nir bands. Each of those is a column of rasters. And each raster is just an xarray.DataArray!
>>> ndvi = rdf.raster.ndvi("SR_B4", "SR_B5")
>>> type(ndvi)
pandas.core.series.Series
Each element of the output is again a raster (stored in a DataArray).
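The raster accessor is registered the same way, just on DataFrames. A simplified sketch of what its ndvi method might look like (the actual implementation in the repo may differ):

import pandas as pd


@pd.api.extensions.register_dataframe_accessor("raster")
class RasterAccessor:
    def __init__(self, df):
        self._df = df

    def ndvi(self, red, nir):
        """Per-row NDVI = (nir - red) / (nir + red); each element is an xarray.DataArray."""
        return pd.Series(
            [
                (nir_da - red_da) / (nir_da + red_da)
                for red_da, nir_da in zip(self._df[red], self._df[nir])
            ],
            index=self._df.index,
            name="ndvi",
        )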
So that’s the prototype (source code is at GitHub - TomAugspurger/rasterpandas). It’s a fun demonstration of pandas’ extension arrays. Is it useful? Maybe. The Spark / Scala world has found that model useful, as implemented by rasterframes. We have xarray, which lowers the need for something like this. We could also use lists of DataArrays, but the DataFrame concept is pretty handy, so I think this might still be useful.
Anyway, if you all have thoughts on whether something like this seems useful, or if you have workflows for similar kinds of data, I’d love to hear your feedback.