Tables, (x)arrays, and rasters

Hi all,

I’ve been playing around with some ideas for working with geospatial raster data, and I’d be curious to hear any feedback you have.

The core question: what’s the best data model for raster data in Python? Unsurprisingly, I think the answer is “it depends”. Let’s work through a concrete task and evaluate the various options. Suppose we want to compute NDVI for all the scenes captured by Landsat 8 over a couple of hours. (Full notebook at Jupyter Notebook Viewer)

We’ll use the Planetary Computer’s STAC API to find the scenes, and geopandas to plot the bounding boxes of each scene on a map.

import geopandas
import planetary_computer
import pystac
import pystac_client

catalog = pystac_client.Client.open(
    "https://planetarycomputer.microsoft.com/api/stac/v1"
)

items = catalog.search(
    collections=["landsat-8-c2-l2"],
    datetime="2021-07-01T08:00:00Z/2021-07-01T10:00:00Z"
).get_all_items()


items = [planetary_computer.sign(item) for item in items]
items = pystac.ItemCollection(items, clone_items=False)
df = geopandas.GeoDataFrame.from_features(items.to_dict(), crs="epsg:4326")

# https://github.com/geopandas/geopandas/issues/1208
df["id"] = [x.id for x in items]
m = df[["geometry", "id", "datetime"]].explore()
m

This type of data can be represented as an xarray DataArray. But it’s not the most efficient way to store the data:

To build this (time, band, y, x) DataArray, we end up with many missing values. If you think about the datacube literally, with some “volume” of observed pixels, we have a lot of empty space. In this case, the DataArray takes 426 TiB to store.
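
For reference, here’s a rough sketch of how one might lazily materialize that dense cube with stackstac, just to inspect its nominal size. The scenes span several UTM zones, so a common CRS and resolution have to be chosen; the values below are only illustrative, and the 426 TiB figure comes from the original notebook.

import stackstac

# Build a lazy (time, band, y, x) cube from the signed items. Most of this
# cube is empty space, since each scene covers only a small footprint of the
# overall bounding box.
cube = stackstac.stack(
    items,
    assets=["SR_B2", "SR_B3", "SR_B4", "SR_B5"],
    epsg=4326,           # illustrative common CRS
    resolution=0.00027,  # roughly 30 m in degrees at the equator
)
print(f"{cube.nbytes / 2**40:.1f} TiB")  # nominal size, nothing is loaded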

Even if we collapse the time dimension, which probably makes sense for this dataset, we still have empty space in the “corners”.

This helps a lot, getting us down to 3.6 TiB (the curse of dimensionality works in reverse too!). But it’s still not as efficient as possible, because of that empty space in the corners. Actually loading all these rasters into, say, a list would take much less memory: 218.57 GiB.

So for this dataset (I cannot emphasize that enough; this example was deliberately designed to look bad for a data cube) it doesn’t make sense to model the data as a DataArray.

data model                   memory (TiB)
xarray (time, band, y, x)    426
xarray (band, y, x)          3.6
list                         0.2 (218.57 GiB)
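
And a minimal sketch of the “list” model: open one band per scene lazily with rioxarray and keep the rasters in a plain Python list. The asset key and chunk sizes are illustrative; the 218.57 GiB figure above is from the original notebook.

import rioxarray

# One lazily-opened DataArray per scene; nothing forces them onto a common grid.
rasters = [
    rioxarray.open_rasterio(item.assets["SR_B4"].href, chunks={"x": 2048, "y": 2048})
    for item in items
]
total_gib = sum(raster.nbytes for raster in rasters) / 2**30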

In the Python data science space, we’re fortunate to have both xarray and pandas (and geopandas and dask.dataframe). So we have choices! pandas provides an extension array interface to store non-NumPy arrays inside a pandas DataFrame. What would it look like to store STAC items (and more interestingly, rasters stored as DataArrays) inside a pandas DataFrame? Here’s a prototype:

Let’s load those STAC items into an “ItemArray”, which can be put in a pandas.Series:

>>> import pandas as pd
>>> import rasterpandas

>>> sa = rasterpandas.ItemArray(items)
>>> series = pd.Series(sa, name="stac_items")
>>> series
0      <Item id=LC08_L2SP_191047_20210701_02_T1>
1      <Item id=LC08_L2SP_191046_20210701_02_T1>
2      <Item id=LC08_L2SP_191045_20210701_02_T1>
3      <Item id=LC08_L2SP_191044_20210701_02_T1>
4      <Item id=LC08_L2SP_191043_20210701_02_T1>
                         ...                    
112    <Item id=LC08_L2SP_175010_20210701_02_T1>
113    <Item id=LC08_L2SP_175006_20210701_02_T1>
114    <Item id=LC08_L2SP_175005_20210701_02_T2>
115    <Item id=LC08_L2SP_175001_20210701_02_T1>
116    <Item id=LC08_L2SR_159248_20210701_02_T2>
Name: stac_items, Length: 117, dtype: stac

Notice the stac dtype. Pandas lets you register custom accessors, so we could have a stac accessor that knows how to work with STAC metadata, for example adding a column of rasters for each asset in the collection.

rdf = series[:10].stac.with_rasters(assets=["SR_B2", "SR_B3", "SR_B4", "SR_B5"])
rdf

Now things are getting more interesting! The repr is a bit messy, but this new DataFrame has a column for each of the blue, green, red, and nir bands. Each of those is a column of rasters. And each raster is just an xarray.DataArray!

>>> ndvi = rdf.raster.ndvi("SR_B4", "SR_B5")
>>> type(ndvi)
pandas.core.series.Series

Each element of the output is again a raster (stored in a DataArray).
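
For concreteness, here’s a rough sketch of how such a raster accessor could be wired up with pandas’ accessor-registration machinery. The class and its ndvi method are hypothetical, assume every column element is a float-valued xarray.DataArray, and will differ from the actual rasterpandas implementation.

import pandas as pd
from xrspatial.multispectral import ndvi as xr_ndvi


@pd.api.extensions.register_dataframe_accessor("raster")
class RasterAccessor:
    def __init__(self, df):
        self._df = df

    def ndvi(self, red_column, nir_column):
        # Apply xrspatial's NDVI to each (red, nir) pair of DataArrays and
        # return a new Series of DataArrays. The raw bands may need casting
        # to float first.
        values = [
            xr_ndvi(nir, red)
            for red, nir in zip(self._df[red_column], self._df[nir_column])
        ]
        return pd.Series(values, index=self._df.index, name="ndvi")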

So that’s the prototype (source code is at GitHub - TomAugspurger/rasterpandas). It’s a fun demonstration of pandas’ extension arrays. Is it useful? Maybe. The Spark / Scala world has found this model useful, as implemented by rasterframes. In Python, xarray lowers the need for something like this, and we could also just use lists of DataArrays, but the DataFrame concept is handy enough that I think this might still be useful.

Anyway, if you all have thoughts on whether something like this seems useful, or if you have workflows for similar kinds of data then I’d love to hear your feedback.


Tom, just a quick note to say that I really think you’re onto something here. You have articulated a central problem that is hard to tackle with our current stack. (It’s also closely related to the L2 satellite data processing challenge identified by @cgentemann in Get Involved in Pangeo: Entry Points for New Contributors - #7 by cgentemann.) Your solution elegantly leverages the existing tools (Pandas, Xarray, and STAC) in a really neat way. I would definitely keep pursuing this.

I think the next step would be to define a workflow, say creating a cloud-free mosaic of a region or something like that, and see how this approach maps onto it. I would talk to Chelle, @RichardScottOZ, and others working with L2 data to identify some candidate workflows.

Yes, this is really interesting Tom.

[and I have some Databricks compute to burn, so might be interesting to run a rasterframes version too]

Something I have been considering recently is seismic data stored around the country (2D lines are basically what is of mining interest): similarly, what is the best structure/format for analysis there?


This is indeed an interesting approach! It seems quite complementary to @TomNicholas’s DataTree project, with all the advantages here that a pandas DataFrame may have over a more complex, hierarchical structure.

Regarding your use case (i.e., compute NDVI for all the scenes captured by Landsat 8), an alternative approach would be to directly produce a “dense” DataArray object that doesn’t really represent a DataCube but instead stacks all raster pixels along one spatial dimension. We would lose the rasters’ implicit topology, but I don’t think that’s a big deal for pixelwise operations, which IMO would be easier to perform with this approach than with two nested data structures (DataFrame + DataArray). Moreover, I think we’ll be able to do powerful things by combining both Xarray accessors and Xarray custom indexes (ready soon, hopefully!), e.g., provide some convenient API for selecting and re-transforming the data back into a more conventional raster format.

Leveraging Xarray with non-DataCube-friendly data is (sort of) what we’ve experimented with in Xoak, using ocean model data on curvilinear grids or unstructured meshes as use cases. (Note: Xoak will eventually provide custom, Xarray-compatible indexes instead of implementing everything in an accessor.)


Thanks for the link to DataTree. I’ll see if I can make a comparison.

Yeah, that’s potentially worth exploring too. In planetary-computer-deep-dives/Geospatial Machine Learning.ipynb at main · TomAugspurger/planetary-computer-deep-dives · GitHub we do a bit of that to concatenate many scenes together (that also stacks them, which doesn’t have to happen). I think that works very well when you have identically-sized “items”, since you can keep the y / x labels around in a non-dimension coordinate. I’m less sure how it would work if you have variable-sized items (somewhat common if you’re just working with “raw” satellite scenes).

In theory you could have an additional coordinate along the stacked spatial dimension that holds raster (or STAC item) id values, but getting back the individual rasters as 2D DataArrays would come at the cost of many unstack operations. E.g., with this toy example of two concatenated (and stacked) rasters of variable size:

import numpy as np
import pandas as pd
import xarray as xr

stacked_raster_id_coords = [
    [0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1],
    [0., 0., 1., 1., 3., 3., 3., 4., 4., 4., 5., 5., 5.],
    [0., 1., 0., 1., 3., 4., 5., 3., 4., 5., 3., 4., 5.]
]
midx = pd.MultiIndex.from_arrays(
    stacked_raster_id_coords, names=("raster_id", "y", "x")
)
data = xr.DataArray(
    np.random.uniform(size=(4+9, 3)),
    coords={"space": midx, "band": [1, 2, 3]}, dims=("space", "band")
)

print(data)
<xarray.DataArray (space: 13, band: 3)>
array([[0.14189417, 0.58131929, 0.35225833],
       [0.97512305, 0.36157414, 0.64397666],
       [0.32119297, 0.93799738, 0.37879202],
       [0.44954318, 0.69639306, 0.65414954],
       [0.21128336, 0.18527528, 0.478475  ],
       [0.96264641, 0.89752556, 0.17590214],
       [0.534152  , 0.64762276, 0.91105744],
       [0.28872876, 0.64881346, 0.57804076],
       [0.855744  , 0.83493434, 0.01243857],
       [0.99203143, 0.38273195, 0.72474495],
       [0.78296827, 0.30311979, 0.41697918],
       [0.92853232, 0.46370617, 0.6316989 ],
       [0.86078516, 0.22592683, 0.35725647]])
Coordinates:
  * space      (space) MultiIndex
  - raster_id  (space) int64 0 0 0 0 1 1 1 1 1 1 1 1 1
  - y          (space) float64 0.0 0.0 1.0 1.0 3.0 3.0 ... 4.0 4.0 5.0 5.0 5.0
  - x          (space) float64 0.0 1.0 0.0 1.0 3.0 4.0 ... 4.0 5.0 3.0 4.0 5.0
  * band       (band) int64 1 2 3

Get one raster as a 2D DataArray:

raster0 = data.sel(raster_id=0).unstack()
print(raster0)
<xarray.DataArray (band: 3, y: 2, x: 2)>
array([[[0.39386449, 0.45238512],
        [0.15979935, 0.40926863]],

       [[0.07529218, 0.96071543],
        [0.15065281, 0.50611628]],

       [[0.11695715, 0.34876804],
        [0.39314546, 0.89090393]]])
Coordinates:
  * band     (band) int64 1 2 3
  * y        (y) float64 0.0 1.0
  * x        (x) float64 0.0 1.0

Get all rasters as a list of 2D DataArrays:

raster_list = [
    da.reset_index("raster_id", drop=True).unstack()
    for _, da in data.groupby("raster_id")
]

raster_list
[<xarray.DataArray (band: 3, y: 2, x: 2)>
 array([[[0.39386449, 0.45238512],
         [0.15979935, 0.40926863]],
 
        [[0.07529218, 0.96071543],
         [0.15065281, 0.50611628]],
 
        [[0.11695715, 0.34876804],
         [0.39314546, 0.89090393]]])
 Coordinates:
   * band     (band) int64 1 2 3
   * y        (y) float64 0.0 1.0
   * x        (x) float64 0.0 1.0,
 <xarray.DataArray (band: 3, y: 3, x: 3)>
 array([[[0.42742665, 0.30037112, 0.41910118],
         [0.27372398, 0.21368011, 0.32303203],
         [0.61795123, 0.15125551, 0.14177116]],
 
        [[0.58677836, 0.17936755, 0.78520273],
         [0.20886481, 0.81725117, 0.5277651 ],
         [0.53852069, 0.56333296, 0.30122524]],
 
        [[0.03892721, 0.16013487, 0.75304758],
         [0.52124076, 0.14832651, 0.95243819],
         [0.53577372, 0.35633979, 0.59934574]]])
 Coordinates:
   * band     (band) int64 1 2 3
   * y        (y) float64 3.0 4.0 5.0
   * x        (x) float64 3.0 4.0 5.0]

Unstack may be expensive and not really necessary, but it is still required if we want to reuse existing functions like xrspatial.multispectral.ndvi that only accept 2D DataArrays. That said, with stacked rasters we can compute the same NDVI for the whole collection of rasters by just writing

red = data.sel(band=1)
nir = data.sel(band=2)
ndvi = (nir - red) / (nir + red)

which is much more expressive (and perhaps more efficient too, depending on how data chunks are affected by stacking the stac items?).

So in summary, to answer the core question in your top comment, I think both data models (one DataArray with the stacked space dimension vs. one DataFrame of DataArrays) have their pros and cons. The question is whether we really care about the individual 2D rasters, or only about the individual samples / pixels (i.e., the collection of rasters is viewed as just a data acquisition detail). Sometimes both matter, and/or we’d need to switch between one model and the other in order to reuse some existing libraries (e.g., xrspatial vs. scikit-learn). So it would be great if we could have some convenient API for that! I guess the conversion would also be pretty cheap, as it only requires manipulating metadata.
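
For example, here is a rough sketch of one direction of that conversion (collection of 2D rasters → stacked DataArray), under the assumption that every raster has (band, y, x) dims with identical bands; the function name and the raster_id level just mirror the toy example above and aren’t a proposed API.

import numpy as np
import xarray as xr


def stack_rasters(rasters):
    stacked = []
    for i, da in enumerate(rasters):
        # Flatten each raster's (y, x) dims into a single "space" dim and tag
        # every pixel with the raster it came from.
        s = da.stack(space=("y", "x")).reset_index("space")
        s = s.assign_coords(raster_id=("space", np.full(s.sizes["space"], i)))
        stacked.append(s)
    out = xr.concat(stacked, dim="space")
    # Rebuild the (raster_id, y, x) MultiIndex so .sel(raster_id=...) works.
    return out.set_index(space=["raster_id", "y", "x"])

With the toy rasters above, stack_rasters(raster_list) should give back (up to dimension order) the same structure as data.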

Note that using a pandas.MultiIndex for the concatenated/stacked rasters may not be optimal here, but with the Xarray indexes refactor I guess it will be possible to create a custom Xarray index tailored to this case (which could also hold other per-raster metadata that would be cumbersome to concatenate along extra DataArray coordinates). Or maybe it will be enough to just set one pd.Index instance per coordinate along the space dimension (not possible today, but it will be allowed after the indexes refactor)…

I think another relevant problem (which partially intersects with the one above) is the lack of a standard workflow for processing geolocated irregular time series (e.g. weather sensors, irregular satellite overpasses).
A naive approach would be to loop over each single time series, read it into a pd.Series or xr.DataArray, place it into a list/dict, and apply functions by looping over that container.
We could instead develop an approach similar to rasterpandas (say, tspandas) enabling a pandas/geopandas DataFrame to have columns storing time series (as pd.Series/DataFrame or xr.DataArray/Dataset) and methods to apply distributed custom computations to all time series (e.g. homogenization, statistical analysis, regularization / temporal interpolation, etc.).
Each tspandas row would correspond to a geolocated station/observation, the standard DataFrame columns would encode “static” sensor/time-series attributes, and the time series columns would represent the measured variables, which could eventually be condensed into a single time series column if they share the same timesteps.
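
To make the idea concrete, here’s a toy sketch of what such a nested layout could look like, using a geopandas GeoDataFrame whose object columns hold plain pandas Series; all names and values below are made up.

import geopandas
import numpy as np
import pandas as pd
from shapely.geometry import Point


def make_ts(n):
    # A fake daily time series, standing in for real sensor measurements.
    index = pd.date_range("2021-07-01", periods=n)
    return pd.Series(np.random.uniform(size=n), index=index)


stations = geopandas.GeoDataFrame(
    {
        "station_id": ["A", "B"],
        "elevation_m": [120.0, 987.0],  # "static" attributes
    },
    geometry=[Point(8.5, 47.4), Point(9.1, 46.2)],
    crs="epsg:4326",
)
# Nested time series columns: each cell is a whole pandas Series,
# possibly with a different length per row.
stations["temperature"] = pd.Series([make_ts(30), make_ts(60)], dtype=object)
stations["precipitation"] = pd.Series([make_ts(30), make_ts(60)], dtype=object)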

Quick note: almost all astronomy image processing looks like this, involving jitter (small, often non-integer offsets of a few pixels), mosaicking (large offsets, on the order of the image size), and often multiple detectors per image with dead space in between. Often over time and multiple passbands.

Just to clarify: would this use case be satisfied by a long-form table with columns like this?

| timestamp | geometry | value |

(or value might be multiple columns like precipitation, temperature, etc.)

Or is the nested / grouped nature of each timeseries important?

This proposed workflow sounds a bit like the R sits library’s data model: Setup | sits: Data Analysis and Machine Learning on Earth Observation Data Cubes with Satellite Image Time Series.

Hi Tom!

Thanks for pointing me to the R package sits … I was not aware of it. It is exactly the idea I was referring to. I have actually already implemented a personal R library (not packaged yet) to perform time series processing in this way, with the difference that the list columns contain zoo or xts time series objects, and it wraps not only data.frame but also sf spatial objects (the R equivalent of geopandas).

Based on my experience with that, when dealing with hundreds of thousands of time series at high temporal resolution, the long-table format

| ID | <ID_attributes> | geometry | timestep | <ts_values> |

is not efficient for multiple reasons:

  • The table starts to have millions/billions of rows

  • The need to always group over ID before applying computations is highly inefficient. One could implement some smart caching/indexing, but it would then need to be updated during filtering/subsetting calls.

  • Each time series is often accompanied by static attributes (number of timesteps, start/end time, flags, min/mean/max/std values, …) that do not change over time. The long-table format is memory inefficient in the first place because it requires duplicating this information over and over: even a small dataset of 1000 ten-year-long hourly time series with just 4 static attributes per time series ends up repeating roughly 24 × 365 × 10 × 4 × 1000 ≈ 350,000,000 cell values (see the quick check after this list).

  • Second, if the user wants to select/filter the table based on some of the static attributes, the operation has to scan millions of rows, which is also highly inefficient. An alternative would be to store the static attributes separately, but a nested structure solves the problem directly.

  • If the time series of a given ID do not share the same timesteps, the table starts to fill up with NaNs, which is also memory inefficient.

  • Also, if the user wants columns of time series at different temporal resolutions (e.g. daily/monthly/annual), a long-format table would also require some sort of temporal_resolution_ID column to avoid mixing resolutions in temporal groupby operations.

  • A long-format table constrains every time series to share the same data type (a single data type per column), while a nested solution is free of that constraint.

  • If each time series is saved on disk in separate folders/files, building the long table involves a lot of concatenation and a lot of checks (common data type, filling NaNs for missing columns, …).

  • Some consideration should also be given to the overhead of the split-apply-combine approach as the number of rows grows; R’s data.frame/tibble/data.table use a column-oriented storage format (similar to vaex), while pandas’ block-based layout can behave quite differently here.
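
As a quick sanity check of the duplication estimate in the third bullet above (hourly data, leap days ignored):

# 1000 hourly, ten-year time series with 4 static attributes each.
n_timesteps = 24 * 365 * 10               # ~87,600 timesteps per series
duplicated_cells = n_timesteps * 4 * 1000
print(duplicated_cells)                   # 350,400,000 repeated attribute cells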

Another important point to bring up is that while long-form tables are easy to save on disk efficiently (e.g. Parquet), the envisioned tspandas class with nested objects would require some thought about the design of a hierarchical disk storage format. One idea would be a nested structure like <ts_database_name> / / <Timeseries_Column_Names>.

The | ID | <ID_attributes> | geometry | table could be saved in Parquet, while each time series could be saved in Parquet/Zarr/NetCDF depending on the nested object type (pandas or xarray).
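
A rough sketch of that layout with plain pandas objects (the static table could equally be a GeoDataFrame written with geopandas’ Parquet support); all paths, names, and values below are illustrative:

from pathlib import Path

import numpy as np
import pandas as pd

root = Path("ts_database")
root.mkdir(exist_ok=True)

# Static table: one row per time series, saved once as Parquet.
static = pd.DataFrame({"station_id": ["A", "B"], "elevation_m": [120.0, 987.0]})
static.to_parquet(root / "static.parquet")

# One file per (time series, variable), nested under the series' folder.
for station_id in static["station_id"]:
    ts = pd.Series(
        np.random.uniform(size=48),
        index=pd.date_range("2021-07-01", periods=48),
        name="temperature",
    )
    station_dir = root / station_id
    station_dir.mkdir(exist_ok=True)
    ts.to_frame().to_parquet(station_dir / "temperature.parquet")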