Tables, (x)arrays, and rasters¶

Hi all,

I’ve been playing around with some ideas for working with geospatial raster data. I’d be curious for any feedback you have.

The core question: what’s the best data model for raster data in Python? Unsurprisingly, I think the answer is “it depends”. Let’s work through a concrete task and evaluate the various options. Suppose we wanted to compute NDVI for all the scenes captured by Landsat 8 over a couple of hours. (Full notebook at Jupyter Notebook Viewer)

We’ll use the Planetary Computer’s STAC API to find the scenes, and geopandas to plot the bounding boxes of each scene on a map.

import geopandas
import planetary_computer
import pystac
import pystac_client

catalog = pystac_client.Client.open(
    "https://planetarycomputer.microsoft.com/api/stac/v1"
)

items = catalog.search(
    collections=["landsat-8-c2-l2"],
    datetime="2021-07-01T08:00:00Z/2021-07-01T10:00:00Z"
).get_all_items()


items = [planetary_computer.sign(item) for item in items]
items = pystac.ItemCollection(items, clone_items=False)
df = geopandas.GeoDataFrame.from_features(items.to_dict(), crs="epsg:4326")

# https://github.com/geopandas/geopandas/issues/1208
df["id"] = [x.id for x in items]
m = df[["geometry", "id", "datetime"]].explore()
m

This type of data can be represented as an xarray DataArray. But it’s not the most efficient way to store the data:

To build this (time, band, y, x) DataArray, we end up with many missing values. If you think about the datacube literally, with some “volume” of observed pixels, we have a lot of empty space. In this case, the DataArray takes 426 TiB to store.

Even if we collapse the time dimension, which probably makes sense for this dataset, we still have empty space in the “corners”.

This helps a lot, getting us down to 3.6 TiB (the curse of dimensionality works in reverse too!). But it’s still not as efficient as possible, because of that empty space in the corners for this dataset. To actually load all these rasters into, say, a list would take much less memory: 218.57 GiB.
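As a back-of-envelope check (a sketch only; the actual per-scene sizes vary, I’m assuming roughly 7,800 × 7,700 pixels per scene with 4 bands of float64):

n_scenes, n_bands = 117, 4
bytes_per_scene = 7_800 * 7_700 * n_bands * 8        # float64
print(n_scenes * bytes_per_scene / 2**30)            # ~209 GiB, in line with 218.57 GiB

The dense cubes cost so much more because they also pay for the (mostly empty) union bounding box of all 117 scenes, and, in the 4D case, a separate time step per scene.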

So for this dataset (I cannot emphasize that enough; this example was deliberately designed to look bad for a data cube) it doesn’t make sense to model the data as a DataArray.

| data model | memory (TiB) |
| --- | --- |
| xarray (time, band, y, x) | 426 |
| xarray (band, y, x) | 3.6 |
| list | 0.2 |

In the Python data science space, we’re fortunate to have both xarray and pandas (and geopandas and dask.dataframe). So we have choices! pandas provides an extension array interface to store non-NumPy arrays inside a pandas DataFrame. What would it look like to store STAC items (and more interestingly, rasters stored as DataArrays) inside a pandas DataFrame? Here’s a prototype:
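For background, here’s a minimal sketch of the pandas extension-array machinery such a prototype builds on. This is not rasterpandas’ actual implementation; the ItemArray name is reused only for illustration:

import numpy as np
import pandas as pd
from pandas.api.extensions import (
    ExtensionArray, ExtensionDtype, register_extension_dtype, take,
)


@register_extension_dtype
class StacDtype(ExtensionDtype):
    name = "stac"      # what shows up as the Series dtype
    type = object      # scalar type held by the array
    na_value = None

    @classmethod
    def construct_array_type(cls):
        return ItemArray


class ItemArray(ExtensionArray):
    """Holds arbitrary objects (here: pystac.Item) in a 1-D object array."""

    def __init__(self, items):
        self._items = np.asarray(items, dtype=object)

    @classmethod
    def _from_sequence(cls, scalars, dtype=None, copy=False):
        return cls(list(scalars))

    def __getitem__(self, key):
        result = self._items[key]
        if isinstance(key, (int, np.integer)):
            return result          # scalar access returns the item itself
        return type(self)(result)  # slices/masks return a new ItemArray

    def __len__(self):
        return len(self._items)

    @property
    def dtype(self):
        return StacDtype()

    @property
    def nbytes(self):
        return self._items.nbytes

    def isna(self):
        return np.array([item is None for item in self._items], dtype=bool)

    def take(self, indices, allow_fill=False, fill_value=None):
        data = take(self._items, indices, allow_fill=allow_fill, fill_value=fill_value)
        return type(self)(data)

    def copy(self):
        return type(self)(self._items.copy())

    @classmethod
    def _concat_same_type(cls, to_concat):
        return cls(np.concatenate([arr._items for arr in to_concat]))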

Let’s load those STAC items into an “ItemArray”, which can be put in a pandas.Series:

>>> import pandas as pd
>>> import rasterpandas

>>> sa = rasterpandas.ItemArray(items)
>>> series = pd.Series(sa, name="stac_items")
>>> series
0      <Item id=LC08_L2SP_191047_20210701_02_T1>
1      <Item id=LC08_L2SP_191046_20210701_02_T1>
2      <Item id=LC08_L2SP_191045_20210701_02_T1>
3      <Item id=LC08_L2SP_191044_20210701_02_T1>
4      <Item id=LC08_L2SP_191043_20210701_02_T1>
                         ...                    
112    <Item id=LC08_L2SP_175010_20210701_02_T1>
113    <Item id=LC08_L2SP_175006_20210701_02_T1>
114    <Item id=LC08_L2SP_175005_20210701_02_T2>
115    <Item id=LC08_L2SP_175001_20210701_02_T1>
116    <Item id=LC08_L2SR_159248_20210701_02_T2>
Name: stac_items, Length: 117, dtype: stac

Notice the stac dtype. Pandas lets you register accessors. For example, we could have a stac accessor that knows how to work with STAC metadata, such as adding a column for each asset in the collection.
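Registering such an accessor is a one-decorator affair in pandas. A hedged sketch, not the actual rasterpandas code (rioxarray is just one plausible way to open each asset as a DataArray):

import pandas as pd
import rioxarray


@pd.api.extensions.register_series_accessor("stac")
class StacAccessor:
    def __init__(self, series):
        self._series = series

    def with_rasters(self, assets):
        # Build one column of rasters per requested asset key.
        data = {self._series.name: self._series}
        for key in assets:
            data[key] = [
                rioxarray.open_rasterio(item.assets[key].href)
                for item in self._series
            ]
        return pd.DataFrame(data, index=self._series.index)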

rdf = series[:10].stac.with_rasters(assets=["SR_B2", "SR_B3", "SR_B4", "SR_B5"])
rdf

Now things are getting more interesting! The repr is a bit messy, but this new DataFrame has a column for each of the blue, green, red, and nir bands. Each of those is a column of rasters. And each raster is just an xarray.DataArray!

>>> ndvi = rdf.raster.ndvi("SR_B4", "SR_B5")
>>> type(ndvi)
pandas.core.series.Series

Each element of the output is again a raster (stored in a DataArray).
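The raster accessor’s NDVI could plausibly be implemented as a row-wise map (again a sketch, not rasterpandas’ actual code), with the arithmetic broadcasting pixelwise inside each DataArray:

import pandas as pd


@pd.api.extensions.register_dataframe_accessor("raster")
class RasterAccessor:
    def __init__(self, df):
        self._df = df

    def ndvi(self, red_band, nir_band):
        # Each cell is an xarray.DataArray, so (nir - red) / (nir + red)
        # is evaluated pixelwise per raster.
        values = [
            (nir - red) / (nir + red)
            for red, nir in zip(self._df[red_band], self._df[nir_band])
        ]
        return pd.Series(values, index=self._df.index, name="ndvi")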

So that’s the prototype (source code is at GitHub - TomAugspurger/rasterpandas). It’s a fun demonstration of pandas’ extension arrays. Is it useful? Maybe. The Spark / Scala world has found that model useful, as implemented by rasterframes. We have xarray, which lowers the need for something like this. We could also use lists of DataArrays, but the DataFrame concept is pretty handy, so I think this might still be useful.

Anyway, if you all have thoughts on whether something like this seems useful, or if you have workflows for similar kinds of data then I’d love to hear your feedback.


Tom just a quick note to say that I really think you’re onto something here. You have articulated a central problem which is hard to tackle with our current stack. (Also closely related to the L2 satellite data processing challenge identified by @cgentemann in Get Involved in Pangeo: Entry Points for New Contributors - #7 by cgentemann). Your solution elegantly leverages the existing tools (Pandas, Xarray, and STAC) in a really neat way. I would definitely keep pursuing this.

I think the next step would be to define a workflow–say to create a cloud-free mosaic of a region or something like that–and see how this approach maps onto it. I would talk to Chelle, @RichardScottOZ, and others working with L2 data to identify some candidate workflows.


Yes, this is really interesting Tom.

[and I have some Databricks compute to burn, so might be interesting to run a rasterframes version too]

Something I have been considering recently is seismic data stored around the country (2D lines are basically what is of mining interest): similarly, what is the best structure/format for analysis there?


This is indeed an interesting approach! It seems quite complementary to @TomNicholas’s DataTree project, with all the advantages here that a pandas DataFrame may have over a more complex, hierarchical structure.

Regarding your use case (i.e., compute NDVI for all the scenes captured by Landsat 8), an alternative approach would be to directly produce a “dense” DataArray object that doesn’t really represent a DataCube but rather stacks all raster pixels along one spatial dimension. We would lose the rasters’ implicit topology, but I don’t think it’s a big deal for pixelwise operations, which IMO would be easier to perform with this approach than using two nested data structures (DataFrame + DataArray). Moreover, I think we’ll be able to do powerful things by combining both Xarray accessors and Xarray custom indexes (ready soon, hopefully!), e.g., provide some convenient API for selecting and re-transforming the data back into a more conventional raster format.

Leveraging Xarray with non-DataCube-friendly data is (sort of) what we’ve experimented with in Xoak, using ocean model data on curvilinear grids or unstructured meshes as use cases. (Note: Xoak will eventually provide custom, Xarray-compatible indexes instead of implementing everything in an accessor.)


Thanks for the link to DataTree. I’ll see if I can make a comparison.

Yeah, that’s potentially worth exploring too. In https://github.com/TomAugspurger/planetary-computer-deep-dives/blob/main/Geospatial%20Machine%20Learning.ipynb we do a bit of that to concatenate many scenes together (that also stacks them, which doesn’t have to happen). I think that works very well when you have identically-sized “items”, since you can keep the y / x labels around in a non-dimension coordinate. I’m less sure how it would work if you have variable-sized items (somewhat common if you’re just working with “raw” satellite scenes).

In theory you could have an additional coordinate along the stacked spatial dimension that represents raster (or stac item) id values, but yeah getting back the individual rasters as 2D DataArrays would be at the cost of many unstack operations. E.g., with this toy example of two concatenated (and stacked) rasters of variable size:

import numpy as np
import pandas as pd
import xarray as xr

stacked_raster_id_coords = [
    [0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1],
    [0., 0., 1., 1., 3., 3., 3., 4., 4., 4., 5., 5., 5.],
    [0., 1., 0., 1., 3., 4., 5., 3., 4., 5., 3., 4., 5.]
]
midx = pd.MultiIndex.from_arrays(
    stacked_raster_id_coords, names=("raster_id", "y", "x")
)
data = xr.DataArray(
    np.random.uniform(size=(4+9, 3)),
    coords={"space": midx, "band": [1, 2, 3]}, dims=("space", "band")
)

print(data)
<xarray.DataArray (space: 13, band: 3)>
array([[0.14189417, 0.58131929, 0.35225833],
       [0.97512305, 0.36157414, 0.64397666],
       [0.32119297, 0.93799738, 0.37879202],
       [0.44954318, 0.69639306, 0.65414954],
       [0.21128336, 0.18527528, 0.478475  ],
       [0.96264641, 0.89752556, 0.17590214],
       [0.534152  , 0.64762276, 0.91105744],
       [0.28872876, 0.64881346, 0.57804076],
       [0.855744  , 0.83493434, 0.01243857],
       [0.99203143, 0.38273195, 0.72474495],
       [0.78296827, 0.30311979, 0.41697918],
       [0.92853232, 0.46370617, 0.6316989 ],
       [0.86078516, 0.22592683, 0.35725647]])
Coordinates:
  * space      (space) MultiIndex
  - raster_id  (space) int64 0 0 0 0 1 1 1 1 1 1 1 1 1
  - y          (space) float64 0.0 0.0 1.0 1.0 3.0 3.0 ... 4.0 4.0 5.0 5.0 5.0
  - x          (space) float64 0.0 1.0 0.0 1.0 3.0 4.0 ... 4.0 5.0 3.0 4.0 5.0
  * band       (band) int64 1 2 3

Get one raster as a 2D DataArray:

raster0 = data.sel(raster_id=0).unstack()
print(raster0)
<xarray.DataArray (band: 3, y: 2, x: 2)>
array([[[0.39386449, 0.45238512],
        [0.15979935, 0.40926863]],

       [[0.07529218, 0.96071543],
        [0.15065281, 0.50611628]],

       [[0.11695715, 0.34876804],
        [0.39314546, 0.89090393]]])
Coordinates:
  * band     (band) int64 1 2 3
  * y        (y) float64 0.0 1.0
  * x        (x) float64 0.0 1.0

Get all rasters as a list of 2D DataArrays:

raster_list = [
    da.reset_index("raster_id", drop=True).unstack()
    for _, da in data.groupby("raster_id")
]

raster_list
[<xarray.DataArray (band: 3, y: 2, x: 2)>
 array([[[0.39386449, 0.45238512],
         [0.15979935, 0.40926863]],
 
        [[0.07529218, 0.96071543],
         [0.15065281, 0.50611628]],
 
        [[0.11695715, 0.34876804],
         [0.39314546, 0.89090393]]])
 Coordinates:
   * band     (band) int64 1 2 3
   * y        (y) float64 0.0 1.0
   * x        (x) float64 0.0 1.0,
 <xarray.DataArray (band: 3, y: 3, x: 3)>
 array([[[0.42742665, 0.30037112, 0.41910118],
         [0.27372398, 0.21368011, 0.32303203],
         [0.61795123, 0.15125551, 0.14177116]],
 
        [[0.58677836, 0.17936755, 0.78520273],
         [0.20886481, 0.81725117, 0.5277651 ],
         [0.53852069, 0.56333296, 0.30122524]],
 
        [[0.03892721, 0.16013487, 0.75304758],
         [0.52124076, 0.14832651, 0.95243819],
         [0.53577372, 0.35633979, 0.59934574]]])
 Coordinates:
   * band     (band) int64 1 2 3
   * y        (y) float64 3.0 4.0 5.0
   * x        (x) float64 3.0 4.0 5.0]

Unstack may be expensive and not really necessary, but it is still required if we want to reuse existing functions like xrspatial.multispectral.ndvi that only accept 2D DataArrays. That said, with stacked rasters we can compute the same NDVI for the whole collection of rasters by just writing

red = data.sel(band=1)
nir = data.sel(band=2)
ndvi = (nir - red) / (nir + red)

which is much more expressive (and perhaps more efficient too, depending on how data chunks are affected by stacking the stac items?).

So in summary, to answer the core question in your top comment, I think that both data models (one DataArray with the stacked space dimension vs. one DataFrame of DataArrays) have their pros and cons. The question is whether we really care about the individual 2D rasters, or whether we’re only interested in the individual samples / pixels (i.e., the collection of rasters is only viewed as a data acquisition detail). Sometimes both matter, and/or we’d need to switch between one model and the other if we want to reuse some existing libraries (e.g., xrspatial vs. scikit-learn). So it would be great if we could have some convenient API for that! I guess the conversion would also be pretty cheap, as it only involves metadata.
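Something like the following could serve as a starting point for that conversion API. A rough sketch, assuming each raster is a (band, y, x) DataArray with matching band coordinates; CRS/transform metadata is ignored here:

import pandas as pd
import xarray as xr


def to_stacked(rasters: pd.Series) -> xr.DataArray:
    # Concatenate all rasters along a single stacked "space" dimension,
    # tagging each pixel with the id of the raster it came from.
    pieces = []
    for rid, r in rasters.items():
        s = r.stack(space=("y", "x"))
        s = s.assign_coords(raster_id=("space", [rid] * s.sizes["space"]))
        pieces.append(s)
    return xr.concat(pieces, dim="space")


def to_rasters(stacked: xr.DataArray) -> pd.Series:
    # Invert to_stacked: recover one (band, y, x) raster per raster_id.
    return pd.Series({
        rid: grp.drop_vars("raster_id").unstack("space")
        for rid, grp in stacked.groupby("raster_id")
    })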

Note that using a pandas.MultiIndex for the concat/stacked rasters may not be optimal here, but with the Xarray indexes refactor I guess that it will be possible to create a custom Xarray index tailored to this case (which could also potentially hold other metadata for each individual raster that would be cumbersome to concatenate along extra DataArray coordinates). Or maybe it will be enough to just set multiple pd.Index instances for each coordinate along the space dimension (not possible today but it will be allowed after the indexes refactor)…

I think another significant problem (which partially intersects with the above one) is the lack of a standard workflow for processing geolocated irregular time series (e.g. weather sensors, irregular satellite overpasses).
A naive approach would be to loop over each single time series, read it into a pd.Series or xr.DataArray, place it into a list/dict, and apply functions by looping over it.
We could instead develop an approach similar to rasterpandas (a “tspandas”, say) enabling a pandas/geopandas DataFrame to have columns storing time series (as pd.Series/pd.DataFrame or xr.DataArray/xr.Dataset) and methods to apply distributed custom computations to all time series (e.g. homogenization, statistical analysis, regularization / temporal interpolation, …).
Each tspandas row would correspond to a geolocated observation, the standard dataframe columns would encode “static” sensor/timeseries attributes/features, and the multiple time series columns would represent multiple measured variables, which could eventually be condensed into a single time series column if they share the same timesteps.
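Even without a dedicated extension array, the nested layout being described can be mocked up with an object column. A sketch with hypothetical names; the geometry column is omitted for brevity:

import numpy as np
import pandas as pd

stations = pd.DataFrame({
    "station_id": [1, 2],
    "elevation_m": [210.0, 1350.0],   # "static" per-station attributes
    "temperature": [                   # one nested time series per station,
        pd.Series(                     # each with its own timesteps
            np.random.rand(24),
            index=pd.date_range("2021-07-01", periods=24, freq="H"),
        ),
        pd.Series(
            np.random.rand(12),
            index=pd.date_range("2021-07-01", periods=12, freq="2H"),
        ),
    ],
})

# Apply a computation to every nested series, without a giant long table:
daily_mean = stations["temperature"].map(lambda ts: ts.resample("D").mean())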

Quick note: almost all astronomy image processing looks like this, involving jitter (small offsets of a few/non-integer pixels) and mosaicing (large offsets or about the image size) and often multiple detectors in an image and dead-space. Often over time and multiple passbands.

Just to clarify: would this use case be satisfied by a long-form table with columns like this?

| timestamp | geometry | value |

(or value might be multiple columns like precipitation, temperature, etc.)

Or is the nested / grouped nature of each timeseries important?

This proposed workflow sounds a bit like the R sits library’s data model: Setup | sits: Satellite Image Time Series Analysis on Earth Observation Data Cubes.

Hi Tom!

Thanks for pointing me to the R package sits … I was not aware of it. It is exactly the idea that I was referring to. I actually already implemented a personal R library (not packaged yet) to perform time series processing in such a way, with the difference that the list columns contain zoo or xts time series objects, and it wraps not only data.frame but also sf spatial objects (the R equivalent of geopandas).

Based on my experience with that, when dealing with hundreds of thousands of time series at high temporal resolution, the long table format

|ID| <ID_attributes> | geometry | timestep | <ts_values>|

is not efficient for multiple reasons:

  • The table starts to have millions/billions of rows

  • The need to always group over ID before applying computations is highly inefficient. One could implement some smart caching/indexing, but it would then need to be updated during filtering/subset calls.

  • Each time series is often accompanied by static attributes (number of timesteps, start/end time, flags, min/mean/max/std values, …) that do not change over time. The long table format is highly memory inefficient in the first place because it requires duplicating such information over and over. Even a small dataset of 1000 ten-year-long time series at hourly resolution, with just 4 static attributes per time series, would repeat/waste 24 x 365 x 10 x 4 x 1000 = 350,400,000 cell values.

  • Second, if the user desires to select/filter the table based on some of the static attributes, the query has to scan millions of rows, which is clearly also highly inefficient. Obviously, an alternative would be to store the static attributes separately, but the nested structure solves the problem directly.

  • If the time series of a given ID do not share the same timesteps, the table starts to fill up with NaNs, which is also memory inefficient.

  • Also, if the user wants columns with time series at different temporal resolutions (e.g. daily/monthly/annual), then a long-format table would also require some sort of temporal_resolution_ID column to avoid messing up temporal groupby operations.

  • The use of a long-format table constrains each time series to have the same data type (a single data type per column), while a nested solution would remove that constraint.

  • If each time series is saved on disk in separate folders/files, the creation of the table involves a lot of concatenation and a lot of checks (common data type, filling NaN when a column is missing, …).

  • Some consideration should also be given to the overhead of the split-apply-combine approach as the number of rows increases. R data.frame/tibble/data.table have a column-oriented storage format (similar to vaex), while pandas stores data in 2D blocks grouped by dtype.

An important point to bring up is that while long-form tables are easy to save efficiently on disk (e.g. Parquet), the envisioned development of a tspandas class with nested objects would require some thought about the design of the hierarchical disk storage format. An idea would be a nested structure with <ts_database_name> / / <Timeseries_Column_Names>.

The | ID | <ID_attributes> | geometry | table could be saved in Parquet, while each time series could go in Parquet/Zarr/NetCDF depending on the nested object type (pandas or xarray).

Hi everyone, and thanks @TomAugspurger for this very interesting topic!

Just wanted to say that I would be really interested in a comparison of the trade-offs in this case between the pandas approach described by Tom and the Xarray-based but not-really-a-datacube approach mentioned above by @benbovy. What would be the pros and cons of each approach?

Just to be sure I understood correctly: advancing with the pandas approach would mean building accessors, or links between the pandas API and the DataArray or STAC item series?


I would also prefer to retain the dense representation, but with tricks to keep the data in memory as a sparse type.

Look at the following example with pandas multiindex & sparse dtype:

The dense data uses ~40 MB of memory, while the dense representation with sparse dtypes uses only ~0.5 kB of memory!

And while you can import dataframes into xarray with the sparse=True keyword, the size seems to be displayed inaccurately (both are reported as the same size?), and we cannot examine the data like we can with a pandas multiindex + sparse dtype:

Besides, a lot of operations are not available on sparse xarray data variables (e.g. if I wanted to group by price level for ffill & downsampling):

So, it would be nice if xarray adopted pandas’ approach of unstacking sparse data.

In the end, you could extract all the non-NaN values and write them to a sparse storage format, such as TileDB sparse arrays.
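For reference, the pandas pattern described above looks roughly like this (a sketch with a made-up index and sizes; the exact numbers will differ from the ~40 MB / ~0.5 kB figures mentioned):

import numpy as np
import pandas as pd

idx = pd.MultiIndex.from_product(
    [pd.date_range("2021-01-01", periods=1_000), np.arange(1_000)],
    names=("time", "price_level"),
)
observed = pd.Series(1.0, index=idx[:10])   # only 10 of 1,000,000 cells observed
dense = observed.reindex(idx)               # dense float64: ~8 MB of mostly NaN
sparse = dense.astype(pd.SparseDtype("float64", np.nan))  # keeps only the 10 values

print(dense.memory_usage(index=False))      # ~8,000,000 bytes
print(sparse.memory_usage(index=False))     # on the order of 100 bytes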

> Thanks for the link to DataTree. I’ll see if I can make a comparison.

Once I have the arrays variable from your notebook @TomAugspurger, this is what it looks like if I load your data into DataTree:

!pip install git+https://github.com/xarray-contrib/datatree.git

from datatree import DataTree

dt = DataTree.from_dict(arrays)
dt
DataTree('root', parent=None)
├── DataTree('LC08_L2SP_191047_20210701_02_T1')
│   Dimensions:                                     (time: 1, band: 4, x: 7532,
│                                                    y: 7692)
│   Coordinates: (12/30)
│     * time                                        (time) datetime64[ns] 2021-07...
│       id                                          (time) <U31 'LC08_L2SP_191047...
│     * band                                        (band) <U5 'SR_B2' ... 'SR_B5'
│     * x                                           (x) float64 6.981e+05 ... 9.2...
│     * y                                           (y) float64 2.195e+06 ... 1.9...
│       landsat:scene_id                            <U21 'LC81910472021182LGN00'
│       ...                                          ...
│       title                                       (band) <U27 'Blue Band (B2)' ...
│       proj:transform                              object {0.0, -30.0, 698085.0,...
│       common_name                                 (band) <U5 'blue' ... 'nir08'
│       center_wavelength                           (band) float64 0.48 ... 0.86
│       full_width_half_max                         (band) float64 0.06 ... 0.03
│       epsg                                        int64 32631
│   Data variables:
│       stackstac-00624d8333968b598c716b2f5bafc447  (time, band, y, x) float64 dask.array<chunksize=(1, 1, 7692, 7532), meta=np.ndarray>
├── DataTree('LC08_L2SP_191046_20210701_02_T1')
│   Dimensions:                                     (time: 1, band: 4, x: 7702,
│                                                    y: 7852)
│   Coordinates: (12/30)
│     * time                                        (time) datetime64[ns] 2021-07...
│       id                                          (time) <U31 'LC08_L2SP_191046...
│     * band                                        (band) <U5 'SR_B2' ... 'SR_B5'
│     * x                                           (x) float64 1.008e+05 ... 3.3...
│     * y                                           (y) float64 2.357e+06 ... 2.1...
│       landsat:scene_id                            <U21 'LC81910462021182LGN00'
│       ...                                          ...
│       title                                       (band) <U27 'Blue Band (B2)' ...
│       proj:transform                              object {0.0, -30.0, 2356815.0...
│       common_name                                 (band) <U5 'blue' ... 'nir08'
│       center_wavelength                           (band) float64 0.48 ... 0.86
│       full_width_half_max                         (band) float64 0.06 ... 0.03
│       epsg                                        int64 32632
│   Data variables:
│       stackstac-4b8a8ffb8c81eabc1282a4a440cad38d  (time, band, y, x) float64 dask.array<chunksize=(1, 1, 7852, 7702), meta=np.ndarray>
│   
AND SO ON....

I can then use datatree to map xarray operations over the different nodes:

dt.sel(band='SR_B2')
DataTree('root', parent=None)
├── DataTree('LC08_L2SP_191047_20210701_02_T1')
│   Dimensions:                                     (time: 1, x: 7532, y: 7692)
│   Coordinates: (12/30)
│     * time                                        (time) datetime64[ns] 2021-07...
│       id                                          (time) <U31 'LC08_L2SP_191047...
│       band                                        <U5 'SR_B2'
│     * x                                           (x) float64 6.981e+05 ... 9.2...
│     * y                                           (y) float64 2.195e+06 ... 1.9...
│       landsat:scene_id                            <U21 'LC81910472021182LGN00'
│       ...                                          ...
│       title                                       <U27 'Blue Band (B2)'
│       proj:transform                              object {0.0, -30.0, 698085.0,...
│       common_name                                 <U5 'blue'
│       center_wavelength                           float64 0.48
│       full_width_half_max                         float64 0.06
│       epsg                                        int64 32631
│   Data variables:
│       stackstac-00624d8333968b598c716b2f5bafc447  (time, y, x) float64 dask.array<chunksize=(1, 7692, 7532), meta=np.ndarray>
├── DataTree('LC08_L2SP_191046_20210701_02_T1')
│   Dimensions:                                     (time: 1, x: 7702, y: 7852)
│   Coordinates: (12/30)
│     * time                                        (time) datetime64[ns] 2021-07...
│       id                                          (time) <U31 'LC08_L2SP_191046...
│       band                                        <U5 'SR_B2'
│     * x                                           (x) float64 1.008e+05 ... 3.3...
│     * y                                           (y) float64 2.357e+06 ... 2.1...
│       landsat:scene_id                            <U21 'LC81910462021182LGN00'
│       ...                                          ...
│       title                                       <U27 'Blue Band (B2)'
│       proj:transform                              object {0.0, -30.0, 2356815.0...
│       common_name                                 <U5 'blue'
│       center_wavelength                           float64 0.48
│       full_width_half_max                         float64 0.06
│       epsg                                        int64 32632
│   Data variables:
│       stackstac-4b8a8ffb8c81eabc1282a4a440cad38d  (time, y, x) float64 dask.array<chunksize=(1, 7852, 7702), meta=np.ndarray>
│
AND SO ON...

I should be able to do the same for .where, .mean etc.

The datatree can be thought of as a dict-like container of datasets, which can be arbitrarily nested and will map standard xarray operations recursively over all nodes. Each node contains the contents of one xarray.Dataset, so can hold individual DataArrays if you wish, as well as node-specific metadata.

Is this of any use to you?

(@rabernat turns out it was easy to load it in)


I wonder if a dask-backed DataArray can be coerced into supporting “block sparse” operations. Internally, a dask array is basically a dictionary of index tuples -> delayed np.ndarray. If you know which blocks have no data at graph construction time (which is what odc-stac does, for example), then you could skip populating those. The problem is that this is not a supported mode: dask seems to assume that every block has some data, as far as I can tell. So you have to populate those slots with a cheaper-to-compute function that just returns np.full of the right shape and value. But that means your dask graph is still huge, and you are forced to process those empty slots as you perform further computations.

It would be cool if dask arrays supported block-level “nan”: unpopulated blocks would then be skipped when doing map_blocks and all operations built on it, and you could still turn empties into real data with compute. Slicing and rechunking logic might get even more complicated, but should still be possible I think. The only extra bit of information needed from the user is the “nan” value to use when the pixel data type is not float.

Does anyone know what Zarr or TileDB do? Do they support missing blocks?


Dask (distributed) does that optimization in a minor way, and it’s used in stackstac at stackstac/to_dask.py at 0bc305f4cb64e2b35cb6b997eff9342a5cfa6a1d · gjoseph92/stackstac · GitHub. NumPy arrays created using broadcast_to will be efficiently serialized and deserialized (I don’t think the same applies to np.full).

You do still pay the task overhead, and operations on the array tend to fully materialize it, so something like you describe might still be helpful.
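A hedged sketch of this “cheap empty block” idea; the chunk shape and the loader are stand-ins:

import dask
import dask.array as da
import numpy as np

chunk = (1024, 1024)

def load_block():
    return np.random.uniform(size=chunk)  # stand-in for a real raster read

def empty_block():
    # A read-only view over a single scalar: costs ~0 bytes to build
    # and serializes efficiently, unlike np.full.
    return np.broadcast_to(np.float64(np.nan), chunk)

d = dask.delayed
grid = [
    [da.from_delayed(d(load_block)(), shape=chunk, dtype="float64"),
     da.from_delayed(d(empty_block)(), shape=chunk, dtype="float64")],
    [da.from_delayed(d(empty_block)(), shape=chunk, dtype="float64"),
     da.from_delayed(d(load_block)(), shape=chunk, dtype="float64")],
]
arr = da.block(grid)  # 2048x2048 array; the empty blocks still occupy graph tasks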


Thanks for that; I’ll switch to np.broadcast_to. But that’s a relatively minor optimisation in our case, as we re-use the same empty block anyway, so it doesn’t get copied a lot or duplicated in RAM.

Somewhat related, so I thought I’ll share it here.

By working in a rotated space that matches the native orientation of the data, we can significantly reduce memory requirements. It is not a generic solution, but it can still be useful in a lot of cases. That’s why the next version of odc-stac (0.3.0rc1) supports arbitrary destination image planes: produced arrays do not have to be axis-aligned to CRS units.

This narrow and tall image, displayed below in QGIS, was produced on the Planetary Computer and is only about 4.7 GB counting all the overviews, at 30 m pixels in EPSG:3857. The uncompressed size is about 9 GB, so it fits into the memory of a single instance without any issues, leaving enough space to perform COG construction in RAM.

To read more see:


Some further discussion on the stackstac repository about using sparse backing to improve memory efficiency for sparse geospatial data: Consider optional support for sparse array backing · Discussion #178 · gjoseph92/stackstac · GitHub.


Linking some interesting discussion on the Arrow side: GeoArrow + Raster? · Issue #24 · geoarrow/geoarrow · GitHub. I haven’t read it in detail yet but it looks like the proposed storage model (i.e., “Array of rasters”) is very close to the “DataFrame of DataArrays” concept you have been experimenting with, @TomAugspurger.