Comparing odc.stac.load and stackstac for raster composite workflow

Overview

I am a n00b when it comes to traditional GIS raster processing. I’m trying to better understand the available tools in this space.

I’m doing this through what seems to be a very typical workflow: compute a monthly median RGB image from Sentinel 2 data over a 0.5 x 0.5 degree region.

In this post, I compare the same workflow using odc.stac and stackstac. The two give slightly different results, and I’d like to understand why a bit better. More generally, as a user, it’s confusing to have two packages with such similar functionality in our ecosystem. While choice is good for users, it’s also good when we can align on a single set of best practices.

Step 1: STAC Search

I’m finding the assets that I want using STAC. This is not controversial. This seems to be the way that everyone is accessing this sort of data.

from datetime import datetime, timedelta

import xarray as xr
import pystac_client
import stackstac
import odc.stac
from odc.geo.geobox import GeoBox
from dask.diagnostics import ProgressBar
from rasterio.enums import Resampling

dx = 3/3600  # 90m resolution
epsg = 4326
# random location in the middle of the Amazon
bounds = (-64.0, -9.0, -63.5, -8.5)
minx, miny, maxx, maxy = bounds
geom = {
    'type': 'Polygon',
    'coordinates': [[
       [minx, miny],
       [minx, maxy],
       [maxx, maxy],
       [maxx, miny],
       [minx, miny]
    ]]
}


year = 2020
month = 1

start_date = datetime(year, month, 1)
end_date = start_date + timedelta(days=31)
date_query = start_date.strftime("%Y-%m-%d") + "/" + end_date.strftime("%Y-%m-%d")


items = pystac_client.Client.open(
    "https://earth-search.aws.element84.com/v1"
).search(
    intersects=geom,
    collections=["sentinel-2-l2a"],
    datetime=date_query,
    limit=100,
).item_collection()

print(len(items), "scenes found")

This returns 52 scenes.

Step 2A: Load Data with odc.stac

odc.stac way

# define a geobox for my region
geobox = GeoBox.from_bbox(bounds, crs=f"epsg:{epsg}", resolution=dx)


# lazily combine items
ds_odc = odc.stac.load(
    items,
    bands=["scl", "red", "green", "blue"],
    chunks={'time': 5, 'x': 600, 'y': 600},
    geobox=geobox,
    resampling="bilinear"
)

# actually load it
with ProgressBar():
    ds_odc.load()

This took 68 seconds on my VM in AWS US-East-1.

# define a mask for valid pixels (non-cloud)

def is_valid_pixel(data):
    # keep only vegetated (4), not vegetated (5), water (6), and snow/ice (11) SCL classes
    return ((data > 3) & (data < 7)) | (data == 11)

ds_odc['valid'] = is_valid_pixel(ds_odc.scl)
ds_odc.valid.sum("time").plot()

# compute the masked median
rgb_median = (
    ds_odc[['red', 'green', 'blue']]
    .where(ds_odc.valid)
    .to_dataarray(dim="band")
    .transpose(..., "band")
    .median(dim="time")
)
(rgb_median / rgb_median.max() * 2).plot.imshow(rgb="band", figsize=(10, 8))

Step 2B: Load Data with stackstac

ds_ss = stackstac.stack(
    items,
    bounds=bounds,
    assets=["scl", "red", "green", "blue"],
    epsg=epsg,
    resolution=(dx, dx),
    chunksize=(5, -1, 600, 600),
    resampling=Resampling.bilinear
).reset_coords(drop=True)

A big difference in API design is that odc.stac returns an xarray.Dataset, with each band in a separate variable, while stackstac returns an xarray.DataArray, with all the bands stacked into a single array. stackstac also defaults to float64 for this data, while odc.stac uses much cheaper integer dtypes.
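For illustration, a quick way to see this difference with the two objects defined above:

# compare the container types and dtypes returned by the two loaders
print(type(ds_odc))   # xarray.Dataset
print({name: str(v.dtype) for name, v in ds_odc.data_vars.items()})
print(type(ds_ss))    # xarray.DataArray
print(ds_ss.dtype)    # float64 by default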

# actually load
with ProgressBar():
    ds_ss.load()

This took 66 seconds, nearly identical to odc.stac.

The mask looks nearly identical:

valid_ss = is_valid_pixel(ds_ss[:, 0])
valid_ss.sum("time").plot()


However, the final image looks a bit different:

rgb_median_ss = (
    ds_ss[:, 1:]
    .where(valid_ss)
    .transpose(..., "band")
    .median(dim="time")
)
(rgb_median_ss / rgb_median_ss.max() * 2).plot.imshow(rgb="band", figsize=(10, 8))

Comparison

Both packages took about the same time to process the data. But there were subtle differences in what they came up with.

They made different choices about what coordinates to use.

print(ds_ss.x.values[:5])
#  [-64.         -63.99916667 -63.99833333 -63.9975     -63.99666667]
print(ds_odc.longitude.values[:5])
# [-63.99958333 -63.99875    -63.99791667 -63.99708333 -63.99625   ]

And most importantly, the images that they produced look different, despite using ostensibly the same parameters.

I’d love if someone more knowledgeable could help me understand what’s happening here under the hood a bit better.

I’d also love it if someone could point out a way that I could be doing this better or more efficiently.

3 Likes

Hi @rabernat

I’m a bit stubborn and stick to the OpenDataCube way of working. You’ve explained most of the differences above, but there are a few things that I would do differently.

Firstly, with odc.stac.load there’s an option to pass in groupby="solar_day", which will merge adjacent scenes and de-duplicate overlapping pixels. This is pretty important, as it merges satellite passes into single layers, reducing the number of scenes the median is computed from. I’m honestly not sure how stackstac handles this.
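For reference, a sketch of the odc.stac.load call from the original post with that option added:

# same lazy load as above, but merging adjacent same-day scenes into one time slice
ds_odc = odc.stac.load(
    items,
    bands=["scl", "red", "green", "blue"],
    chunks={"time": 5, "x": 600, "y": 600},
    geobox=geobox,
    resampling="bilinear",
    groupby="solar_day",
)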

Second, with masking, we usually explicitly pick cloud and cloud shadow and then do a few operations to expand/contract that, to clean up single pixels and fill holes.

A simple example of this is the following code (untested):

from odc.algo import erase_bad, mask_cleanup

CLOUD_SHADOWS = 3
CLOUD_HIGH_PROBABILITY = 9

# build a bitmask with the bits for the classes we want to remove
bitmask = 0
for field in [CLOUD_SHADOWS, CLOUD_HIGH_PROBABILITY]:
    bitmask |= 1 << field

# flag pixels whose SCL value has any of those bits set
cloud_mask = (ds.scl.astype("uint16") & bitmask) != 0

# morphological cleanup: fill small holes, then remove isolated pixels
cloud_mask = mask_cleanup(cloud_mask, [("closing", 5), ("opening", 5)])

# set flagged pixels to nodata, keeping integer dtypes
final_data = erase_bad(ds, cloud_mask)

Note that the odc.algo tools there are important, as they retain the int values in the data, which makes a difference when you’re working with a lot of data. Using xarray.where converts data to float, which is nice because it means you can use nan for nodata, but sometimes that’s unwanted.

This process is fairly well tested and rigorous, and we used it to do the cloud-free mosaics over Africa, for example. (With variations on the closing/opening logic for tweaking the cloud mask.)

A worked example of the above workflow is available here using HLS data.

And again, I wanted to note that I think there’s a gap in the EO tooling here: we should have a reasonably simple, agreed way to mask clouds properly, and since we really have two main optical products (Landsat and Sentinel-2), we can make a lot of the parameters for this opinionated.

And I’m looking for support to write this tooling! Maybe I should start a cloud masking crowd funding mini-project!

5 Likes

A couple small comments first on the differences:

They made different choices about what coordinates to use.

This is a difference in the default for whether a coordinate refers to the center of a pixel or its top-left corner. In your call to stackstac, if you set xy_coords="center" you’ll get the same result for the x and y dimensions.

And most importantly, the images that they produced look different, despite using ostensibly the same parameters.

Similar story, I think, this time with the rescale parameter. The raster extension defines scale and offset values to rescale the data by when loading it. stackstac applies them by default; odc-stac doesn’t (yet?).

So I think with these settings, you’ll get comparable outputs (the lat / lon coordinates are equal according to np.testing.assert_allclose, but not exact):

ds_ss = stackstac.stack(
    items,
    bounds=bounds,
    assets=["scl", "red", "green", "blue"],
    epsg=epsg,
    resolution=(dx, dx),
    chunksize=(5, -1, 600, 600),
    resampling=Resampling.bilinear,
    xy_coords="center",
    rescale=False,
)

I haven’t double-checked the values yet and have to run, but hopefully they’re approximately similar (I think I see differences in nodata handling though). I’m also not sure which one is “right” for either of these.


clarification on difference between this library and stackstac? · Issue #54 · opendatacube/odc-stac · GitHub has a bit more discussion on the two projects. I think people are mostly agreed that having one project for this workload would be preferable.

2 Likes

Thank you both for this feedback! Really helpful stuff!

@Alex_Leith - really helpful to see that there is a clear “right way” to do the cloud masking in your community.

My opinion is that we have more than enough packages out there. I certainly don’t want another package which wraps odc or stackstac. (Maybe that’s what cubo is?) There are already enough layers to this stack (no pun intended) that it can be hard to figure out what’s going on. (For example, did I use GDAL here or not?)

What would be most valuable to me would be to simply document the best practices for this type of workflow in an easily discoverable place (ideally one open to contributions). Maybe Pythia Cookbooks?

:raised_hands: amazing!

This seems really important.

I just read the odc-stac issue. It’s a mess. I’m familiar with the scale / offset debate from the NetCDF world. We don’t seem to have a clear understanding of where in the stack this should be applied. (For example, in the Xarray + Zarr world, you can have scale and offset applied by Xarray via NetCDF metadata and Xarray’s encoding/decoding, or in Zarr as Zarr filters.)


In any case, as a new user, it’s confusing to me that there are two packages which do basically the same thing, but with subtle yet important differences in default behavior (e.g. rescale). It would be great if the community could come together and maintain one canonical package. I know that’s a big ask, given the amount of developer time that has been invested into each solution.

If I want to compute a median, it seems like I have to convert to float anyway. Is there a recommended algorithm or best practice here?

Oh, I’m talking about adding functionality to odc-geo, not a wrapper. More like some post-loading tools to help with masking.

If I want to compute a median, it seems like I have to convert to float anyway. Is there a recommended algorithm or best practice here?

You don’t need a float to compute a median. Not sure I understand what you mean?

When it comes to big processing, where you’re loading a whole bunch of data at 10 m, doubling your memory footprint makes a real difference. Also, it can be sensible to keep values unscaled; that way, once you’ve calculated your median, you’re storing the data the same way it was stored in the original data you loaded.

In any case, as a new user, it’s confusing to me that there are two packages which do basically the same thing.

Yeah, I agree, and it’s confusing, but not much we can do about it! And the small differences between them add up too.

Nice work @TomAugspurger on highlighting the offset difference, which is subtle and extremely important. The scaling of values is not really right or wrong, just different. Whether coords refer to the middle or the corner of the raster cell is another matter, though: one of the two defaults is probably wrong.

1 Like

Which one is right though? :flushed:

Haha, well… honestly I just trust Kirill’s code in odc-stac to do the right thing! But I should know more certainly…

odc-stac places pixel centers into X/Y coords because various xarray tools, like plotting or sampling, assume that coordinates contain the pixel center location. I think stackstac’s choice of xy_coords="edge" as the default is a mistake and a major gotcha for users.

4 Likes

@rabernat

odc.stac way

odc.stac.load accepts a large number of parameters for defining the area to load; geobox is just the most precise way to specify the output pixel grid. You can also use bbox=(x0, y0, x1, y1), or x=(x0, x1), y=(y0, y1), to limit the load area, and anchor=edge|center to control pixel snapping. Specifying resolution and crs is also optional; the default is to use data from the proj extension. One can also completely omit all spatial parameters and still get back the expected output, assuming the STAC metadata is complete, with the proj and raster extensions populated with correct data.
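For example, a sketch of the same load expressed with bbox, crs, and resolution instead of a pre-built GeoBox (parameters as described above):

# same area of interest, letting odc.stac.load construct the output pixel grid
ds_bbox = odc.stac.load(
    items,
    bands=["scl", "red", "green", "blue"],
    bbox=bounds,                  # (minx, miny, maxx, maxy)
    crs=f"epsg:{epsg}",
    resolution=dx,
    chunks={"time": 5, "x": 600, "y": 600},
)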

Using xarray.Dataset allows odc-stac to use a different dtype for each band, which is really important when there is a mask band present: you don’t want to convert it to float32 just because some other band is being loaded at the same time. With stackstac, you have to load the mask band separately to avoid that type conversion (see the sketch after the list below). Other advantages of xarray.Dataset:

  • Supports different number of dimensions across bands (RGB, hyperspectral)
  • Supports different projections across bands
  • Supports different resolutions across bands
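As a comparison, a sketch of loading the mask band on its own with stackstac to keep an integer dtype (dtype and fill_value are stackstac.stack parameters; the values here are illustrative):

# load only the SCL band, with an integer dtype and an integer fill value
scl_ss = stackstac.stack(
    items,
    bounds=bounds,
    assets=["scl"],
    epsg=epsg,
    resolution=(dx, dx),
    dtype="uint8",
    fill_value=0,
)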

@rabernat thanks for the simple example. I agree it would be nice to consolidate odc.stac and stackstac. I wanted to point out a couple other important high-level things for your approach:

I’m trying to better understand the available tools in this space. I’m doing this through what seems to be a very typical workflow: compute a monthly median RGB image from Sentinel 2 data over a 0.5 x 0.5 degree region.

  1. Over the last few years I’ve come across STAC collections out there that are either incomplete, have metadata inconsistencies compared to the GeoTIFFs they reference, or mix datasets that you might not actually want to combine in your mosaic (e.g. items processed with different, inconsistent software versions). So regardless of the software choice, carefully look at the metadata first. In your case, there are duplicate items processed with different software versions. It turns out the collection you’re using is being deprecated (see UPDATE: Sentinel-2 Collection-1 Archive Reprocessing, GitHub - Element84/earth-search: Earth Search information and issue tracking), and the recommendation is to use collections=["sentinel-2-c1-l2a"]. Right off the bat you’re down to 26 scenes instead of 52. Also note this reprocessed collection is in us-west-2.
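For reference, the original search with just the collection name swapped, as recommended above:

# same query against the reprocessed collection
items = pystac_client.Client.open(
    "https://earth-search.aws.element84.com/v1"
).search(
    intersects=geom,
    collections=["sentinel-2-c1-l2a"],
    datetime=date_query,
    limit=100,
).item_collection()

print(len(items), "scenes found")  # 26 scenes with the reprocessed collection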

This took 68 seconds on my VM in AWS US-East-1

  2. This seems slow. As a back-of-the-envelope estimate, you’re reading subsets of 26 items * 4 bands = 104 GeoTIFFs. I’d expect that to take ~100 ms per read, so more like 10 s serially. In the end, you are using GDAL with odc.stac, and for better or worse GDAL environment variables can have a big impact on read performance, so don’t forget to run this before loading (documented in the odc.stac examples):

odc.stac.configure_rio(cloud_defaults=True, aws={"aws_unsigned": True})

Then on a modest machine in us-west-2 you can get an order of magnitude speedup:

CPU times: user 2.7 s, sys: 3.47 s, total: 6.17 s
Wall time: 3.75 s

We don’t seem to have a clear understanding of where in the stack this should be applied.

  3. Agreed. I think it would be nice for odc.stac to have an opt-in to apply scale and offset if present in the metadata. (Note that GDAL, and therefore rasterio, do not have any keyword to do this automatically when reading; see Rasterio does not apply scale factor defined in GeoTIFF · Issue #2219 · rasterio/rasterio · GitHub.)

  4. One final big difference between odc.stac and stackstac for this particular workflow, where your output grid (90 m) is significantly coarser than the native resolution of the COGs (10 m), is that the software should read COG overviews for fewer reads and smaller data transfers. odc-stac does this currently, but stackstac does not (stackstac.stack doesn't read overviews · Issue #196 · gjoseph92/stackstac · GitHub).

3 Likes

On the scale/offset mention for GDAL (point 3): you can write a virtual raster that is unscaled, either to a file or to in-memory text, with the Translate() -unscale option.

Since GDAL 3.8 this is available via the vrt:// protocol syntax, so there's no need to open and write out a VRT first; just use

"vrt://{dsn}?unscale=true"

then open that as the source and stream. That works in the native bindings, in rasterio, or in any other downstream package (note that it implies use of -ot as well, to specify the floating point type of the output). There are also -a_offset and -a_scale args to virtually assign the requisite values on the source, so that subsequent opens can use unscale.
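A minimal sketch of what that could look like from rasterio (the URL is a placeholder; unscale and ot are the vrt:// options described above, GDAL >= 3.8):

import rasterio

# placeholder source: any GDAL-readable raster with scale/offset metadata
src_url = "https://example.com/B04.tif"

# apply the scale/offset metadata at open time via the vrt:// protocol
with rasterio.open(f"vrt://{src_url}?unscale=true&ot=Float32") as src:
    band = src.read(1)  # returned values have scale/offset applied, as float32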

1 Like

Alex can you explain what this code is doing? I thought it would apply the mask for all of the different values in field, but it’s not working as I expected. Specifically, the CLOUD_SHADOWS are not actually being masked. Here’s an example scene, plotted via

ds.scl.plot()
cloud_mask.plot.contour(levels=(0.5,), colors='w')

You can clearly see the cloud shadow regions (confirmed to have SCL value of 3) outside the mask contour.

Pretty sure this code was meant for Fmask and not for SCL, since SCL is not a bit-mask but categorical data. For SCL one would need something like ~scl.isin([3, 9]).
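For example, a hedged adaptation of Alex’s snippet above using a categorical test instead (here cloud_mask marks bad pixels, as erase_bad expects; untested, same caveats as the original):

from odc.algo import erase_bad, mask_cleanup

# SCL is categorical: 3 = cloud shadow, 9 = cloud high probability
cloud_mask = ds.scl.isin([3, 9])

# same cleanup as before: fill small holes, then drop isolated pixels
cloud_mask = mask_cleanup(cloud_mask, [("closing", 5), ("opening", 5)])

# set those pixels to nodata while keeping the integer dtypes
final_data = erase_bad(ds, cloud_mask)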

1 Like

Yup, sorry @rabernat. @kirill.kzb is right here.

1 Like

@rabernat can you please share what flavour of VM you ran this on? You said AWS US-East-1, but how many CPUs does it have? (And can we assume that all 52 scenes run at once, more or less? I assume xarray-dask will aggressively saturate, but I’m not sure what the default stance is there.) I’m working on some slightly different approaches and am interested in what the timing is like. I can’t run your exact code in that tenancy yet, but I’ll be able to do that soonish.

fwiw, I get 52 scenes from the search with sentinel-2-l2a but 26 with sentinel-2-c1-l2a, and while they seem equivalent I don’t know why the search halves the scene count (there are better cloud products in the latter).

We confirmed today that the sentinel-2-l2a search returns duplicates, apparently a known problem that is not present in sentinel-2-c1-l2a. It obviously affects runtime.

(reprocessed hrefs have a _1 vs _0, can provide details if needed)