Comparing odc.stac.load and stackstac for raster composite workflow

Just joining this thread, but I think this point is incredibly important. You need to know how coords are specified in the metadata! Nice job illustrating the fuzziness in the example!

I ran a similar test for stackstac.stac and it appears that this difference does not apply to xy_coords?

The four images above were built from code like this:

l30_stack = ss.stack(itemsL30, 
                    assets = ['B05'], 
                    epsg = epsg, # epsg (3979 if set to poly) # epsg (32614 if set to UTM)
                    resolution = 30, 
                    bounds = bbox, 
                    resampling = Resampling.bilinear, # bilinear # cubic # nearest
                    chunksize = (1, -1, -1, -1), 
                    xy_coords = 'topleft', # center # topleft
                    gdal_env = gdalEnv)

With just changes to xy_coords and epsg (keeping to local UTM or reprojecting to national proj, as I do in my work). All of these used bilinear resampling, but same patterns held true for cubic and nearest neighbor.

As you can see, xy_coords does not impact “blurriness” like anchor does. The only change is the values are shifted half a pixel between topleft and center. The blurriness in this case occurs when reprojecting from the local UTM grid to a national projection - as would be expected (and required for wide-area analysis when you need everything projected onto the same pixel grid).

So that begs the question, is topleft or center the “correct” pixel placement? I though it would be topleft for HLS, as noted in the User Guide and mentioned by @Andrew_Mullen. But it actually appears that center is most accurate to the real world:

In these comparisons, left is topleft and right is center. These are using the national projection, not UTM, but the patterns are the same.

So I guess the takeaway is that for stackstac that xy_coords = center is the proper way, at least for HLS. I guess this is because it integrates with xarray, which likes pixel center coordinates. And that it is not work the same as anchor does with odc.stac.load?

EDIT: I also played around some with snap_bounds = False in stackstac.stack and it does impact pixel values, but I can’t really tell if its preferred over the default (snap_bounds = True) if already reprojecting?

2 Likes

What anchor= does in odc-stac is NOT at all the same as stackstacs xy_coords=. Data that comes out of odc-stac always places location of the center of the OUTPUT pixel in the x/y coordinates. Specifying anchor= only changes how OUTPUT pixel grid is snapped to the world, once the OUTPUT pixel grid is decided, data from sources will be “re-projected” to the OUTPUT grid unless OUTPUT pixel grid is exactly aligned with the source data pixel grid, in which case these pixels are just pasted into the appropriate section of the output image (non-blurry output case, anchor on OUTPUT matches anchor used by the source grid).

In the case of stackstac, xy_coords= changes what point within the pixel is recorded in the x,y coordinate arrays, it does not change at all how pixels are loaded. And the ONLY VALID CHOICE is to use xy_coords=center, because that’s what the rest of xarray ecosystem assumes is stored in those coords. Note that the default is xy_coords=topleft, which is WRONG and confusing, and should be changed in stackstac lib. Hence no difference in the image whether it’s center or topleft @ZZMitch.

3 Likes

I’d like to share a benchmark report by @Kirill888 comparing odc-stac and stackstac, which highlights the superior performance of odc-stac in many common use cases:

Benchmark: odc-stac vs stackstac

Personally, I remain hesitant to adopt stackstac due to its limited test coverage as written in the stackstac’s limitations.

This discussion from two years ago still provides a good summary of the key differences between the two libraries.

That said, I’m still exploring how best to use either odc-stac or stackstac to access original Sentinel-2 SAFE products via the CDSE STAC API. If anyone has experience or insights on this, I’d appreciate your input!

2 Likes

After further investigation, I realized with we cannot use odc-stac and stackstac to access original Sentinel-2 SAFE products via the CDSE STAC API. We therefore implemented our own workflows in xcube-stac to use native data from the CDSE STAC API.

Example for Sentinel-2 via CDSE and Planetary Computer STAC API and for Sentinel-3 SYNERGY product via CDSE STAC API are here.

Thanks for sharing @konstntokas , on your github it states

During evaluation, we also considered odc-stac and stackstac for stacking STAC items.
However, both libraries rely on rasterio.open (GDAL drivers) to read data, which prevents accessing data directly from the CDSE S3 endpoint due to blocked AWS environments.

Are you sure it’s still the case? I haven’t worked with that catalog, but there has been a similar issue recently, that was resolved by configuring AWS_VIRTUAL_HOSTING=False in GDAL via rasterio, see here: Problems loading sentinel images from Copernicus Data Space Ecosystem AWS S3 · Issue #238 · opendatacube/odc-stac · GitHub

I’m curious as to what xcube-stac is using to access the data under the hood instead of GDAL?

It still uses GDAL JP2OpenJPEG raster driver.

I also tried to configure odc-stac with the credentials and endpoint-url and AWS_VIRTUAL_HOSTING=False, without success. The porblem is, that the lazy opening of the data is done in a context manager of rasterio.env.Env in odc-stac. If you want to access the data later during downloading and writing the data, it does not have any authentication. I resolved this in a rather ugly way by creating the rasterio.env.Env and keep the environment open. It is not nice, but I cannot imagine any other way. See: xcube-stac/xcube_stac/accessors/sen2.py at main · xcube-dev/xcube-stac · GitHub

maybe set (and unset) the gdal config via env vars

(to make that concrete: generic_gdal_config.md · GitHub , a nice way to manage context that isn’t bound to a single package, Can also be set via API with GDAL and with –config args in the cli. )

That’s not how odc-stac works, it is supposed to capture user configuration at the time of calling odc.stac.load then restore it, in a smart way, just before reading the data (that can be happening in a different IO thread or even on a different machine, when distributed Dask cluster is used).

Apparently people are having problems (sometimes) when configuring non-AWS buckets that require customizing endpoint_url and disabling virtual hosting, but surely this is fixable in either odc-{stac,loader} or in rasterio.session.AWSSession , or maybe in GDAL itself.

@Michael_Sumner going via os.environ works, but it won’t be thread-safe, so you better not be reading from different buckets at the same time from the same Python process.

1 Like

I would be happy to see a better solution. Keep me in the loop. Thanks!

odc.stac.configure_rio(
    aws={
        "endpoint_url": "eodata.dataspace.copernicus.eu",
        "aws_access_key_id": credentials.access_key,
        "aws_secret_access_key": credentials.secret_key,
        "region_name": "default",
    },
    verbose=True,
    cloud_defaults=True,
    AWS_VIRTUAL_HOSTING=False,
)

having this configuration was sufficient for me to access data from CDSE custom bucket.

Hi @kirill.kzb, I just tried your solution in a clean environment created with odc-stac 0.5.2 but I’m getting the following error. Any idea about this?

Edit: it was due to the missing boto3 package, solved by installing it. I confirm that your solution works fine!

File ~/micromamba/envs/odc_stac/lib/python3.12/site-packages/odc/loader/_rio.py:357, in configure_rio(cloud_defaults, verbose, aws, **params)
    355 _set_default_rio_config(cloud_defaults=cloud_defaults, aws=aws, **params)
    356 if verbose:
--> 357     with _CFG.env():
    358         _dump_rio_config()

File ~/micromamba/envs/odc_stac/lib/python3.12/site-packages/odc/loader/_rio.py:227, in _GlobalRioConfig.env(self)
    225 session: Optional[Session] = None
    226 if self._aws is not None:
--> 227     session = AWSSession(**self._aws)
    228 return rasterio.env.Env(_local.session(session), **self._gdal_opts)

File ~/micromamba/envs/odc_stac/lib/python3.12/site-packages/rasterio/session.py:283, in AWSSession.__init__(self, session, aws_unsigned, aws_access_key_id, aws_secret_access_key, aws_session_token, region_name, profile_name, endpoint_url, requester_pays)
    281     self._session = SimpleNamespace(region_name=region_name)
    282 else:
--> 283     self._session = boto3.Session(
    284         aws_access_key_id=aws_access_key_id,
    285         aws_secret_access_key=aws_secret_access_key,
    286         aws_session_token=aws_session_token,
    287         region_name=region_name,
    288         profile_name=profile_name)
    290 self.requester_pays = requester_pays
    291 self.unsigned = aws_unsigned

AttributeError: 'NoneType' object has no attribute 'Session'
1 Like

@clausmichele looks like you hit an issue in rasterio they should add an assert to check that boto3 is available when creating AWS session that requires authorization. Might want to raise an issue with rasterio

In my opinion the odc-stac dependencies should be updated to include the [s3] optional dependencies of rasterio, see odc-stac/pyproject.toml at 897816935257f6d5715a3961b3e469a01612d13a · opendatacube/odc-stac · GitHub

this would install boto3 when creating an environment based on odc-stac.

I guess this could directly go into odc-loader seeing the content of this issue: Cleanup odc-stac dependencies · Issue #228 · opendatacube/odc-stac · GitHub

No. It’s optional for a reason, it’s a terrible dependency to manage due to constant updates, and it’s only needed for non-public S3 access.

1 Like

Ok clear, thanks for the clarification!