Pangeo Showcase: "High-performance Python STAC tooling, backed by Rust" (Feb 5, 2025)

Title: “High-performance Python STAC tooling, backed by Rust”
Invited Speaker: Pete Gadomski (ORCID: 0000-0003-4877-7217)
When: Wednesday, February 05, 2025 at 12 PM EST
Where: Launch Meeting - Zoom
Abstract:

The SpatioTemporal Asset Catalog (STAC) specification is an open, community-developed specification that enables large-scale, distributed search and discovery of geospatial assets. Part of the success of STAC has been due to its community-built tooling, written mostly in Python and JavaScript, that was developed in tandem with the specification itself. As the specification and its usage have matured, we’ve seen the need to improve the software tooling ecosystem both through direct feature work on the existing libraries and by creating new libraries to cover new use cases. In this talk, I’ll walk through the existing Python STAC ecosystem and showcase new developments, including stac-geoparquet innovations, STAC API queries using DuckDB, and cloud-storage-agnostic access for STAC and its assets. Much of this new tooling is written in Rust and exposed with Python bindings, so I’ll talk a bit about how that works, the benefits, and the drawbacks. Finally, I’ll make some not-so-bold predictions on where I think the STAC ecosystem might be headed in the next few years, and talk a bit about the relationship between STAC and other open specifications that are heavily used in the scientific geospatial community, specifically Zarr.

Agenda:

  • ~15 minutes - Showcase presentation
  • 10 - 30 minutes - Discussion
  • 15 - 30 minutes - Community check-in

@maxrjones will this be recorded? Would love to join but I’ve got a conflict.

Yes, it will be recorded and uploaded to YouTube within a week.

@thwllms the link to the YouTube recording is now available at the top of the thread.

thanks! excellent and very interesting


Awesome presentation!

I’m particularly interested in the STAC-FastAPI-GeoParquet stack, hoping that it will eliminate the need for maintaining (and paying for…) a database. I’m curious how well it will scale (I’m thinking of spatially querying a collection of (tens of) billions of Items, which Postgres does well enough), and how it will work with appending or deleting Items.

When spatially querying one of our APIs that uses GeoParquet as its storage backend, the arbitrary query geometry is first intersected against a known, predefined grid that has been loaded into memory as a GeoDataFrame. The matching grid tile IDs are then used to query the GeoParquet store, and a second intersects operation is run, this time against the actual GeoParquet dataset.

This approach requires that every operation respects the grid, which can be a good thing and a bad thing.
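
For anyone following along, here is a minimal sketch of that two-stage pattern as I read it. It is not the actual implementation described above: it uses only the standard library, plain bounding boxes stand in for real geometries, an in-memory dict stands in for the GeoParquet store, and the 10-degree cell size and all names (`cells_for_bbox`, `build_index`, `search`, `ITEMS`) are made up for illustration.

```python
import math
from collections import defaultdict

CELL_DEG = 10.0  # hypothetical grid cell size, in degrees


def cells_for_bbox(bbox, cell=CELL_DEG):
    """Return the ids of every grid cell that a (minx, miny, maxx, maxy) bbox touches."""
    minx, miny, maxx, maxy = bbox
    xs = range(math.floor(minx / cell), math.floor(maxx / cell) + 1)
    ys = range(math.floor(miny / cell), math.floor(maxy / cell) + 1)
    return {(x, y) for x in xs for y in ys}


def bbox_intersects(a, b):
    """Exact bbox-vs-bbox test (a stand-in for a real geometry intersects)."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def build_index(items):
    """Write side: file every item under each grid cell it overlaps."""
    index = defaultdict(list)
    for item_id, bbox in items.items():
        for cell_id in cells_for_bbox(bbox):
            index[cell_id].append(item_id)
    return index


def search(query_bbox, items, index):
    """Read side: coarse lookup by grid cell, then an exact check on the candidates."""
    candidates = set()
    for cell_id in cells_for_bbox(query_bbox):  # stage 1: grid prefilter
        candidates.update(index.get(cell_id, []))
    # stage 2: exact intersects against the stored footprints
    return sorted(i for i in candidates if bbox_intersects(items[i], query_bbox))


# Toy footprints; "b" straddles four grid cells.
ITEMS = {
    "a": (1.0, 1.0, 2.0, 2.0),
    "b": (8.0, 8.0, 12.0, 12.0),
    "c": (55.0, 55.0, 56.0, 56.0),
    "d": (8.0, 1.0, 9.0, 2.0),
}
INDEX = build_index(ITEMS)
```

Calling `search((0.0, 0.0, 3.0, 3.0), ITEMS, INDEX)` pulls three candidates out of cell `(0, 0)` but keeps only `"a"` after the exact check, which is why the second intersects step is still needed. It also shows the trade-off: both the writer and the reader have to agree on the same grid.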

I’m curious how well it will scale (I’m thinking spatially querying a collection of (tens of) billions of Items – postgres does it well enough)

It might take a bit of work on both the writing side (to organize the data well) and the client side (to make sure the query exploits the data’s organization), but it should scale well to large datasets and to many concurrent readers.
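
To make the "organize the data well" point concrete, here is a toy sketch of why write order matters. It assumes the reader can skip a chunk of rows whenever that chunk's covering bbox misses the query, which is roughly what per-row-group statistics allow in Parquet. Everything here (`RowGroup`, `write_groups`, `query`, the sample items) is hypothetical and standard-library only. It is not how stac-geoparquet or any particular reader is implemented.

```python
from dataclasses import dataclass


@dataclass
class RowGroup:
    bbox: tuple  # covering bbox of all rows in the group: (minx, miny, maxx, maxy)
    rows: list   # list of (item_id, bbox) pairs


def intersects(a, b):
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def cover(bboxes):
    """Smallest bbox that contains every bbox in the list."""
    return (
        min(b[0] for b in bboxes),
        min(b[1] for b in bboxes),
        max(b[2] for b in bboxes),
        max(b[3] for b in bboxes),
    )


def write_groups(items, group_size, sort_key=None):
    """Writer: optionally sort rows spatially, then chunk them into groups with stats."""
    rows = sorted(items, key=sort_key) if sort_key else list(items)
    groups = []
    for start in range(0, len(rows), group_size):
        chunk = rows[start:start + group_size]
        groups.append(RowGroup(cover([bbox for _, bbox in chunk]), chunk))
    return groups


def query(groups, query_bbox):
    """Reader: skip groups whose covering bbox misses the query, scan the rest."""
    hits, scanned = [], 0
    for group in groups:
        if not intersects(group.bbox, query_bbox):
            continue  # pruned using the group's stats alone
        scanned += 1
        hits.extend(i for i, bbox in group.rows if intersects(bbox, query_bbox))
    return sorted(hits), scanned


# Items arrive interleaved: one far west, one far east, and so on.
ITEMS = []
for n in range(4):
    ITEMS.append((f"w{n}", (-100.0 + n, 0.0, -99.5 + n, 0.5)))
    ITEMS.append((f"e{n}", (100.0 + n, 0.0, 100.5 + n, 0.5)))

WEST = (-101.0, -1.0, -95.0, 1.0)
unsorted_groups = write_groups(ITEMS, group_size=2)
sorted_groups = write_groups(ITEMS, group_size=2, sort_key=lambda row: row[1][0])
```

Both layouts return the same four western items for `WEST`, but the arrival-order layout has to scan all four groups, while the layout sorted by `minx` scans only two. In practice you would sort on something like a space-filling-curve key rather than a single coordinate, but the idea is the same.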

appending or deleting Items.

You’ll probably want a table format like Delta Lake or Apache Iceberg, which build on top of Parquet.


Regarding the status and the structure of the stac-fastapi-geoparquet project, does it involve developing a new CRUD CoreClient (plus relevant TransactionsClient and Extensions) that plug into the existing stac-fastapi implementation similar to how pgstac does (stac-fastapi-pgstac)?

Is there an (un)official repository yet?


Yup!

Is there an (un)official repository yet?

Not yet, but I’ll be making it public within a month or so (working on it in the background right now). I’ll have an accompanying blog post comparing its performance w/ other backends, etc.
