Parallel creation of STAC catalogs

rabernat · March 14, 2024, 8:37pm

Just seconding what @sharkinsspatial said in simpler terms: this is a database problem. The minute you start thinking about how to update something consistently from multiple processes, you are effectively building a database. There are 70 years of research and theory on how to do this right.
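
To make that concrete, here is a minimal sketch (mine, not anything from the thread) of several workers adding items to one catalog when a database handles the coordination. Everything in it is an assumption for illustration: SQLite stands in for whatever database you would actually run, the `items` table and the `init_catalog`, `add_item`, `count_items`, and `build_catalog` helpers are hypothetical names, and threads stand in for separate processes to keep the example short. Each worker opens its own connection, as a separate process would.

```python
import json
import sqlite3
import tempfile
from concurrent.futures import ThreadPoolExecutor
from contextlib import closing
from pathlib import Path


def init_catalog(db_path):
    """Create the (hypothetical) items table if it does not exist yet."""
    with closing(sqlite3.connect(db_path)) as conn:
        with conn:  # commit on success, roll back on error
            conn.execute(
                "CREATE TABLE IF NOT EXISTS items "
                "(id TEXT PRIMARY KEY, body TEXT NOT NULL)"
            )


def add_item(db_path, item):
    """Insert or replace one item inside a single transaction."""
    # timeout lets a writer wait for the lock instead of failing right away
    with closing(sqlite3.connect(db_path, timeout=30)) as conn:
        with conn:
            conn.execute(
                "INSERT OR REPLACE INTO items (id, body) VALUES (?, ?)",
                (item["id"], json.dumps(item)),
            )


def count_items(db_path):
    """Return how many items the catalog currently holds."""
    with closing(sqlite3.connect(db_path)) as conn:
        return conn.execute("SELECT COUNT(*) FROM items").fetchone()[0]


def build_catalog(n_items=200, n_workers=8):
    """Write n_items from n_workers concurrent writers, then count them."""
    with tempfile.TemporaryDirectory() as tmp:
        db_path = str(Path(tmp) / "catalog.db")
        init_catalog(db_path)
        # Minimal item-like dicts, not complete STAC Items
        items = [{"id": f"item-{i}", "type": "Feature"} for i in range(n_items)]
        with ThreadPoolExecutor(max_workers=n_workers) as pool:
            # list() forces evaluation so any worker exception is raised here
            list(pool.map(lambda it: add_item(db_path, it), items))
        return count_items(db_path)
```

The point is that none of the workers has to know about the others. The database serializes the writes, and a failed insert rolls back instead of leaving a half-written catalog. Doing the same with a shared JSON file would mean inventing your own locking and recovery scheme.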

I think that many of the problems that our community faces in terms of data management boil down to the fact that we often use files where a database would be more appropriate.

From the perspective of maintaining a big data archive in the cloud, I’m basically ready to retract this blog post I wrote almost six years ago:

After several years of trying to put that into practice, I no longer think that object storage by itself is a silver bullet for building a cloud-native data repository. It simply doesn’t offer the sorts of transactional guarantees you need to do this right.

Sorry for going meta!
