Parallel creation of STAC catalogs

On the Pangeo call yesterday I mentioned that I had started looking into parallel assembly / writes of STAC catalogs, potentially reusing parts of pangeo-forge.

What I figured out so far is:

  • to assemble the nested structure, all pystac does is modify the item / catalog being added and then add a link to the parent catalog
  • to save a catalog, it basically iterates over the child catalogs, constructs the target URL, and recursively calls save; it does the same for items (minus the recursion); see the sketch after this list
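
For concreteness, a minimal sketch of that behaviour (pystac 1.x; the ids, geometry and output directory below are made up):

```python
# minimal sketch of the pystac behaviour described above; ids and paths are made up
import datetime

import pystac

catalog = pystac.Catalog(id="example", description="demo catalog")

item = pystac.Item(
    id="item-1",
    geometry={"type": "Point", "coordinates": [0.0, 0.0]},
    bbox=[0.0, 0.0, 0.0, 0.0],
    datetime=datetime.datetime(2023, 1, 1, tzinfo=datetime.timezone.utc),
    properties={},
)

# "assembling": add_item mutates the item (sets its root / parent links)
# and appends an "item" link to the catalog
catalog.add_item(item)
print([link.rel for link in catalog.links])  # ['root', 'item']

# "saving": normalize_hrefs constructs the target URLs; save() then walks the
# tree and writes each catalog / item as its own JSON file
catalog.normalize_hrefs("./stac-out")
catalog.save(catalog_type=pystac.CatalogType.SELF_CONTAINED)
```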

I might be missing something, but with that I think we can create catalogs from files
in a couple of steps:

  1. construct items from files
  2. construct the catalog(s)
  3. modify items to be consistent with the corresponding catalog
  4. aggregate item links and add them to the catalog
  5. adjust catalog links depending on the catalog type and write the individual objects (catalogs or items) to disk (independently, since they’re joined by links now); see the sketch after this list
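
A rough sketch of what those steps could look like with pystac. The input file names, the `item_from_file` helper, and the output layout are all made up, and the thread pools are just stand-ins for whatever parallel executor (dask, multiprocessing, …) ends up being used:

```python
# sketch of steps 1-5; file names, helper and layout are assumptions
import datetime
import json
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

import pystac

out = Path("catalog-root").resolve()
out.mkdir(exist_ok=True)


def item_from_file(path: str) -> pystac.Item:
    # hypothetical helper: in reality, derive id / geometry / bbox / datetime
    # from the file's contents
    return pystac.Item(
        id=Path(path).stem,
        geometry={"type": "Point", "coordinates": [0.0, 0.0]},
        bbox=[0.0, 0.0, 0.0, 0.0],
        datetime=datetime.datetime(2023, 1, 1, tzinfo=datetime.timezone.utc),
        properties={},
    )


# step 1: construct items from files (embarrassingly parallel)
paths = ["a.nc", "b.nc", "c.nc"]  # made-up input files
with ThreadPoolExecutor() as pool:
    items = list(pool.map(item_from_file, paths))

# step 2: construct the catalog
catalog = pystac.Catalog(id="example", description="demo")
catalog.set_self_href(str(out / "catalog.json"))
catalog.catalog_type = pystac.CatalogType.ABSOLUTE_PUBLISHED

# steps 3 + 4: make the items consistent with the catalog and aggregate the links
for item in items:
    item.set_self_href(str(out / item.id / f"{item.id}.json"))
    item.set_root(catalog)
    item.set_parent(catalog)
    catalog.add_link(pystac.Link(rel="item", target=item))


# step 5: serialize and write each object independently -- this is the part
# that could be distributed, since the JSON files only reference each other
# through hrefs
def write_object(obj: pystac.STACObject) -> None:
    dest = Path(obj.get_self_href())
    dest.parent.mkdir(parents=True, exist_ok=True)
    with open(dest, "w") as f:
        json.dump(obj.to_dict(), f)


with ThreadPoolExecutor() as pool:
    list(pool.map(write_object, [catalog, *items]))
```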

In the call, I believe @sharkinsspatial mentioned having thought about this but deciding not to write static catalogs and to write to a database instead. Does that sound about right?

Hi @keewis. Hopefully I can provide a bit of background for this based on our previous experience. I’m often working closely with data providers who are generating large numbers of STAC Items representing all of their data holdings. In the early days of STAC, the idea of a hierarchy of linked, static parent/child files in object storage that would be easily crawlable for indexing seemed fairly reasonable. But as the collections we were managing grew in both size and frequency of data additions and removals, it became clear that the static hierarchy model would be difficult to manage.

  1. Making atomic updates to the child links of a Collection file was a difficult concurrency problem. We implemented a queue-based approach to batch updates into reasonably sized groups rather than making individual writes, but this required additional infrastructure to manage.
  2. Removal of child links was inefficient. Link objects are stored in an array, so removing an Item link required iterating over the whole array (see the sketch after this list).
  3. Parsing large Collection files was inefficient. Some of our collections contain > 50M Items, so the Collection files grew quite large, and every read required parsing the entire file rather than streaming or paging.
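
To make the first two points concrete, here is a rough sketch of what removing a single Item link from a static Collection file involves (the file and item names are made up):

```python
# removing one Item link from a static Collection: read the whole document,
# scan the links array, rewrite the whole file (file / item names are made up)
import json

with open("collection.json") as f:        # 1. read the full Collection JSON
    collection = json.load(f)

collection["links"] = [                   # 2. O(n) scan over the links array
    link
    for link in collection["links"]
    if not (link["rel"] == "item" and link["href"].endswith("item-42.json"))
]

with open("collection.json", "w") as f:   # 3. rewrite the entire file; nothing
    json.dump(collection, f)              #    protects against a concurrent writer
```

On object storage there is no way to make that read-modify-write atomic without an external locking or queueing mechanism, which is the extra infrastructure I mentioned above.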

For most of our projects it was apparent that managing our Items in a data store that supports concurrency and atomicity would be an easier approach. In work with NASA, AWS and Planetary Computer we use stac-fastapi with a pgstac backend. We’ve tried to provide IaC templates with simple, intelligent defaults for deploying these packages via the eoAPI library that we maintain.

If there is an absolute requirement for static Collection and Item files, you could periodically export a representation from your data store. But in practice, for large archives of data, we’ve found that the STAC API experience is better than working with large numbers of static objects.

Out of curiosity, what is the data for which you are generating large STAC Collections, and where is it stored? For datasets managed in S3 we have a fairly basic infrastructure package called stactools-pipelines that we often use to manage creating new STAC Items and ingesting them into an API as data is published to a bucket. We haven’t been maintaining it closely, but let me know if it might be helpful for you.


It looks like pystac isn’t keen on supporting parallel writes (Write Items or collections in parallel · Issue #690 · stac-utils/pystac · GitHub), but if you’re still keen on writing static JSON catalogs you could also look into the Rust-based stac_async crate (maybe someone can write a Python wrapper around it?). I’ll let Sean chime in on the STAC API approach, though, which might make more sense depending on the scale you’re operating at (edit: oh yep, he posted just a minute before me :laughing:).

Just seconding what @sharkinsspatial said, in simpler terms: this is a database problem. The minute you start thinking about how to update something in a consistent way from multiple processes, you are effectively building a database. There are 70 years of research and theory on how to do this right.

I think that many of the problems that our community faces in terms of data management boil down to the fact that we often use files where a database would be more appropriate.

From the perspective of maintaining a big data archive in the cloud, I’m basically ready to retract this blog post I wrote almost six years ago:

After several years of trying to put that into practice, I no longer think that object storage by itself is a silver bullet for building a cloud-native data repository. It simply doesn’t offer the sorts of transactional guarantees you need to do this right.

Sorry for going meta! :laughing:


I think that many of the problems that our community faces in terms of data management boil down to the fact that we often use files where a database would be more appropriate.

Agreed.

That said, running a database and giving access to it (directly or through APIs) is much harder than putting files on blob storage, which is where services like Earthmover / Earth Search / Planetary Computer, or maybe tools like datasette, come in.
