On the pangeo call yesterday I mentioned that I started looking into parallel assembly / writes of STAC catalogs while potentially reusing parts of pangeo-forge.
What I figured out so far is:
- to assemble the nested structure, all pystac does is modify the item / catalog being added and then add a link to the parent catalog
- to save a catalog, it basically iterates over the child catalogs, constructs the target URL and recursively calls save, then does the same for the items (minus the recursive part); see the sketch just below
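To make that concrete, here is roughly what those two operations look like from the user side. This is a minimal sketch with made-up ids, geometry and paths:

```python
import datetime

import pystac

# add_item() only records the nesting through links: it sets the item's
# parent / root links and appends an "item" link on the catalog
catalog = pystac.Catalog(id="example", description="toy catalog")
item = pystac.Item(
    id="scene-1",
    geometry={"type": "Point", "coordinates": [0.0, 0.0]},
    bbox=[0.0, 0.0, 0.0, 0.0],
    datetime=datetime.datetime(2023, 1, 1, tzinfo=datetime.timezone.utc),
    properties={},
)
catalog.add_item(item)

# save() walks that link graph: it resolves an href for every child and item
# from the root href and writes each object to its own JSON file
catalog.normalize_hrefs("./example-catalog")
catalog.save(catalog_type=pystac.CatalogType.SELF_CONTAINED)
```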
I might be missing something, but with that I think we can create catalogs from files in a couple of steps (sketched after the list):
1. construct items from the files
2. construct the catalog(s)
3. modify the items to be consistent with the corresponding catalog
4. aggregate the item links and add them to the catalog
5. adjust the catalog links depending on the catalog type and write the individual objects (catalogs or items) to disk (independently, since they are only joined by links at that point)
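For reference, a rough (untested) sketch of how I imagine those steps fitting together with pystac; `item_from_file` is a hypothetical helper that would normally extract metadata from the file:

```python
import datetime
import pathlib

import pystac


def item_from_file(path: str) -> pystac.Item:
    # hypothetical helper: in practice this would extract geometry, bbox and
    # datetime from the file; here it just fabricates a placeholder item
    return pystac.Item(
        id=pathlib.Path(path).stem,
        geometry={"type": "Point", "coordinates": [0.0, 0.0]},
        bbox=[0.0, 0.0, 0.0, 0.0],
        datetime=datetime.datetime(2023, 1, 1, tzinfo=datetime.timezone.utc),
        properties={},
    )


def build_catalog(paths, root_href):
    # step 1: construct items from the files (the embarrassingly parallel
    # part, e.g. one task per file)
    items = [item_from_file(path) for path in paths]

    # step 2: construct the catalog
    catalog = pystac.Catalog(
        id="generated",
        description="catalog built from files",
        catalog_type=pystac.CatalogType.SELF_CONTAINED,
    )

    # steps 3 + 4: add_item() rewrites the item's parent / root links and
    # appends an "item" link to the catalog, which covers both "make the
    # items consistent" and "aggregate the item links"
    for item in items:
        catalog.add_item(item)

    # step 5, first half: compute the hrefs for the chosen layout
    catalog.normalize_hrefs(root_href)
    return catalog, items


catalog, items = build_catalog(["a.tif", "b.tif"], "./catalog")

# step 5, second half: write every object on its own; at this point the
# objects are only connected through links, so these writes could be spread
# across workers instead of going through the recursive catalog.save()
catalog.save_object(include_self_link=False)
for item in items:
    item.save_object(include_self_link=False)
```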
In the call, I believe @sharkinsspatial mentioned having thought about this but deciding not to write static catalogs and to write to a database instead. Does that sound about right?
Hi @keewis. Hopefully I can provide a bit of background for this based on our previous experiences. I'm often working closely with data providers who are generating large numbers of STAC Items representing all of their data holdings. In the early days of STAC, the idea of a hierarchy of linked, static parent/child files in object storage that would be easily crawlable for indexing seemed fairly reasonable. But as the collections we were managing grew both in size and in the frequency of data additions and removals, it became clear that the static hierarchy model would be difficult to manage.
- Making atomic updates to the child links of a Collection file was a difficult concurrency problem. We implemented a queue-based approach to batch updates into reasonably sized groups rather than making individual writes, but this required additional infrastructure to manage.
- Removal of child links was inefficient. Link objects are stored in an array, so the whole array had to be iterated whenever an Item link needed to be removed (illustrated below).
- Parsing large Collection files was inefficient. Some of our collections contain > 50M items, so their Collection files grew quite large and had to be parsed in full, with no support for streaming or paging.
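To illustrate the removal point, this is roughly what a single link removal boils down to when links live in a flat array. A schematic sketch in pystac terms, not our actual implementation:

```python
import pystac


def remove_item_link(collection: pystac.Collection, item_href: str) -> None:
    # removing one item means scanning every link, because links are a flat
    # list rather than being keyed by id or href; with tens of millions of
    # item links this is an O(n) rewrite per removal
    collection.links = [
        link
        for link in collection.links
        if not (link.rel == "item" and link.href == item_href)
    ]
```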
For most of our projects it was apparent that managing our Items in a data store that supported concurrency and atomicity would be an easier approach. In our work with NASA, AWS and Planetary Computer we use stac-fastapi with a pgstac backend. We've tried to provide IaC templates for deploying these packages with simple, intelligent defaults via the eoAPI library that we maintain.
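For context, with the transactions extension enabled on stac-fastapi, ingesting an Item is just an HTTP POST to the collection's items endpoint. A minimal sketch; the URL, collection id and Item are placeholders, and a real deployment would also need authentication:

```python
import requests

# placeholder API and collection; the collection must already exist
api_url = "https://stac.example.com"
collection_id = "my-collection"

item = {
    "type": "Feature",
    "stac_version": "1.0.0",
    "id": "scene-1",
    "collection": collection_id,
    "geometry": {"type": "Point", "coordinates": [0.0, 0.0]},
    "bbox": [0.0, 0.0, 0.0, 0.0],
    "properties": {"datetime": "2023-01-01T00:00:00Z"},
    "assets": {},
    "links": [],
}

# POST /collections/{collection_id}/items is the transactions-extension
# endpoint for creating a new Item
resp = requests.post(
    f"{api_url}/collections/{collection_id}/items",
    json=item,
    timeout=30,
)
resp.raise_for_status()
```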
If there is an absolute requirement for static Collection and Item files, you could periodically export a representation from your data store. But in practice, for large archives of data we've found that the STAC API experience is better than working with large numbers of static objects.
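A periodic export along those lines could be as simple as paging through the items endpoint and writing each Item to its own file; a rough sketch with a placeholder URL and output directory:

```python
import json
import pathlib

import requests

api_url = "https://stac.example.com"
collection_id = "my-collection"
out_dir = pathlib.Path("export") / collection_id
out_dir.mkdir(parents=True, exist_ok=True)

# page through the collection and dump every Item as a static JSON file
url = f"{api_url}/collections/{collection_id}/items?limit=500"
while url is not None:
    page = requests.get(url, timeout=30).json()
    for feature in page.get("features", []):
        (out_dir / f"{feature['id']}.json").write_text(json.dumps(feature))

    # follow the "next" link until the API stops returning one
    next_links = [link for link in page.get("links", []) if link.get("rel") == "next"]
    url = next_links[0]["href"] if next_links else None
```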
Out of curiosity, what is the data for which you are generating large STAC Collections and where is it stored? For datasets managed in S3 we have a fairly basic infrastructure package we often use called stactools-pipelines to manage creating and ingesting new STAC Items to an API as data is published to a bucket. We haven’t been maintaining it closely but let me know if it might be helpful for you.
It looks like pystac isn't keen on supporting parallel writes (Write Items or collections in parallel · Issue #690 · stac-utils/pystac · GitHub), but if you're still keen on writing static JSON catalogs you could also look into the Rust-based stac_async crate (maybe someone can write a Python wrapper around it?). I'll let Sean chime in on the STAC API approach, which might make more sense depending on the scale you're operating at (edit: oh yep, he posted just a minute before me).
Just seconding what @sharkinsspatial said in simpler terms: this is a database problem. The minute you start thinking about how to update something in a consistent way from multiple processes, you are effectively building a database. There are 70 years of research and theory on how to do this right.
I think that many of the problems that our community faces in terms of data management boil down to the fact that we often use files where a database would be more appropriate.
From the perspective of maintaining a big data archive in the cloud, I'm basically ready to retract this blog post I wrote almost six years ago:
After several years of trying to put that into practice, I no longer think that object storage by itself is a silver bullet for building a cloud-native data repository. It simply doesn’t offer the sorts of transactional guarantees you need to do this right.
> I think that many of the problems that our community faces in terms of data management boil down to the fact that we often use files where a database would be more appropriate.
Agreed.
That said, running a database and giving access to it (directly or through APIs) is much harder than putting files on Blob Storage, which is where services like Earthmover / Earth Search / Planetary Computer, or maybe tools like datasette, come in.