Conflict-free Replicated Zarr

There’s a data structure that is almost a meme in the distributed data world: the CRDT (conflict-free replicated data type). CRDTs use a bit of algebraic magic (merges that are commutative, associative, and idempotent) to provide strong eventual consistency: when a store syncs with the other copies, all versions eventually resolve to the same state, even if updates happened offline. And it all happens peer-to-peer; unlike other replication systems, no central party is needed.
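For anyone who hasn’t bumped into one, here’s a toy sketch of what that magic looks like (a homemade last-writer-wins map, not any particular library): because the merge is commutative, associative, and idempotent, replicas converge no matter what order they sync in.

```python
import time
from dataclasses import dataclass, field


@dataclass
class LWWMap:
    """Toy last-writer-wins map: each entry remembers when and where it was written."""
    replica_id: str
    # key -> (timestamp, replica_id, value); (timestamp, replica_id) gives a
    # total order so ties break the same way on every replica.
    entries: dict = field(default_factory=dict)

    def set(self, key, value):
        self.entries[key] = (time.time_ns(), self.replica_id, value)

    def get(self, key):
        return self.entries[key][2]

    def merge(self, other: "LWWMap"):
        # Keep whichever write is "latest"; doing this twice, or in any order,
        # gives the same result (idempotent, commutative, associative).
        for key, entry in other.entries.items():
            current = self.entries.get(key)
            if current is None or entry[:2] > current[:2]:
                self.entries[key] = entry


# Two replicas diverge offline, then sync peer-to-peer in either order...
a, b = LWWMap("a"), LWWMap("b")
a.set("zarr.json", '{"shape": [100]}')
b.set("zarr.json", '{"shape": [200]}')
a.merge(b)
b.merge(a)
assert a.entries == b.entries  # ...and still agree on the final state.
```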

I just read today about updates to a Python CRDT library (Usage - pycrdt), and it got me thinking: could this intersect with Zarr?

What if Zarr metadata stores were CRDTs? If binary blobs could be atomically coupled with changes to the metadata, could we have decentralized but equivalent Zarr stores in multiple places in the cloud?

My interface brain is curious whether this could be done with something like a mixin to a Zarr store class. (On second thought, maybe simple composition is better.) You’d bring your own store, register it as a CRDT, and choose when you perform the sync operation. Users could run a daily sync, react to a bucket change subscription, or sync before making modifications.
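As a very rough sketch of the composition idea (every name here is invented, and I’m assuming pycrdt’s Doc and Map behave as in its docs, with a dict-like Map plus get_update/apply_update): wrap any dict-like store, mirror metadata writes into a shared document, and sync whenever you choose. Chunk blobs are deliberately left out, which is exactly the atomic-coupling question above.

```python
from pycrdt import Doc, Map

# Zarr v2 metadata keys end in .zarray/.zgroup/.zattrs; v3 uses zarr.json.
METADATA_SUFFIXES = (".zarray", ".zgroup", ".zattrs", "zarr.json")


class CRDTStore:
    """Hypothetical wrapper: bring your own key -> bytes store, register it as a CRDT."""

    def __init__(self, inner: dict):
        self.inner = inner                       # the store you brought
        self.doc = Doc()
        self.doc["meta"] = self.meta = Map()     # shared, mergeable metadata

    def __setitem__(self, key: str, value: bytes):
        self.inner[key] = value
        if key.endswith(METADATA_SUFFIXES):
            # Metadata documents are JSON text, so they fit in a shared Map.
            self.meta[key] = value.decode()

    def __getitem__(self, key: str) -> bytes:
        return self.inner[key]

    def sync_with(self, other: "CRDTStore"):
        # Exchange CRDT updates in both directions; the CRDT resolves conflicts.
        other.doc.apply_update(self.doc.get_update())
        self.doc.apply_update(other.doc.get_update())
        # Write the merged metadata back into each underlying store.
        for store in (self, other):
            for key, text in store.meta.items():
                store.inner[key] = text.encode()


# Two ends of a sync, each wrapping a plain dict as its store.
s1, s2 = CRDTStore({}), CRDTStore({})
s1["temp/zarr.json"] = b'{"zarr_format": 3, "node_type": "group"}'
s1.sync_with(s2)
assert s2.inner["temp/zarr.json"] == s1.inner["temp/zarr.json"]
```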

Optimizations, a bit like tail-call elimination, could jump straight to the final bucket state when applying a long chain of operations.
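Something like this, hand-waving over deletes and with made-up keys: fold the operation log down to its net effect, so only one write per key ever touches the bucket.

```python
# A long chain of operations from an offline session...
ops = [
    ("set", "temp/c/0/0", b"v1"),
    ("set", "temp/c/0/0", b"v2"),
    ("set", "temp/zarr.json", b'{"shape": [10]}'),
    ("set", "temp/c/0/0", b"v3"),
]

# ...folds down to its net effect: later writes shadow earlier ones.
final_state = {}
for _op, key, value in ops:
    final_state[key] = value

assert final_state["temp/c/0/0"] == b"v3"   # one write per key, not one per op
```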

Anyway, this idea has been rumbling around in my mind, and I was curious whether it resonated with anyone. And how and why would this not work out?


> Optimizations, a bit like tail-call elimination, could jump straight to the final bucket state when applying a long chain of operations.

I like the ideas you’re describing! A while ago I was thinking about something loosely related: how to avoid reprocessing entire archives when we “mutate” them. That could be a reprocessing campaign, fixing QA flags on the original data, etc. These incremental changes wouldn’t need to be applied right away, perhaps only upon data usage, so the original data could remain untouched. It would also mean that replicas wouldn’t have to be totally rewritten from scratch. I believe some of these ideas are implemented to some degree in Apache Iceberg (Earthmover wink wink). The idea of CRDTs resonates with me at the metadata level; it would be really interesting to explore this, and like you I’d be curious to learn what people think.
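Roughly what I have in mind, with all the names invented: reads consult a small patch layer first, so the original archive stays untouched and replicas only need to ship the patches.

```python
class PatchedReader:
    """Hypothetical read path that applies incremental changes only on access."""

    def __init__(self, base: dict, patches: dict):
        self.base = base        # the original, never-rewritten key -> bytes archive
        self.patches = patches  # incremental fixes (e.g. corrected QA flags)

    def __getitem__(self, key: str) -> bytes:
        # Patches win; untouched keys fall through to the original data.
        if key in self.patches:
            return self.patches[key]
        return self.base[key]


base = {"flags/c/0": b"\x00\x00", "flags/zarr.json": b"{}"}
patches = {"flags/c/0": b"\x00\x01"}          # one reprocessed chunk
reader = PatchedReader(base, patches)
assert reader["flags/c/0"] == b"\x00\x01"     # fix applied lazily, on read
assert base["flags/c/0"] == b"\x00\x00"       # original remains untouched
```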


Thanks for the thoughts, Luis! Glad this resonates.

I really like that this data model encourages communal maintenance of the data without a central authority or owner. I wonder how this would impact data governance.

I agree. I think for Zarr-CRDTs to work, we’d have to adopt techniques pioneered at Earthmover, like metadata tracking of each bucket item, to say the least.

Depending on the CRDT implementation, I wonder if we’d get some of these fusion-like optimizations “for free”: CRDT Survey, Part 1: Introduction - Matthew Weidner (State-based CRDTs).
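A tiny sketch of the state-based flavour from that survey (a homemade grow-only counter, no particular library): because merge is a join over monotone states, folding a whole chain of intermediate states gives the same answer as merging only the latest one.

```python
def merge(a: dict, b: dict) -> dict:
    # State-based merge: element-wise max of per-replica counters (a join).
    return {r: max(a.get(r, 0), b.get(r, 0)) for r in a.keys() | b.keys()}


# Intermediate states published by replica "x" as it keeps incrementing.
chain = [{"x": 1}, {"x": 2}, {"x": 3}]
local = {"y": 5}

step_by_step = local
for state in chain:                  # replaying the whole chain...
    step_by_step = merge(step_by_step, state)

shortcut = merge(local, chain[-1])   # ...vs. merging only the latest state
assert step_by_step == shortcut == {"x": 3, "y": 5}
```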

I was looking at the recordings of a favorite conference of mine, and I found this talk: A peer-to-peer spatial database

TIL about peermaps.org, a group building p2p mapping infrastructure on top of IPFS and the like. The talk focuses on the database core (GitHub - peermaps/eyros: interval database), an engine that performs multi-dimensional geospatial queries using a combination of tree data structures. As mentioned briefly in the talk, this engine is amenable to distributed stores, like CRDTs.