Content-addressable data structures (and in particular IPLD) form immutable graphs. Whenever anything changes, the root content identifier (CID) changes as well. However, identical content is always addressed by the same CID, independent of which dataset it is part of. This poses some challenges and may lead to interesting solutions:
If multiple workers are to write concurrently to some dataset, they can’t easily write to the same location (folder, container, bucket etc.), e.g. using a region write, because there is no such location. Instead, they could concurrently write to independent datasets, which are afterwards joined into a common one. This would require some tooling which takes multiple datasets (by CID) and joins them into one. The interesting part is that it should be possible to join or rearrange datasets without touching the underlying data, rewriting only the metadata structures. Similarly, it should be possible to convert between unixfs- and IPLD-based zarr just by rewriting the higher-level objects.
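To make the join idea concrete, here is a toy sketch of what such tooling could look like. Everything here is an assumption for illustration: the "CIDs" are plain sha256 hex digests rather than real multiformats CIDs, the objects are JSON maps from names to child CIDs rather than real dag-cbor, and the `join` helper is hypothetical. The point it demonstrates is that two datasets can be merged by writing one new metadata object, while the chunk blocks stay untouched and shared content keeps its CID:

```python
import hashlib
import json

# Toy content-addressed store: map from "CID" (sha256 hex digest
# here, not a real multiformats CID) to the serialized block.
def put(store, obj):
    data = json.dumps(obj, sort_keys=True).encode()
    cid = hashlib.sha256(data).hexdigest()
    store[cid] = data
    return cid

def join(store, root_a, root_b):
    """Join two dataset roots into one, touching only metadata.

    Both roots are assumed to be maps from names to child CIDs
    (e.g. zarr groups); the chunk blocks themselves are untouched.
    """
    a = json.loads(store[root_a])
    b = json.loads(store[root_b])
    merged = {**a, **b}  # name collisions: b wins in this sketch
    return put(store, merged)

store = {}
chunk = put(store, "chunk-bytes")      # pretend chunk payload
root_a = put(store, {"x/0": chunk})    # dataset A
root_b = put(store, {"y/0": chunk})    # dataset B, reusing the same chunk
root = join(store, root_a, root_b)

assert json.loads(store[root]) == {"x/0": chunk, "y/0": chunk}
# identical content got the same CID in both datasets, so the store
# holds only 4 blocks: 1 chunk + 3 metadata objects
assert len(store) == 4
```

The final assertion is the interesting property: the join created exactly one new block, and the shared chunk was deduplicated automatically because it has the same CID in both inputs.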
I agree that the immutable / content-addressable framework poses some interesting challenges. In general, Zarr expects the parent object (array or group) to be created before the chunks. With IPLD, however, the children need to be created first, because the parent can’t be constructed without the children’s hashes. It will be interesting to think through how this will work. I’m sure it is solvable; it just requires some thought and creativity.
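The inverted write order can be sketched in a few lines. Again everything is a stand-in assumption: "CIDs" are sha256 hex digests and the objects are ad-hoc byte encodings, not real IPLD blocks. What the sketch shows is the dependency chain: chunks must exist before the array object can be hashed, and the array before the group:

```python
import hashlib

# Toy content-addressed store for illustrating children-first writes.
store: dict[str, bytes] = {}

def put(data: bytes) -> str:
    cid = hashlib.sha256(data).hexdigest()
    store[cid] = data
    return cid

# 1. write the chunks first; only now do their CIDs exist
chunk_cids = {f"0.{i}": put(b"chunk %d" % i) for i in range(4)}

# 2. the array object links chunk keys to chunk CIDs
array_cid = put(
    ";".join(f"{k}={v}" for k, v in sorted(chunk_cids.items())).encode()
)

# 3. finally the group links to the array; its CID is the dataset root
root_cid = put(f"array={array_cid}".encode())

# changing any chunk would change its CID, hence the array object,
# hence the root: updates ripple bottom-up through the graph
```

This is the opposite of the usual Zarr flow, where the store location of a chunk is known before the chunk is written; here the "location" (CID) of every parent is only known after all its children exist.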
In general, yes, I think CAS will require some specialized utilities to merge / extend / append to datasets. As you noted, it should be possible to do all of this at the level of pure metadata, without ever rewriting any of the actual data chunks. That would be very cool to see!
In your opinion, what next steps are needed to make progress?
- As this splits blobs, it might help to get concatenation working in kerchunk for inspection of the resulting CARs.
- Enable the use of HAMTs to split large metadata objects into multiple IPLD objects.
- Sharded backends (if we want IPLD on classic object stores or file systems):
  - IPLD objects have to be < 2 MiB to be really useful.
  - We might want to use CAR as a serialization format for packing multiple IPLD objects into single objects. Currently ipldstore only exports the full tree as a single CAR, but that might be too big.
  - Renumbering chunk IDs might help as well (that way, an IPLD subtree would correspond to one shard).
- General tooling for handling IPLD and CAR in Python:
  - Most of the tooling is only available in Go. It’s not terribly hard to build parsers and writers for those things, but it takes time. There are a couple of tools in ipldstore, but probably we’ll need a proper implementation of the IPLD data model in Python at some point.
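As a starting point for such Python tooling, even CID construction fits in a few lines of the standard library. This is a sketch covering only one case (CIDv1, raw codec, sha2-256), not a full implementation of the IPLD data model; the codec byte `0x55` ("raw") and the multihash prefix `0x12 0x20` (sha2-256, 32-byte digest) are taken from the multicodec table:

```python
import base64
import hashlib

def cid_v1_raw(block: bytes) -> str:
    """CIDv1 for a raw block, as a base32 multibase string.

    Layout: version varint (0x01) | codec varint (0x55 = raw)
            | multihash (0x12 = sha2-256, 0x20 = 32-byte digest, digest).
    """
    digest = hashlib.sha256(block).digest()
    multihash = bytes([0x12, 0x20]) + digest
    cid_bytes = bytes([0x01, 0x55]) + multihash
    # multibase base32-lower: prefix "b", padding stripped
    return "b" + base64.b32encode(cid_bytes).decode().lower().rstrip("=")
```

All raw sha2-256 CIDs produced this way share the familiar `bafkrei…` prefix, since the first characters encode only the fixed header bytes. Real dag-cbor objects would additionally need a CBOR encoder with the CID link tag, which is where a proper Python IPLD library would come in.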
I’m quite busy doing other things at least a bit into July, but I’ve been in touch with a couple of people, also advertising this discourse a bit; I hope we’ll be growing and becoming more visible. Currently there are some folks playing around with extensions to both ipfsspec and ipldstore. Some colleagues are currently ingesting a bunch of data into zarr on unixfs, but there we are also at the point where we need more Python tooling to inspect, modify and handle those things.