There are currently two more-or-less obvious ways to store zarr datasets on IPFS. This topic introduces those possibilities and is meant as a forum for discussing the pros and cons of either one.
Unixfsv1 is the standard IPFS encoding for filesystems. It is one of the earliest codecs (predating IPLD) and carries some protocol-buffers-based legacy. It is widely supported, e.g. by IPFS gateways, and looks like a usual filesystem, so it is the default way to encode files and directory structures on IPFS. However, as the filesystem layer is not really needed to map zarr onto IPFS (see IPLD below), this approach seems less elegant, and some features (in particular the use of per-chunk checksums) may be more difficult to implement.
In order to put a zarr dataset onto IPFS using unixfsv1, one can use

```shell
ipfs add -r -H --raw-leaves dataset.zarr
```

which adds the dataset and returns its computed CID.
Afterwards, the zarr dataset can be accessed from IPFS using several methods:
- via HTTP through an IPFS gateway
- via ipfsspec from Python
- by mounting IPFS locally (e.g. via FUSE)
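To make the three access paths concrete, here is a small sketch of how a single zarr chunk key maps to a location in each case. The CID and chunk key are placeholders, and `https://ipfs.io` stands in for any public gateway:

```python
# Sketch: how one zarr key is addressed under each access method.
# The CID and chunk key below are placeholders, not a real dataset.
cid = "QmExampleCid"          # as returned by `ipfs add`
key = "temperature/0.0.0"     # a zarr chunk inside dataset.zarr

# 1. HTTP gateway: the gateway resolves the unixfsv1 path for us.
gateway_url = f"https://ipfs.io/ipfs/{cid}/{key}"

# 2. ipfsspec: an fsspec-style URL.
ipfsspec_url = f"ipfs://{cid}/{key}"

# 3. Local FUSE mount (`ipfs mount`): a plain filesystem path.
mounted_path = f"/ipfs/{cid}/{key}"

print(gateway_url)   # https://ipfs.io/ipfs/QmExampleCid/temperature/0.0.0
print(ipfsspec_url)  # ipfs://QmExampleCid/temperature/0.0.0
print(mounted_path)  # /ipfs/QmExampleCid/temperature/0.0.0
```

With ipfsspec installed, URLs of the second form (pointing at the dataset root) should be usable as an fsspec-backed store, e.g. for opening the dataset with zarr or xarray.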
IPLD was developed later and strives to be a generic basis for linked data in the world of distributed data. The IPLD data model is close to JSON, with the addition of a type for bytes and another type for links (which point to content identifiers). It is relatively straightforward to map zarr directly onto IPLD data structures, which is what ipldstore does. Representing zarr directly on native IPLD structures could be advantageous, because internal data items (e.g. the title attribute of a Dataset) would become directly addressable using standard IPLD tools, and all the individually hashed blocks would be visible to the store library (and maybe even to zarr or xarray later on). Especially the latter may be beneficial when implementing checksums for chunks. Subjectively, this approach seems to be the “cleaner” one.
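To illustrate the idea, a toy content-addressed store can mimic the IPLD mapping: every value in the zarr hierarchy becomes a block stored under its hash, and each directory level becomes a map from names to links, itself stored as a block. This is only a sketch of the principle, not the actual ipldstore implementation — it uses bare sha256 digests and JSON where real IPLD uses CIDs and a codec such as dag-cbor:

```python
import hashlib
import json

class ToyBlockStore:
    """Toy content-addressed store: sha256 digests stand in for CIDs."""
    def __init__(self):
        self.blocks = {}

    def put(self, data: bytes) -> str:
        digest = hashlib.sha256(data).hexdigest()
        self.blocks[digest] = data
        return digest  # the "link" to this block

    def get(self, link: str) -> bytes:
        return self.blocks[link]

store = ToyBlockStore()

# Leaf blocks: a chunk and the zarr array metadata, each hashed individually.
chunk_link = store.put(b"\x00" * 16)  # stand-in for compressed chunk bytes
zarray_link = store.put(json.dumps({"shape": [4], "chunks": [4]}).encode())

# Root level: a map from zarr keys to links, itself stored as a block.
root = {"temperature/.zarray": zarray_link, "temperature/0": chunk_link}
root_link = store.put(json.dumps(root, sort_keys=True).encode())

# Any internal item is now individually addressable by walking links
# from the root — which is what makes per-block checksums cheap: the
# link *is* the checksum of the content it points to.
resolved = json.loads(store.get(root_link))
chunk = store.get(resolved["temperature/0"])
```

The design point this illustrates: because every block is addressed by its own hash, verifying a single chunk requires fetching only that chunk and comparing digests, with no filesystem layer in between.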