Zarr on unixfsv1 vs on IPLD

There are currently two more-or-less obvious ways to store zarr datasets on IPFS. This topic introduces both options and is meant as a forum for discussing the pros and cons of each.

unixfsv1

Unixfsv1 is the standard IPFS encoding for filesystems. It is one of the earliest codecs (predating IPLD) and carries some protocol-buffers-based legacy. It is widely supported, e.g. by IPFS gateways, and looks like a usual filesystem, so it’s the default way to encode files and directory structures on IPFS. However, as the filesystem layer is not really needed to map zarr onto IPFS (see IPLD below), this approach seems less elegant, and some features (in particular the use for checksumming) may be more difficult to implement.

In order to put a zarr dataset onto IPFS using unixfsv1, one can use

ipfs add -r -H --raw-leaves dataset.zarr

which adds the dataset and returns the computed CID of the dataset.

Afterwards, the zarr dataset can be accessed from IPFS using several methods:

  • via HTTP through a gateway, e.g.: xr.open_dataset("https://ipfs.io/ipfs/CID")
  • via ipfsspec, e.g.: xr.open_dataset("ipfs://CID")
  • by mounting IPFS locally, e.g.: ipfs mount and xr.open_dataset("/ipfs/CID")
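
To make the three access paths concrete, here is a minimal sketch that only builds the three address forms for a given CID. The helper name zarr_locations and its gateway parameter are made up for illustration; passing engine="zarr" to xarray is my assumption about how the dataset would typically be opened, and it is shown only as a comment because it needs xarray, ipfsspec and a reachable IPFS node.

```python
def zarr_locations(cid, gateway="https://ipfs.io"):
    """Return the three address forms for a dataset with the given CID.

    `zarr_locations` is a hypothetical helper, not part of xarray or ipfsspec.
    """
    return {
        # plain HTTP through a public or local gateway
        "gateway": f"{gateway.rstrip('/')}/ipfs/{cid}",
        # fsspec-style URL, resolved by ipfsspec if it is installed
        "ipfsspec": f"ipfs://{cid}",
        # local path, available after running `ipfs mount`
        "mount": f"/ipfs/{cid}",
    }


locations = zarr_locations("CID")  # "CID" is a placeholder, as in the post

# Assumed usage (requires xarray plus the relevant backend):
#   import xarray as xr
#   ds = xr.open_dataset(locations["ipfsspec"], engine="zarr")
```

All three strings point at the same content; which one is preferable depends on whether a local IPFS node is available.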

IPLD

IPLD was developed later and strives to be a generic basis for linked data in the world of distributed data. The IPLD data model is close to JSON, with the addition of a type for bytes and another type for links (which point to content identifiers). It’s relatively straightforward to map zarr directly to IPLD data structures, which is what ipldstore does. Representing zarr directly on native IPLD structures could be advantageous, because internal data items of a Dataset (e.g. the title attribute) would become directly addressable using standard IPLD tools, and all the individually hashed blocks would be visible to the store library (and maybe even within zarr or xarray later on). The latter in particular may be beneficial when implementing checksums for chunks. Subjectively, this approach seems to be the “cleaner” one.
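
As a rough illustration of that mapping (not how ipldstore is actually implemented), the sketch below turns a flat zarr-style key/value mapping into a nested, JSON-like tree. Metadata documents are inlined as plain objects, and chunks are replaced by links. The function names are made up, and the "link" is just a SHA-256 hex digest wrapped in a {"/": ...} object in the style of DAG-JSON; a real IPLD link would be a proper CID.

```python
import hashlib
import json

# zarr v2 metadata keys that hold JSON documents
METADATA_KEYS = {".zgroup", ".zarray", ".zattrs", ".zmetadata"}


def put_block(blocks, data):
    """Store raw bytes under their SHA-256 hex digest (stand-in for a CID)."""
    digest = hashlib.sha256(data).hexdigest()
    blocks[digest] = data
    return digest


def zarr_to_tree(flat_store, blocks):
    """Build a nested tree: metadata inlined, chunks replaced by links."""
    tree = {}
    for key, value in flat_store.items():
        *parents, leaf = key.split("/")
        node = tree
        for part in parents:
            node = node.setdefault(part, {})
        if leaf in METADATA_KEYS:
            node[leaf] = json.loads(value)  # becomes addressable structure
        else:
            node[leaf] = {"/": put_block(blocks, value)}  # link to chunk
    return tree


# Tiny hand-written example store (hypothetical content)
flat = {
    ".zgroup": b'{"zarr_format": 2}',
    ".zattrs": b'{"title": "demo"}',
    "temperature/.zarray": b'{"shape": [2]}',
    "temperature/0": b"\x00\x01",
}
blocks = {}
tree = zarr_to_tree(flat, blocks)
title = tree[".zattrs"]["title"]          # attribute reachable by path
chunk_link = tree["temperature"]["0"]["/"]  # digest of the chunk bytes
```

In this toy version the title attribute is reachable by walking the tree, and every chunk is identified by the hash of its content, which are the two properties described above.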


Welcome Tobias and thanks for posting an interesting question! I’m sorry I didn’t reply earlier…COVID got me last week. :microbe:

I strongly favor the direct Zarr :arrow_right: IPLD approach, bypassing IPFS. As you note, it is simpler and cleaner, avoiding the unnecessary intermediate abstraction of files and blocks.

As we have learned recently thanks to the work by @martindurant, @lsterzinger, and @rsignell with kerchunk, basically every common legacy file format (netcdf3, netcdf4, grib, geotiff / COG) can be mapped 1:1 to the Zarr data model. If we had a way to store the Zarr data model in IPLD, we could effectively bring all of these formats along for the ride.


I’m curious: what metrics could we use to help decide among the two approaches? Performance? Convenience? Support in the ecosystem?

Those metrics are good. We probably need all of them, and maybe I’d add flexibility as well?

One other thing that would be relatively easy with Zarr :arrow_right: IPLD is storing those IPLD blocks back on classic key-value stores (i.e. using ipldstore as a storage transformer). In that case, one could add checksumming to any other store.
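
A minimal sketch of that idea, assuming the key-value store behaves like a Python mutable mapping: blocks are written under the hash of their content and re-hashed on every read, so corruption in the backend is detected. The class name ChecksummedBlocks is hypothetical, and SHA-256 hex digests stand in for real CIDs; this is not the ipldstore API.

```python
import hashlib


class ChecksummedBlocks:
    """Content-addressed wrapper around any mapping-like key-value store."""

    def __init__(self, backend):
        self.backend = backend  # e.g. a dict, or a mapping over object storage

    def put(self, data):
        """Write a block and return the key derived from its content."""
        digest = hashlib.sha256(data).hexdigest()
        self.backend[digest] = data
        return digest

    def get(self, digest):
        """Read a block and verify it still matches its key."""
        data = self.backend[digest]
        if hashlib.sha256(data).hexdigest() != digest:
            raise ValueError(f"checksum mismatch for block {digest}")
        return data


backend = {}  # stand-in for a classic key-value store
store = ChecksummedBlocks(backend)
key = store.put(b"chunk bytes")
roundtrip = store.get(key)  # verified on read
```

Because the key is derived from the content, the check needs no separate checksum file: any backend that can store bytes under string keys gets verification for free.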