Some thoughts on checksumming here
opened 06:48PM - 16 Jan 19 UTC
Having checksums for individual chunks is good for v… erifying the integrity of the data we're loading. The existing mechanisms for checksumming data are inadequate for various reasons:
1. **Checksum of the entire array's data**: This does not work for loading a subset of the data.
2. **Checksum of each individual chunk recorded by a filter as part of the chunk**: This does not protect against chunks being swapped, and does not help with building a persistent cache of previously read chunks.
Recording the checksums in the .zarray file could work, but may be problematic for larger data sets.
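One way to address the chunk-swapping problem from point 2 is to bind each digest to the chunk's coordinate key in the metadata, rather than storing it inside the chunk itself. This is only a hypothetical sketch (the `chunk_digests` helper and the dict-based store are illustrative, not part of any zarr API):

```python
import hashlib

def chunk_digests(chunks):
    """Compute a SHA-256 digest per chunk, keyed by the chunk's
    coordinate key (e.g. "0" or "0.0"). Because the digest is bound
    to the key rather than embedded in the chunk, a swapped chunk
    fails verification even though its own bytes are intact.
    `chunks` maps chunk keys to their raw (decoded) bytes."""
    return {key: hashlib.sha256(data).hexdigest()
            for key, data in chunks.items()}

# Hypothetical store: two chunks of a 1-D array.
store = {"0": b"\x01\x02\x03\x04", "1": b"\x05\x06\x07\x08"}
digests = chunk_digests(store)

# Verification catches a swap: chunk "1"'s bytes stored under key "0".
swapped = {"0": store["1"], "1": store["0"]}
ok = all(hashlib.sha256(d).hexdigest() == digests[k]
         for k, d in swapped.items())
print(ok)  # False: each digest is tied to a chunk position
```

A table of per-key digests like this is also what a persistent read cache could validate against, though, as noted above, storing it in `.zarray` may not scale well for arrays with many chunks.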
----
see also:
* https://github.com/zarr-developers/zarr-specs/issues/75
* https://github.com/zarr-developers/zarr_implementations/pull/33#discussion_r616361076
Ideally, the checksum of the data should be independent of the choice of chunking and compression, which is definitely not the case for traditional file-based checksumming. So we need to invent a new algorithm for this.
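The key property can be sketched in a few lines: if the checksum is taken over the decoded bytes in canonical (C) order, it is the same no matter how the data was chunked or compressed on disk. This is an illustrative toy, not a proposed algorithm:

```python
import hashlib

data = bytes(range(16))  # the logical array's decoded bytes

# Two different chunkings of the same logical data.
chunks_a = [data[i:i + 4] for i in range(0, 16, 4)]  # chunk size 4
chunks_b = [data[i:i + 8] for i in range(0, 16, 8)]  # chunk size 8

def logical_checksum(chunks):
    """Hash the decoded bytes in canonical order, streaming chunk by
    chunk. The result depends only on the logical byte sequence, not
    on chunk boundaries or on-disk compression."""
    h = hashlib.sha256()
    for c in chunks:
        h.update(c)
    return h.hexdigest()

print(logical_checksum(chunks_a) == logical_checksum(chunks_b))  # True
```

Note the trade-off: a single whole-array digest like this cannot verify a subset read (point 1 above). A scheme covering both cases would presumably need something tree-structured over fixed logical regions, independent of the storage chunk grid.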
Also see this package