Tool for validating geo data/services moved to the cloud?

Great question - I’ve been thinking about this problem recently in the context of Kerchunk.

Kerchunked datasets carry a fundamental risk: you create a virtual zarr store that refers back to some archival files, but if any of those files are altered or moved, your zarr store becomes different, corrupted, or invalid with no warning.

Really, the kerchunk references should include information that lets you verify that the files they point to have not been changed. You wouldn't want to perform this check on every access, but you might want to perform it after an update, or after moving the data. This could take the form of storing a checksum for each chunk alongside that chunk's path in a zarr manifest file, for example (see the zarr chunk manifest ZEP idea):

# inside a zarr store there is a file
# a/foo/manifest.json
{
    "0.0.0": {"path": "s3://bucket/foo.nc", "offset": 100, "length": 100, "checksum": "e0e863980a465d40c3131a22eb3a306d5e19883793c5ca8a2ca642dea94e7123"},
    "0.0.1": {"path": "s3://bucket/foo.nc", "offset": 200, "length": 100, "checksum": "c1077dca18dbefa136006a398e0c22951f0abbb7fbe5b50c6da6654d1a6571a3"},
    "0.1.0": {"path": "s3://bucket/foo.nc", "offset": 300, "length": 100, "checksum": "181272f95f1b3df932ce1ac39819ea16ef63073d9b619cac15c870b789120341"},
    "0.1.1": {"path": "s3://bucket/foo.nc", "offset": 400, "length": 100, "checksum": "cead9997448adbabfed395429f938a78736c0b6b3586ab22d2a9403373991b56"}
}
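
As a rough illustration, here's a minimal sketch of that verification step, assuming SHA-256 checksums and using fsspec to fetch just the referenced byte ranges (verify_manifest is a hypothetical helper, and the manifest format is the one sketched above):

import hashlib
import json

import fsspec

def verify_manifest(manifest_path: str) -> dict[str, bool]:
    """Re-hash every byte range referenced by a chunk manifest and
    compare against the stored checksums. Returns {chunk_key: ok}."""
    with fsspec.open(manifest_path) as f:
        manifest = json.load(f)

    results = {}
    for key, entry in manifest.items():
        fs, path = fsspec.core.url_to_fs(entry["path"])
        # Read only this chunk's byte range, not the whole archival file
        data = fs.cat_file(path, start=entry["offset"],
                           end=entry["offset"] + entry["length"])
        results[key] = hashlib.sha256(data).hexdigest() == entry["checksum"]
    return results

You'd only run something like this after a suspected change, since it still has to download every referenced byte range.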

The extreme case of this would be a system that can check and guarantee integrity after any chunk is changed, and expects chunks to be changed regularly. That would be a database of chunks, and is the problem arraylake is meant to solve.

Is there some tool we’re missing from the community that can quickly compare data?

I think that for archival datasets that aren't supposed to change (or change only rarely), a database is overkill, and some small standalone tool that makes it easy to calculate checksums on large netCDF/zarr datasets could be useful.
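
To give a sense of the scope, the core of such a tool might do little more than walk a store and hash every file it contains; a sketch, again assuming fsspec and SHA-256 (checksum_store and the store URLs below are hypothetical):

import hashlib

import fsspec

def checksum_store(store_url: str) -> dict[str, str]:
    # Map each file in the store (chunks and metadata alike) to a
    # SHA-256 digest, keyed by path relative to the store root
    fs, root = fsspec.core.url_to_fs(store_url)
    return {
        path[len(root):].lstrip("/"): hashlib.sha256(fs.cat_file(path)).hexdigest()
        for path in fs.find(root)
    }

# Comparing two copies of a dataset then reduces to comparing dicts:
# checksum_store("s3://bucket/ds.zarr") == checksum_store("/local/copy/ds.zarr")

The real work would be in parallelising this over stores with many chunks.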

If your compression is lossy then it's trickier, because checksums of the bytes will differ even when the data agrees to within the compression error, but at least xarray.testing.assert_allclose() exists.
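
For example (the file paths here are placeholders), comparing an original against a lossily round-tripped copy within a tolerance:

import xarray as xr

original = xr.open_dataset("foo.nc")
roundtripped = xr.open_zarr("s3://bucket/foo.zarr")

# Passes if values agree within tolerance, even though the bytes
# (and hence any checksums) differ
xr.testing.assert_allclose(original, roundtripped, rtol=1e-5)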
