Tool for validating geo data/services moved to the cloud?

At the NASA data center where I work, and I am sure throughout the other NASA data centers, we either have or are currently developing lots of tooling to validate data once it has been moved to the cloud or converted to a cloud-optimized format, or to test a service (like subsetting) that has also moved to the cloud. Validation can mean many things: the data is accessible as expected, catalogued as expected… but the data itself should also be the “same”, typically using an on-prem copy of the data, or a service pointing to an on-prem copy, as the ‘truth’.

Is there some tool we’re missing from the community that can quickly compare data?

I brought this up in some pangeo-forge threads and had a huddle with @aimeeb and @sharkinsspatial to see if pangeo-forge itself has some of this functionality. Right now, the answer is no. Just want to brainstorm with a larger group of folks.

Thanks for starting this thread! I’ve heard similar questions, specifically around the cloud-optimized conversion process.

Is there some tool we’re missing from the community that can quickly compare data?

For truly lossless conversions, maybe xr.testing.assert_equal is that tool? And using it everywhere in projects like pangeo-forge that are doing these conversions?
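For what it’s worth, a minimal sketch of that kind of check, assuming the on-prem NetCDF “truth” and its cloud Zarr copy are both readable from the validation environment (the paths below are made up):

import xarray as xr

# on-prem "truth" and the cloud-optimized copy (hypothetical locations)
truth = xr.open_dataset("/archive/granule.nc")
cloud = xr.open_zarr("s3://my-bucket/granule.zarr")

# raises an AssertionError describing the first difference if the
# dimensions, coordinates, or variable values don't match
xr.testing.assert_equal(truth, cloud)

# assert_identical additionally compares names and attributes
xr.testing.assert_identical(truth, cloud)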

I’m interested in using hashes / checksums of the data in many more places and propagating them through whatever metadata is attached to the data (either in the file-level metadata, as in Zarr and NetCDF, or in external metadata). For a lossless conversion like NetCDF to Zarr with the same chunking structure, the checksum of the uncompressed data should be the same.
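As a sketch of how that could look, assuming we hash the decoded, in-memory bytes of each variable (so the result depends on dtype, shape, and memory order, but not on the on-disk chunking or compression codec); file locations are made up:

import hashlib

import xarray as xr

def variable_checksum(da):
    # hash the decoded array bytes, not the compressed on-disk bytes
    return hashlib.sha256(da.values.tobytes()).hexdigest()

netcdf = xr.open_dataset("/archive/granule.nc")
zarr_copy = xr.open_zarr("s3://my-bucket/granule.zarr")

for name in netcdf.data_vars:
    assert variable_checksum(netcdf[name]) == variable_checksum(zarr_copy[name]), name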

If you’re changing the actual data somehow (reprojecting, say; maybe rechunking?) then the checksum approach falls apart, I think.

On the STAC side, the file extension has some details around metadata for file checksums. This is great if you’re just copying the data, but doesn’t help if you’re doing any kind of (even lossless) cloud optimization.
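For reference, my reading of the file extension is that the checksum lives on the asset as a multihash string, something like this (values made up, and worth double-checking against the extension spec):

# fragment of a STAC item's "assets" section, expressed as a Python dict
assets = {
    "data": {
        "href": "s3://my-bucket/granule.nc",
        "type": "application/x-netcdf",
        "file:size": 123456789,
        # multihash prefix: 0x12 = sha2-256, 0x20 = 32-byte digest, then the digest
        "file:checksum": "1220e0e863980a465d40c3131a22eb3a306d5e19883793c5ca8a2ca642dea94e7123",
    }
}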

I think this datatree identical() function will be quite useful for many validation cases.
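For anyone who hasn’t tried it, a minimal sketch using the standalone xarray-datatree package (DataTree has since been upstreamed into xarray itself); the paths here are hypothetical:

from datatree import open_datatree

truth = open_datatree("/archive/granule.nc")
cloud = open_datatree("s3://my-bucket/granule.zarr", engine="zarr")

# True only if every node has the same structure, values, names, and attributes
print(truth.identical(cloud))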

I like the idea of preserving the checksum somewhere within metadata. I’ve been thinking about validation of data transformation services a lot recently, and this conversation sparked a thought:

Like @briannapagan mentioned, right now we are linking back to on-prem data and services as our truth for validating cloud-hosted services. Eventually, these on-prem services will go away and we will have to run regression testing of new code against older (validated) versions of those services.

Right now our validations involve a lot of (time-consuming) assert_equal() calls across all the data in a subsetted file against the “truth”. What if, during the data transformation processes within the service, checksums were appended to the metadata of the file? Or perhaps a checksum of each variable (data only, so that small changes to metadata, e.g. a production timestamp, don’t affect it)? Then, instead of having to store golden files or run legacy code for validation, we would just keep track of the types of tests we’re doing and the expected checksum result(s) for quick comparison.
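A rough sketch of that idea with hypothetical names, where the transformation service stamps a data-only checksum into each variable’s attributes and later validation only has to recompute and compare:

import hashlib

def data_checksum(da):
    # checksum over the decoded variable data only; attribute edits
    # (e.g. a production timestamp) don't change it
    return hashlib.sha256(da.values.tobytes()).hexdigest()

def stamp_checksums(ds):
    # called inside the transformation service, just before writing the output
    for name, da in ds.data_vars.items():
        da.attrs["data_checksum"] = data_checksum(da)  # hypothetical attribute name
    return ds

def verify_checksums(ds):
    # later regression checks need no golden files, only the stamped values
    return {name: data_checksum(da) == da.attrs.get("data_checksum")
            for name, da in ds.data_vars.items()}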

Great question - I’ve been thinking about this problem recently in the context of Kerchunk.

Kerchunked datasets carry a fundamental risk: you create a virtual zarr store that refers back to some archival files, but if any of those files are altered or moved, your zarr store becomes different/corrupted/invalid with no warning.

Really, the kerchunk data should include some information that allows you to verify that the files it points to have not been changed. You wouldn’t want to perform this check on every access, but you might want to perform it after an update or after moving the data. This could take the form of storing a checksum for each chunk alongside the path to that chunk in a zarr manifest file, for example (see the zarr chunk manifest ZEP idea):

# inside a zarr store there is a file
# a/foo/manifest.json
{
    "0.0.0": {"path": "s3://bucket/foo.nc", "offset": 100, "length": 100, "checksum": e0e863980a465d40c3131a22eb3a306d5e19883793c5ca8a2ca642dea94e7123},
    "0.0.1": {"path": "s3://bucket/foo.nc", "offset": 200, "length": 100, "checksum": c1077dca18dbefa136006a398e0c22951f0abbb7fbe5b50c6da6654d1a6571a3},  
    "0.1.0": {"path": "s3://bucket/foo.nc", "offset": 300, "length": 100, "checksum": 181272f95f1b3df932ce1ac39819ea16ef63073d9b619cac15c870b789120341},  
    "0.1.1": {"path": "s3://bucket/foo.nc", "offset": 400, "length": 100, "checksum": cead9997448adbabfed395429f938a78736c0b6b3586ab22d2a9403373991b56}, 
}
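If a manifest like that existed, the verification pass itself could stay small. A sketch, assuming the (proposed, not yet standardized) layout above, sha256 checksums, and fsspec for the byte-range reads:

import hashlib
import json

import fsspec

# load the proposed chunk manifest for one array
with open("a/foo/manifest.json") as f:
    manifest = json.load(f)

for key, ref in manifest.items():
    # read exactly the byte range this chunk reference points at
    with fsspec.open(ref["path"], "rb") as f:
        f.seek(ref["offset"])
        chunk = f.read(ref["length"])
    if hashlib.sha256(chunk).hexdigest() != ref["checksum"]:
        print(f"{key}: archival file changed since the manifest was written")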

The extreme case of this would be a system that can check and guarantee integrity after any chunk is changed, and expects chunks to be changed regularly. That would be a database of chunks, and is the problem arraylake is meant to solve.

Is there some tool we’re missing from the community that can quickly compare data?

I think that for archival datasets that aren’t supposed to be changed (or only changed rarely), a database is overkill, and some small standalone tool that facilitates calculating checksums on large netCDF/zarr datasets could be useful.

If your compression is lossy then it’s trickier, but at least xarray.testing.assert_allclose() exists.
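Something like the following, where the tolerances have to come from whatever you know about the compressor’s error bounds (the numbers and paths here are placeholders):

import xarray as xr

original = xr.open_dataset("/archive/granule.nc")
lossy = xr.open_zarr("s3://my-bucket/granule-lossy.zarr")

# passes if all values agree within the given relative/absolute tolerances
xr.testing.assert_allclose(original, lossy, rtol=1e-5, atol=1e-8)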

I think that we are raising several interconnected problems here. (If the formatting on this turns out wonky, I apologize in advance. I’m not sure how to preview the formatting!)

Permanent data copies/transformations
Sometimes the original archive data is not suitable for certain applications and it makes sense to transform or enhance the original data in some way. If the original archive data is really large or the transformation is computationally intensive, it may make sense to create a permanent copy of the transformed data, which can then be used repeatedly.

Validation then falls into two categories:

  1. Validating the initial transformation of a dataset from one format to another. In my case, this would be converting original archived data into zarr stores. But I would include kerchunk metadata in this category because it’s creating a new virtual data store.
  • You could validate every single point of transformed data, but this may be impractical if the dataset is very large or if access to either the original data or transformed data is slow and/or computationally intensive. Validating a reasonable subset of data values is probably more realistic? But what is “reasonable?”
  • You could calculate some kind of checksum or checksums, but exactly which checksums you calculate might depend on the transformation. If I’m just copying data from one place to another, then a checksum on each file makes sense. If I’m transforming the data into a new file format, checksums on the original files aren’t necessarily helpful; the checksums would need to be calculated somehow on the data arrays (and metadata?) themselves. But even that may not work if the transformation does something additional, such as subsetting or changing the data type. For example, one old form of compression used in NetCDF files is to express a floating-point data array as integer values that are transformed at run time into floating-point values using a scale factor and offset. Calculating a checksum over the integers in the original data won’t be any help if the transformed data has already applied the scale factor and offset (see the sketch after this list).
  2. Ensuring the transformed data stays valid. In my case, this would include adding and updating zarr stores as new archival data is published. The kerchunk case is essentially the same because the kerchunk metadata will also need to be updated as the archived data changes.
  • You could validate every value in the entire store every time the data is updated, but this is probably overkill. Even double-checking all the updated data may be difficult, however, depending on how much data is touched by the dataset update.
  • You may struggle to detect that changes have occurred. In my zarr case, being slightly out of date may not be terrible: as long as the zarr store is updated within a reasonable time frame, that may be sufficient, particularly if the zarr store carries some kind of lineage information to help users determine how up to date the dataset is. Kerchunk, however, is much trickier. Because kerchunk is used in conjunction with the original archived data, it could start producing some very strange outputs if the archived data changes. In this case, keeping checksums on each of the original archive files may be the way to go. It may not be 100% necessary in every case, though: in some well-designed datasets, the original archive file names always change when a file is replaced because the filename contains a production timestamp. Replacement granules would then cause the kerchunk metadata to refer to suddenly non-existent files, which would at least be simple to detect.
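To make the scale factor / offset point above concrete: with xarray you can read the same file packed or unpacked, and the two hash differently even though they describe the same data (the file and variable names here are hypothetical):

import hashlib

import xarray as xr

# packed: the raw integers exactly as stored in the file
packed = xr.open_dataset("granule.nc", mask_and_scale=False)

# unpacked: scale_factor/add_offset applied, values decoded to float
unpacked = xr.open_dataset("granule.nc")

# these will generally differ, even though the "data" is the same
for ds in (packed, unpacked):
    print(hashlib.sha256(ds["t2m"].values.tobytes()).hexdigest())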

On demand data copies/transformations
Sometimes it makes more sense to transform the data on demand. If the transformation involves smaller datasets or less computation, then on-demand services are great because you can offer a variety of options to suit more use cases. As a bonus, you don’t have to worry about out-of-date transformed data because it is created fresh whenever needed.

But we still need validation:

  1. Validating that the transformation service is working as expected at rollout.
  • On-demand services tend to be parameterized, so it is probably impractical to validate every possible output of the service. Again, you need a reasonable subset of outputs. How do you figure out “reasonable?”
  2. Validating that updates to the transformation service (new features, library updates, etc.) haven’t broken anything.
  • @lsterzinger mentions “golden data” or keeping some other record of previously validated output. I kind of love the idea of keeping a checksum of the validated output rather than a copy of the original outputs (a rough sketch follows below). The downside, however, is that a checksum won’t necessarily tell you how the data changed. And you are sunk if you want to change the output format of the service somehow; e.g., if you want to add more header information, your golden checksums may or may not be useful any more.
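A sketch of that kind of golden-checksum regression test, where subset_service, the request parameters, and the recorded checksums are all hypothetical stand-ins for a real service and its previously validated outputs:

import hashlib

from my_service import subset_service  # hypothetical service under test

# expected per-variable checksums, recorded the last time the output
# was validated by hand; the values here are made up
GOLDEN = {
    ("MY-COLLECTION", "2021-01-01", (-180, -90, 180, 90)): {
        "analysed_var": "e0e863980a465d40c3131a22eb3a306d5e19883793c5ca8a2ca642dea94e7123",
    },
}

def data_checksums(ds):
    # data-only checksums, keyed by variable name for a deterministic comparison
    return {name: hashlib.sha256(ds[name].values.tobytes()).hexdigest()
            for name in sorted(ds.data_vars)}

def test_subset_regression():
    for (collection, date, bbox), expected in GOLDEN.items():
        result = subset_service(collection, date, bbox)
        assert data_checksums(result) == expected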

Some thoughts on checksumming here

Ideally the checksum for data should be independent of the choice of chunking / compression, which is definitely not the case for traditional file-based checksumming. So we need to invent a new algorithm for this.

Also see this package