Canonical datasets and assurance - do we have an approach?

Do we have a technical approach to dataset assurance? A lot of CMIP data is being uploaded to the public cloud - is there a systematic way to guarantee that the data being served is the most appropriate version? To me, one of the fundamental appeals of a widely accessible, scalable platform is to encourage canonical data sources which are written once, maintained by an authority, but accessed many times. Do we think this is happening/will happen with ASDI/AWS Earth, Microsoft Planetary Computer etc.? How do we ensure that this doesn't exacerbate the problem of shadow datasets?

I don't know what I don't know… but this feels like it could be a really pathological issue. Could we develop a technology for assuring/checking/idiot-proofing against people using the wrong data, e.g. out-of-date data?


p.s. check me out using discourse :stuck_out_tongue:


Niall, this has been discussed extensively in the Pangeo / ESGF cloud data working group. I invite you to review the notes and join that meeting if you're interested in the topic. The next meeting is this Friday.

https://pangeo-data.github.io/pangeo-cmip6-cloud/working_group.html


First off, I do childcare on Fridays so unfortunately can't make the session. Obviously Matt's there from the MO, so I can catch up with him - we talked about this yesterday.

Thanks for the link to the notes. I had a look through - from what I can see:

On December 10th there was a topic of "Errata stuff - retractions etc." From what I could understand, this is mostly about how to make it clear which data have been retracted. But for something to be retracted, presumably someone has to notice it's wrong in the first place. Or have I misunderstood?

There is also:

CEDA & Matt to share info on how data retraction and versioning checks for their data repo occurs. Follow-up with Naomi on the LDEO process. - this will help determine path forward or recommended approach(es)

That item is checked off, but I'm unclear whether it led to a path forward or recommended approach.

At the risk of buzzword bingo: could NFTs or something similar be a way to guarantee provenance? I understand from Matt that there is already some kind of version-checking service (but I can't remember who's responsible for it… CEDA?). Perhaps that amounts to the same thing.

Would it be practical to stitch this kind of version check into data access, or to recheck the catalogs periodically, or something along those lines?
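
Just to make the idea concrete, here is a very rough Python sketch of what a checksum/version check stitched into data access might look like. The catalog fields, bucket path and checksum value are all placeholders; in practice the expected hashes and versions would come from whatever service (CEDA, ESGF, etc.) publishes them.

```python
import hashlib

import fsspec

# Hypothetical catalog record - in reality the checksum and version would be
# published alongside the dataset by the responsible authority.
catalog_entry = {
    "url": "s3://some-bucket/CMIP6/.../tas_Amon_example.nc",  # placeholder path
    "sha256": "<expected checksum published by the data authority>",
    "version": "v20200101",  # placeholder version label
}


def open_verified(entry):
    """Fetch a file and refuse to hand it over if its checksum doesn't match."""
    with fsspec.open(entry["url"], "rb", anon=True) as f:
        data = f.read()
    digest = hashlib.sha256(data).hexdigest()
    if digest != entry["sha256"]:
        raise ValueError(
            f"{entry['url']} does not match the catalogued checksum for "
            f"{entry['version']} - the copy may be stale or retracted."
        )
    return data
```

Reading the whole file just to hash it is obviously too heavy for big netCDFs, so a real version would check a manifest or chunk-level hashes instead - but that's the kind of "idiot-proofing at access time" I had in mind.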

Not NFTs per se, but the concept of content-addressable storage (CAS) would definitely go a long way towards ensuring data integrity, because checksums are verified every time the data are accessed. CAS is the basis of IPFS, which is a "web3" style of storage and underlies the Filecoin ecosystem. We have been working with the folks from Protocol Labs on an fsspec / IPFS interface.
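
To make that concrete, here is a minimal sketch of the CAS idea in plain Python (no IPFS dependency, just an in-memory store): the address of a blob is derived from a hash of its bytes, so any corruption or substitution is detected the moment you read it back.

```python
import hashlib

# Minimal in-memory content-addressable store: objects are keyed by the
# SHA-256 of their bytes, so the address itself certifies the content.
store = {}


def put(data: bytes) -> str:
    """Store a blob and return its content address (hex digest)."""
    address = hashlib.sha256(data).hexdigest()
    store[address] = data
    return address


def get(address: str) -> bytes:
    """Retrieve a blob and verify it still matches its address."""
    data = store[address]
    if hashlib.sha256(data).hexdigest() != address:
        raise ValueError("Content no longer matches its address")
    return data


key = put(b"example dataset bytes")
assert get(key) == b"example dataset bytes"
```

IPFS CIDs work the same way conceptually, just with a richer multihash format and a distributed network behind them, so a filesystem-style interface on top gives you that verification transparently on every read.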

There has also been a long discussion of this, and how it relates to Zarr, on the Zarr repo.

Regarding trust and validation of cloud-based data stores compared to archival data, the biggest development in that space in my mind is Kerchunk. Kerchunk allows us to get Zarr-level performance on the original netCDFs, which presumably have checksums from the original data provider.
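
For anyone who hasn't seen it, the workflow looks roughly like this (the path and storage options below are placeholders, not the real CMIP6 setup; the actual reference files are built and published by the working group):

```python
import fsspec
import xarray as xr
from kerchunk.hdf import SingleHdf5ToZarr

# Placeholder location of an original netCDF file in object storage.
url = "s3://some-bucket/CMIP6/.../tas_Amon_example.nc"

# Scan the netCDF/HDF5 file once and build a Zarr-style set of byte-range
# references that point back at the original file (no data are copied).
with fsspec.open(url, "rb", anon=True) as f:
    refs = SingleHdf5ToZarr(f, url).translate()

# Open the references as if they were a Zarr store; every read is served
# from the archival netCDF, so the provider's checksums still apply to it.
mapper = fsspec.get_mapper(
    "reference://",
    fo=refs,
    remote_protocol="s3",
    remote_options={"anon": True},
)
ds = xr.open_dataset(mapper, engine="zarr", backend_kwargs={"consolidated": False})
```

So you get cloud-friendly, chunked access without ever creating a second "shadow" copy of the data.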


That sounds really exciting - thanks for the post.