Canonical datasets and assurance - do we have an approach?

Do we have a technical approach to dataset assurance? A lot of CMIP data is being uploaded to the public cloud - is there a systematic way to guarantee that the data being served is the most appropriate version? To me, one of the fundamental appeals of a widely accessible, scalable platform is to encourage canonical data sources which are written once, maintained by an authority, but accessed many times. Do we think this is happening/will happen with ASDI/AWS Earth, Microsoft Planetary Computer etc.? How do we ensure that this doesn't exacerbate the problem of shadow datasets?

I don't know what I don't know… but this feels like it could be a really pathological issue. Could we develop a technology for assuring/checking/idiot-proofing against people using the wrong data, e.g. out-of-date data?


p.s. check me out using discourse :stuck_out_tongue:


Niall, this has been discussed extensively in the Pangeo / ESGF cloud data working group. I invite you to review the notes and join that meeting if you're interested in the topic. The next meeting is this Friday.

https://pangeo-data.github.io/pangeo-cmip6-cloud/working_group.html


First off, I do childcare on Fridays so unfortunately can't make the session. Obviously Matt's there from the MO, so I can catch up with him - we talked about this yesterday.

Thanks for the link to the notes. I had a look through - from what I can see:

On December 10th there was a topic of "Errata stuff - retractions etc." From what I could understand, this is mostly about how to make it clear which data have been retracted. But for something to be retracted, presumably someone has to notice it's wrong in the first place. Or have I misunderstood?

There is also:

CEDA & Matt to share info on how data retraction and versioning checks for their data repo occurs. Follow-up with Naomi on the LDEO process. - this will help determine path forward or recommended approach(es)

That item is checked off, but I'm unclear whether it led to a path forward or recommended approach.

At the risk of buzzword bingo: could NFTs or something similar be a way to guarantee provenance? I understand from Matt that there is already some kind of version-checking service (but I can't remember who's responsible for it… CEDA?). Perhaps that amounts to the same thing.

Would it be practical to stitch this kind of version check into data access, or to recheck the catalogs periodically, or something along those lines?
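
Just to make the idea concrete, here is a very rough Python sketch of what a checksum/version check stitched into data access might look like. The catalog fields, bucket path and checksum value are all placeholders; in practice the expected hashes and versions would come from whatever service (CEDA, ESGF, etc.) publishes them.

```python
import hashlib

import fsspec

# Hypothetical catalog record - in reality the checksum and version would be
# published alongside the dataset by the responsible authority.
catalog_entry = {
    "url": "s3://some-bucket/CMIP6/.../tas_Amon_example.nc",  # placeholder path
    "sha256": "<expected checksum published by the data authority>",
    "version": "v20200101",  # placeholder version label
}


def open_verified(entry):
    """Fetch a file and refuse to hand it over if its checksum doesn't match."""
    with fsspec.open(entry["url"], "rb", anon=True) as f:
        data = f.read()
    digest = hashlib.sha256(data).hexdigest()
    if digest != entry["sha256"]:
        raise ValueError(
            f"{entry['url']} does not match the catalogued checksum for "
            f"{entry['version']} - the copy may be stale or retracted."
        )
    return data
```

Reading the whole file just to hash it is obviously too heavy for big netCDFs, so a real version would check a manifest or chunk-level hashes instead - but that's the kind of "idiot-proofing at access time" I had in mind.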

Not NFTs per se, but the concept of content-addressable storage (CAS) would definitely go a long way towards ensuring data integrity, because checksums are verified every time the data are accessed. CAS is the basis of IPFS, which is a "web3" style of storage and underlies the Filecoin ecosystem. We have been working with the folks from Protocol Labs on an fsspec / IPFS interface.
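
To make that concrete, here is a minimal sketch of the CAS idea in plain Python (no IPFS dependency, just an in-memory store): the address of a blob is derived from a hash of its bytes, so any corruption or substitution is detected the moment you read it back.

```python
import hashlib

# Minimal in-memory content-addressable store: objects are keyed by the
# SHA-256 of their bytes, so the address itself certifies the content.
store = {}


def put(data: bytes) -> str:
    """Store a blob and return its content address (hex digest)."""
    address = hashlib.sha256(data).hexdigest()
    store[address] = data
    return address


def get(address: str) -> bytes:
    """Retrieve a blob and verify it still matches its address."""
    data = store[address]
    if hashlib.sha256(data).hexdigest() != address:
        raise ValueError("Content no longer matches its address")
    return data


key = put(b"example dataset bytes")
assert get(key) == b"example dataset bytes"
```

IPFS CIDs work the same way conceptually, just with a richer multihash format and a distributed network behind them, so a filesystem-style interface on top gives you that verification transparently on every read.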

There has also been a long discussion of this, and how it relates to Zarr, on the Zarr repo.

Regarding trust and validation of cloud-based data stores compared to archival data, the biggest development in that space in my mind is Kerchunk. Kerchunk allows us to get Zarr-level performance on the original netCDFs, which presumably have checksums from the original data provider.
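
For anyone who hasn't seen it, the workflow looks roughly like this (the path and storage options below are placeholders, not the real CMIP6 setup; the actual reference files are built and published by the working group):

```python
import fsspec
import xarray as xr
from kerchunk.hdf import SingleHdf5ToZarr

# Placeholder location of an original netCDF file in object storage.
url = "s3://some-bucket/CMIP6/.../tas_Amon_example.nc"

# Scan the netCDF/HDF5 file once and build a Zarr-style set of byte-range
# references that point back at the original file (no data are copied).
with fsspec.open(url, "rb", anon=True) as f:
    refs = SingleHdf5ToZarr(f, url).translate()

# Open the references as if they were a Zarr store; every read is served
# from the archival netCDF, so the provider's checksums still apply to it.
mapper = fsspec.get_mapper(
    "reference://",
    fo=refs,
    remote_protocol="s3",
    remote_options={"anon": True},
)
ds = xr.open_dataset(mapper, engine="zarr", backend_kwargs={"consolidated": False})
```

So you get cloud-friendly, chunked access without ever creating a second "shadow" copy of the data.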


That sounds really exciting - thanks for the post.