Tracking provenance in xarray

There are a couple of “climate services” teams around the world who are looking into provenance tracking within xarray. If you’re not familiar with provenance, or data lineage, it’s essentially a record describing the entities and processes involved in the production and delivery of a resource. Provenance documents are typically machine-readable and use a formal syntax (e.g. PROV-O, the PROV Ontology). To be clear, this is not meant to replace human-readable “history” or “comment” attributes.

We started a discussion about this in the GitHub issue “Provenance tracking using semantic web technologies?” (Issue #228, xarray-contrib/cf-xarray) and wanted to draw attention to it to gather some design and implementation ideas. What I find challenging is that different users, even from the same discipline, will probably want to use different semantics to describe the same operation, or to describe it at different levels of detail. Also, I wouldn’t want provenance tracking code to obfuscate or break the actual code I’m running.

It’s also not clear what the user interface for this should look like. Should it be a context manager, e.g.
with provenance(level="INFO"): ..., or should it use some kind of lazy task graph?

So we’re essentially looking for ideas, comments and suggestions from the community regarding the design, implementation and use of provenance tracking in xarray.

Cheers,

David

One consideration is having a data hash so you can very quickly determine if you are running on the same data as someone else. It is computationally cheap and doesn’t add much data burden.
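As a rough sketch of that idea, the snippet below hashes the raw bytes of an array together with its dtype and shape. The function name data_fingerprint and the choice of SHA-256 are assumptions for illustration. It uses the standard-library array module as a stand-in for real data; with xarray one would presumably pass something like np.ascontiguousarray(da.values).tobytes().

```python
import array
import hashlib


def data_fingerprint(raw, dtype, shape):
    """Return a SHA-256 hex digest over the dtype, shape and raw bytes."""
    digest = hashlib.sha256()
    # Include dtype and shape so the same bytes with a different layout
    # produce a different fingerprint.
    digest.update(f"{dtype}|{tuple(shape)}|".encode("utf-8"))
    digest.update(raw)
    return digest.hexdigest()


a = array.array("d", [1.0, 2.0, 3.0, 4.0])
b = array.array("d", [1.0, 2.0, 3.0, 4.0])
c = array.array("d", [1.0, 2.0, 3.0, 4.5])

fp_a = data_fingerprint(a.tobytes(), "float64", (2, 2))
fp_b = data_fingerprint(b.tobytes(), "float64", (2, 2))
fp_c = data_fingerprint(c.tobytes(), "float64", (2, 2))
fp_flat = data_fingerprint(a.tobytes(), "float64", (4,))

print(fp_a == fp_b)  # identical data gives identical fingerprints
```

Two people can then compare short hex strings instead of the data itself. Note that computing the hash still requires reading every byte, so for lazily loaded data it forces the values to be loaded.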