Tracking provenance in xarray

huard · May 25, 2021, 8:26pm

There are a couple of “climate services” teams around the world who are looking into provenance tracking within xarray. If you’re not familiar with provenance, or data lineage, it’s essentially a record describing the entities and processes involved in the production and delivery of a resource. Provenance documents are typically machine-readable and use a formal syntax (e.g. PROV-O, the PROV Ontology). To be clear, this is not meant to replace human-readable “history” or “comment” attributes.
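To make "machine-readable" a bit more concrete, here is a minimal sketch of what such a record could look like. It uses the PROV-JSON layout (a W3C Member Submission serialization of the PROV data model, not PROV-O itself) and only the standard library. The `ex` namespace, the file names and the `annual_mean` activity are all made up for illustration.

```python
import json

# All identifiers below are hypothetical and live in a made-up "ex" namespace.
doc = {
    "prefix": {"ex": "http://example.org/"},
    # Entities: the resources being described (here, an input and an output file).
    "entity": {
        "ex:tas_daily.nc": {},
        "ex:tas_annual.nc": {},
    },
    # Activities: the processes that act on entities.
    "activity": {"ex:annual_mean": {}},
    # Relations linking activities and entities.
    "used": {
        "_:u1": {"prov:activity": "ex:annual_mean", "prov:entity": "ex:tas_daily.nc"}
    },
    "wasGeneratedBy": {
        "_:g1": {"prov:entity": "ex:tas_annual.nc", "prov:activity": "ex:annual_mean"}
    },
}

print(json.dumps(doc, indent=2))
```

A free-text "history" attribute could say the same thing, but a structure like this can be queried and merged by software.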

We started a discussion about this in issue #228 of xarray-contrib/cf-xarray on GitHub (“Provenance tracking using semantic web technologies?”) and wanted to draw attention to it to get some design and implementation ideas. What I find challenging is that different users, even from the same discipline, will probably want to use different semantics to describe the same operation, or describe it at different levels of detail. Also, I wouldn’t want provenance tracking code to obfuscate or break the actual code I’m running.

It’s also not clear what the user interface for this should look like. Is it a context manager, e.g. with provenance(level="INFO"): ..., or should it use some kind of lazy task graph?
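As a rough illustration of the context-manager option, here is a minimal sketch using only the standard library. Nothing here is an existing xarray or cf-xarray API: `provenance`, `record`, the `LEVELS` table and `annual_mean` are hypothetical names, and a plain list of floats stands in for a DataArray. The idea is that instrumented functions emit records at a given level of detail, and the records are only kept while a `provenance` block is active.

```python
import contextlib
import contextvars

# Hypothetical detail levels, mirroring the logging module's names.
LEVELS = {"DEBUG": 10, "INFO": 20}

# The active provenance log, or None when tracking is switched off.
_active = contextvars.ContextVar("provenance_log", default=None)


@contextlib.contextmanager
def provenance(level="INFO"):
    """Collect provenance records emitted inside the block (hypothetical API)."""
    log = {"threshold": LEVELS[level], "records": []}
    token = _active.set(log)
    try:
        yield log["records"]
    finally:
        # Always restore the previous state, even if the block raises.
        _active.reset(token)


def record(activity, level="INFO", **details):
    """Called by instrumented functions; does nothing outside a provenance block."""
    log = _active.get()
    if log is not None and LEVELS[level] >= log["threshold"]:
        log["records"].append({"activity": activity, **details})


def annual_mean(values):
    """Stand-in for a real computation on a DataArray."""
    record("annual_mean", n_inputs=len(values))
    record("annual_mean.sum", level="DEBUG")  # finer-grained step
    return sum(values) / len(values)


with provenance(level="INFO") as records:
    result = annual_mean([1.0, 2.0, 3.0])

print(result, records)
```

With this design, calling `annual_mean` outside a `with provenance(...)` block records nothing, so the tracking code stays out of the way of the actual computation. A lazy task graph would instead derive the same information from the graph structure, at the cost of tying provenance to one execution engine.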

So we’re essentially looking for ideas, comments and suggestions from the community regarding the design, implementation and use of provenance tracking in xarray.

Cheers,

David
