To @d70-t 's question, I may have changed my mind somewhat. The "data" may be language agnostic, typically just a type and a URL or other unique prescription, as opposed to the "reader", which is specific to the API you intend to call. So an Intake 2 catalogue with various descriptions would still be useful without Python. We could implement a similar scheme for non-Python readers, but that would require implementations in each of the target languages.
Hi @martindurant, thanks that was very interesting.
I really like the idea of pipelines. Would this work to transparently add metadata to data? Specifically, I'm thinking of xarray reading netCDF data in the earth system space with Intake-ESM.
CMORisation is a common step for submission to CMIP/ESGF, so there are tools that assume data is CMORised. It would be great if the required metadata, and perhaps other transformations, could be represented as a "CMOR view" into the original dataset.
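To illustrate the kind of "CMOR view" I mean, here is a minimal sketch in plain xarray (not any Intake machinery; the variable names, renames and attribute values are made up for illustration):

```python
import xarray as xr

def cmorise_view(ds: xr.Dataset) -> xr.Dataset:
    """Return a 'CMOR view' of a raw model dataset.

    If the dataset is lazily backed, nothing is loaded; xarray just records
    the renames and attribute updates, so the original files stay untouched.
    """
    # Hypothetical mapping from native variable names to CMOR names
    renames = {"temp": "tas", "precip": "pr"}
    view = ds.rename({k: v for k, v in renames.items() if k in ds})
    # Attach the global metadata that CMOR-aware tools expect (values illustrative)
    view.attrs.update({
        "Conventions": "CF-1.7",
        "frequency": "mon",
        "mip_era": "CMIP6",
    })
    if "tas" in view:
        view["tas"].attrs.update({"standard_name": "air_temperature", "units": "K"})
    return view
```

The hope would be that a step like this could live in the catalog itself, so downstream CMOR-assuming tools see the view rather than the raw files.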
And yes, we are using intake, and would like to continue doing so. We’re also using the same intake catalog as a data source for some prototype discovery interfaces.
Thanks for the presentation on your thinking about intake 2!
One of the challenges that we have been trying to tackle with intake is a declarative approach to the setup and execution of coastal ocean (mostly numerical) models. A model developer sits squarely in your user definition of "has access to lots of data and needs to manipulate it", in order to transform the mostly netCDF-like data sources into bespoke model forcing.
Due to the preponderance of gridded data in this domain, xarray is the container of choice for the source data. This is then usually composed with geospatial data (polylines etc.) using bespoke code, which writes the result out in the specific input file format for the model to be run.
The general architecture that we have been working towards to modularise the code is roughly:
(1) Data (Intake Catalog) → (2) Intake driver → (3) quasi-standardised xr.Dataset → (4) xarray accessor, i.e. ds.swan.to_*() methods for model-specific files.
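For step (4), the accessor is just xarray's standard registration mechanism; a minimal sketch (the `swan` name and the method body are illustrative, not our actual implementation):

```python
import xarray as xr

@xr.register_dataset_accessor("swan")
class SwanAccessor:
    """Adds model-specific export methods so that ds.swan.to_grid(...) works."""

    def __init__(self, ds: xr.Dataset):
        self._ds = ds

    def to_grid(self, path: str) -> None:
        # Illustrative only: a real implementation would write the model's
        # native input format; here we just dump the quasi-standardised dataset.
        self._ds.to_netcdf(path)
```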
Many of the mostly simple transformations in (2) should be achievable with pretty basic calls to the xarray API (variable renaming, etc.), much like the CMORisation process @aidan described above. For this we have wrapped some basic xarray methods with a dictionary parameterisation passed as an Intake driver argument, but the approach you have taken here is much more generic and alleviates that need.
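To be concrete about the kind of wrapping I mean, a rough sketch (the spec keys and function name are our own convention, not part of any Intake interface):

```python
import xarray as xr

def apply_transforms(ds: xr.Dataset, spec: dict) -> xr.Dataset:
    """Apply a small, declarative set of xarray operations described by a dict.

    Example spec, as might be passed through a driver argument:
        {"rename": {"T": "temperature"}, "sel": {"depth": 0}, "attrs": {"source": "model-X"}}
    """
    if "rename" in spec:
        ds = ds.rename(spec["rename"])
    if "sel" in spec:
        ds = ds.sel(spec["sel"])
    if "attrs" in spec:
        ds = ds.assign_attrs(spec["attrs"])
    return ds
```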
Whilst it might seem that this layers yet another ontology on top of other library interfaces, where I think this has great utility is that the Intake 2 catalog effectively becomes a way to distribute opinionated boilerplate definitions without the need to install yet another library (maybe).
The other possibility that seems pretty cool is the caching aspect, and the possibility of using it to seamlessly checkpoint. I'm wondering if we could point to concrete URIs part-way through a transformation pipeline, and, if those exist, read from them instead of re-executing the earlier steps in the pipeline. Not sure about provenance here, but maybe a hash of the upstream pipeline could be used to warn the user of inconsistency.
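To make the hash idea concrete, something as simple as fingerprinting the serialised definition of the upstream steps might be enough; a purely hypothetical sketch, not anything that exists in Intake today:

```python
import hashlib
import json

def pipeline_fingerprint(steps: list[dict]) -> str:
    """Hash a declarative description of the upstream pipeline steps.

    If the fingerprint stored alongside a checkpoint no longer matches the
    current pipeline definition, the cached intermediate is stale and the
    user can be warned (or the earlier steps re-executed).
    """
    blob = json.dumps(steps, sort_keys=True).encode()
    return hashlib.sha256(blob).hexdigest()

# e.g. the checkpoint location could be derived from the fingerprint:
# cache_uri = f"s3://my-bucket/checkpoints/{pipeline_fingerprint(steps)}"
```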
@martindurant Wondering about how you would see the following working:
- I have two catalogs (A, B) and I create an entry in catalog C with a driver that takes as input, say, A.ds and B.df - could that be captured declaratively with Intake 2?
Finally, thanks for your work on this, and to @rsignell for putting up the recording for those in the antipodes!
Thanks @martindurant for getting back!
Separating "data" and "reader" could indeed be a nice design decision in this respect. If the "data" section is formulated in mostly language-agnostic terms (which I guess it is), a "data"-only catalog could be something like a baseline service, which can be supported relatively easily across languages. Then there would be the "reader"-enabled catalogs, which provide additional service to users who are fine with using Python.
As with Intake 1, a catalog can be an entry in another catalog, so you can already achieve that workflow in Intake 2 by having catalog C contain YAMLCatalog readers for A and B, and then a third entry that depends on them. There is no automagic way to do it, though; at the moment it would involve editing kwargs dicts in place.
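For concreteness, catalog C could be written out roughly like the following; the structure and field names are purely illustrative and do not reflect the actual Intake 2 catalog schema:

```python
import yaml

# Entirely hypothetical layout: "catalog C contains readers for catalogs A and B,
# plus an entry that depends on both".
catalog_c = {
    "entries": {
        "cat_a": {"reader": "YAMLCatalog", "url": "catalog_a.yaml"},
        "cat_b": {"reader": "YAMLCatalog", "url": "catalog_b.yaml"},
        "combined": {
            "reader": "MyDriver",  # the driver that needs both inputs
            # There is currently no automatic cross-catalog templating; in
            # practice these kwargs get edited in place to point at the
            # loaded A.ds / B.df objects.
            "kwargs": {"ds": "<A.ds>", "df": "<B.df>"},
        },
    }
}

with open("catalog_c.yaml", "w") as f:
    yaml.safe_dump(catalog_c, f)
```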
Links to the examples @martindurant presented:
What the rewrite is for (github.com)
Intake 2 examples (github.com)
I should probably make a repo for examples and make sure that they are up to date. I am conscious that the stuff I have shown is pretty technical "this is how things work" rather than end-user "this is how I get things done". I definitely need to make more of the latter (let's see if PyData gives me a talk!).
> The other possibility that seems pretty cool is the caching aspect, and the possibility of using it to seamlessly checkpoint. I'm wondering if we could point to concrete URIs part-way through a transformation pipeline, and, if those exist, read from them instead of re-executing the earlier steps in the pipeline. Not sure about provenance here, but maybe a hash of the upstream pipeline could be used to warn the user of inconsistency.
This test shows a very simple caching strategy. In this case, the condition on whether to load and store the original is simply whether the target cache directory is empty or not. Obviously more complex things could be done. Various storage systems do support unique hashes of data files (or the bytes could be read and hashed/checksummed), but I’m not exactly sure where this information would be stored.
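Spelled out, the strategy in that test is essentially the following (a re-sketch with generic pandas/os calls, not the actual test code or Intake's API):

```python
import os
import pandas as pd

def read_with_cache(source_url: str, cache_dir: str) -> pd.DataFrame:
    """Load from the cache if it has been populated, otherwise build it.

    The "is the cache directory empty?" check mirrors the simple condition
    described above; everything else here is illustrative.
    """
    if os.path.isdir(cache_dir) and os.listdir(cache_dir):
        # Cache already populated: skip loading and converting the original
        return pd.read_parquet(cache_dir)
    os.makedirs(cache_dir, exist_ok=True)
    df = pd.read_csv(source_url)  # stand-in for the original, expensive load
    df.to_parquet(os.path.join(cache_dir, "part-0.parquet"))
    return df
```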
Note that in all of this, the "metadata" that goes along with every stage of a pipeline is not really used for anything. We could store many pieces of provenance and process information there, but that doesn't really work too well in a YAML file that needs rewriting for any change. Of course, we could store it in (for example) a local sqlite file, an elasticsearch service or somewhere else.
Provenance is a very important consideration. I'm not sure why YAML isn't a decent target. It supports multiple documents, so provenance information could be appended as separate documents. The further away from the data/catalog the provenance lives, the easier it is for the connection to be severed and the provenance lost.
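For instance, with PyYAML a provenance record could be appended as an extra document after the catalog without rewriting it (the file name and record fields here are made up):

```python
from datetime import datetime, timezone

import yaml

record = {
    "event": "pipeline_run",
    "when": datetime.now(timezone.utc).isoformat(),
    "upstream_hash": "sha256:...",  # e.g. a fingerprint of the pipeline definition
}

# Append as a new YAML document ("---" separated) after the catalog document
with open("catalog.yaml", "a") as f:
    f.write("\n---\n")
    yaml.safe_dump(record, f)

# Readers that only want the catalog take the first document;
# provenance-aware tools can read them all.
with open("catalog.yaml") as f:
    catalog, *provenance = yaml.safe_load_all(f)
```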
I really, really, really like the idea of pipelines, though. I'm not a fan of duplicating data: it's wasteful and has a lot of inertia. If there is a problem with any of the constituent data or with the processing pipeline, it can be very difficult to get the derived data reprocessed.
It would also lower the barrier to many use cases that require regridding as an initial step. If you create a pipeline for regridding, it would be possible to generate a number of different end-points. Couple that with caching and you have on-demand regridding of products to a number of different targets, caching the well-used ones, while unused data can be deleted and regenerated when required. This would be so incredibly useful.
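As a rough, hypothetical sketch of what such an on-demand, cached regridding endpoint could look like (the cache layout and the simple interp-based regrid are stand-ins, not any real Intake machinery):

```python
import os

import xarray as xr

def regridded(ds: xr.Dataset, target_name: str, target_coords: dict,
              cache_root: str = "regrid_cache") -> xr.Dataset:
    """Return the dataset on a named target grid, regridding only on a cache miss."""
    path = os.path.join(cache_root, f"{target_name}.nc")
    if os.path.exists(path):
        return xr.open_dataset(path)   # well-used target: already cached
    out = ds.interp(target_coords)     # stand-in for a proper regridder
    os.makedirs(cache_root, exist_ok=True)
    out.to_netcdf(path)                # cache it; can be deleted and regenerated later
    return out
```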
The question is how and where to store the information. In the cache example I link above, the first pipeline (which creates the local parquet files) knows the details of when it was run, and returns the details of that parquet dataset which can include that metadata. However, the other pipeline just references the parquet files’ location. So, running the full pipeline needs to update the definition of the second pipeline within its catalog. That “catalog” may be a YAML file or some other source - writable or not - or maybe fully dynamic (in the test there is no catalog at all).
The alternative is storing the information elsewhere: sidecar files, a DB, or something else as yet undefined. Then you need to worry about multiple users and processes, and about making a unique description of the operations.
> If you create a pipeline for regridding, it would be possible to generate a number of different end-points. Couple that with caching and you have on-demand regridding of products to a number of different targets, caching the well-used ones
Yes, we can do that. Whether we would have a smart LRU-like cacher in Intake itself, to decide what files to evict and when, is another matter. fsspec faces a similar problem, although it is caching files rather than datasets.
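For illustration, the eviction half of such a cacher could be as simple as the following sketch (nothing like this exists in Intake; the policy and names are made up):

```python
import os

def evict_lru(cache_root: str, max_bytes: int) -> None:
    """Delete least-recently-accessed cached files until the cache fits under max_bytes."""
    files = [os.path.join(cache_root, name) for name in os.listdir(cache_root)]
    files.sort(key=os.path.getatime)  # oldest access first
    total = sum(os.path.getsize(f) for f in files)
    for f in files:
        if total <= max_bytes:
            break
        total -= os.path.getsize(f)
        os.remove(f)
```

The harder questions are the ones above: where the "last used" information lives, and how evicted datasets get regenerated consistently.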