Environment-agnostic data catalog

Preface: this is a very common problem, so at least one Pangeo user must have an elegant solution to it, hence my question. Please forgive any misuse of jargon in advance.

Question: like a lot of climate scientists, I have multiple research projects (and associated GitHub repositories) that need to access the same core datasets. Very few of those datasets are cloud-hosted; most are downloads of observational data (e.g. HadCRUT5) or simulated data (e.g. CMIP runs) stored on a disk at my university’s high-performance computing center. They are large enough that I don’t want to keep copying them into each repo, but not so large as to require Earthmover services (observational datasets of 10-20 MB; model output at the 1-10 GB scale).

For each repository, I would like to set up a configuration file (YAML or similar) that lists the paths to all the data files used in that repository. I assume that’s what a data catalog is?

What I mostly need is a set of pointers to a fixed collection of files that can be imported by my code (e.g. Jupyter notebook) so that it runs on that particular system. Bonus points if there is a way to export these pointers in a way that makes it easy for an external collaborator to replicate the work on their own system, in the same way that an environment.yml file lists package dependencies.

As I said: it’s a very common problem, and there probably is a well-known solution. Grateful for any pointers/ideas.

For something lightweight I often just create a Python file, config.py, with all the paths and some helpful comments. I import this in my code, which is easier than loading a YAML file.
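A minimal sketch of what such a config.py could look like. Every directory and file name below is a hypothetical placeholder, not a real dataset layout; adapt them to your own disk.

```python
# config.py: central list of data paths for one repository.
# All paths below are hypothetical placeholders.
from pathlib import Path

# Root of the shared data directory on this machine (assumed location).
DATA_ROOT = Path("/shared/climate_data")

# Observational products (small files).
HADCRUT5 = DATA_ROOT / "obs" / "hadcrut5_placeholder.nc"

# Model output (larger files), kept in one directory.
CMIP_DIR = DATA_ROOT / "cmip"

# Optional lookup table, handy when looping over datasets.
DATASETS = {"hadcrut5": HADCRUT5, "cmip": CMIP_DIR}
```

In a notebook this would be used as `from config import HADCRUT5`, so only config.py needs editing when a file moves.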

Once I start working on multiple systems, I modify config.py to read environment variables instead, and set those in my .bashrc or a .env file. For an example, see cBottle/src/cbottle/config/environment.py on the main branch of the NVlabs/cBottle repository on GitHub.
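One way the environment-variable version might look, as a rough sketch rather than what cBottle actually does. The variable name `CLIMATE_DATA_ROOT`, the relative paths, and the helper names are all assumptions made up for illustration. Only the root differs between machines, so the relative paths can be committed to the repo and shared with collaborators.

```python
# config.py: machine-independent version.
# The variable name and the relative paths are hypothetical.
import os
from pathlib import Path

ENV_VAR = "CLIMATE_DATA_ROOT"

# Paths relative to the data root; identical on every system.
RELATIVE_PATHS = {
    "hadcrut5": "obs/hadcrut5_placeholder.nc",
    "cmip": "cmip",
}


def data_root() -> Path:
    """Return the data root for this machine, read from the environment."""
    try:
        root = os.environ[ENV_VAR]
    except KeyError:
        raise RuntimeError(
            f"Set {ENV_VAR} to the directory that holds the datasets."
        ) from None
    return Path(root).expanduser()


def dataset_path(name: str) -> Path:
    """Resolve a dataset name to a full path on the current machine."""
    return data_root() / RELATIVE_PATHS[name]
```

Each user then adds a line such as `export CLIMATE_DATA_ROOT=/path/to/data` to their .bashrc, and `dataset_path("hadcrut5")` resolves correctly on every system. The root is looked up at call time rather than at import time, so a missing variable produces a clear error only when a path is actually requested.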

For more complicated cataloging needs, people may suggest intake, but that may be overkill for just getting started.

I indeed use intake for this. I wouldn’t consider it overkill since you want to use it in multiple projects and the catalogs are easy to set up and share. No server needed or anything.

Thank you both for the pointers. I’m game to try intake, as I may need intake-esm for other projects. @kthyng would you happen to have an example where the paths differ between file systems, so I can see how to emulate that?

Hi, +1 for intake, I use it for this as well.

It also did look a little too complicated to me when I first approached it and read the docs, and it took a little bit of reading the source code to understand that it is actually extremely simple and exactly what I needed (and nothing more). Very easily extensible, too.

Also be aware that intake v1 and v2 implementations live in the same code base, and the docs for the two are confusingly intermingled.
