Preface: this is a very common problem, so at least one Pangeo user must have an elegant solution to it, hence my question. Please forgive my misuse of jargon in advance.
Question: like a lot of climate scientists, I have multiple research projects (and associated GitHub repositories) that need to access the same core datasets. Very few of those datasets are cloud-hosted; most are downloads of observational data (e.g. HadCRUT5) or simulated data (e.g. CMIP runs) stored on a disk at my university’s high-performance computing center. They are large enough that I don’t want to keep copying them into each repo, but not so large as to require Earthmover services (observational datasets of 10-20 MB; model output at the 1-10 GB scale).
For each repository, I would like to set up a configuration file (yml or similar) that lays out the path to all the data files used in that repository. I assume that’s what a data catalog is?
What I mostly need is a set of pointers to a fixed collection of files that can be imported by my code (e.g. Jupyter notebook) so that it runs on that particular system. Bonus points if there is a way to export these pointers in a way that makes it easy for an external collaborator to replicate the work on their own system, in the same way that an environment.yml file lists package dependencies.
As I said: it’s a very common problem, and there probably is a well-known solution. Grateful for any pointers/ideas.
For something lightweight I often just create a Python file, config.py, with all the paths and some helpful comments. I import this in my code, which is easier than loading a YAML file.
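A minimal sketch of what such a config.py could look like. Everything in it is a placeholder chosen for illustration: the environment variable name CLIMATE_DATA_ROOT, the default root, the dataset keys, the file names, and the helper path_to are all assumptions, not part of anyone's actual setup.

```python
# config.py: hypothetical example; every name and path below is a placeholder.
import os
from pathlib import Path

# Root of the shared data directory. A collaborator can point this elsewhere
# by setting the (assumed) environment variable CLIMATE_DATA_ROOT.
DATA_ROOT = Path(os.environ.get("CLIMATE_DATA_ROOT", "/path/to/shared/data"))

# Dataset locations, stored relative to DATA_ROOT so that only the root
# changes from one system to another.
DATASETS = {
    "hadcrut5": "obs/HadCRUT5/hadcrut5_example.nc",  # observational, small
    "cmip_tas": "cmip/tas_example_run.nc",           # model output, larger
}


def path_to(name: str) -> Path:
    """Return the full path for a dataset key; raises KeyError if unknown."""
    return DATA_ROOT / DATASETS[name]
```

In a notebook this would be used as `from config import path_to` followed by, for example, opening `path_to("hadcrut5")` with whatever reader you normally use. Keeping the paths relative to a single root is a design choice: the dictionary can be committed to the repo unchanged, and only the root differs per machine.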
I indeed use intake for this. I wouldn’t consider it overkill since you want to use it in multiple projects and the catalogs are easy to set up and share. No server needed or anything.
Thank you both for the pointers. I’m game to try intake, as I may need intake-esm for other projects. @kthyng would you happen to have an example where paths are different on different file systems, so I can see how to emulate that?
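Independently of intake, one way to handle paths that differ between file systems is to choose the data root from the machine's hostname. The sketch below uses only the standard library; the hostname prefixes, root directories, and function names are hypothetical, and it is not how intake itself resolves paths.

```python
import socket
from pathlib import Path

# Hypothetical mapping from hostname prefix to the data root on that system.
ROOTS = {
    "hpc-login": "/project/shared/climate_data",
    "laptop": "/Users/me/data",
}


def resolve_root(hostname, roots, default=None):
    """Return the data root whose prefix matches the hostname."""
    for prefix, root in roots.items():
        if hostname.startswith(prefix):
            return Path(root)
    if default is None:
        raise LookupError(f"no data root configured for host {hostname!r}")
    return Path(default)


def local_root(roots=ROOTS, default=None):
    """Resolve the data root for the machine this code is running on."""
    return resolve_root(socket.gethostname(), roots, default)
```

A collaborator would add one entry for their own machine to the mapping (or pass a `default`), and the rest of the code would build file paths relative to the returned root.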
Intake also looked a little too complicated to me when I first approached it and read the docs. It took some reading of the source code to understand that it is actually extremely simple and exactly what I needed (and nothing more). It is very easily extensible, too.
Also be aware that the intake v1 and v2 implementations live in the same code base, and the docs for the two are weirdly intermingled.