Environment-agnostic data catalog

CommonClimate · August 9, 2025, 12:33am

Preface: this is a very common problem to which >=1 pangeo users must have an elegant solution, hence my question. Please forgive my misuse of jargon in advance.

Question: like a lot of climate scientists, I have multiple research projects (and associated GitHub repositories) that need to access the same core datasets. Very few of those datasets are cloud-hosted; most are downloads of observational data (e.g. HadCRUT5) or simulated data (e.g. CMIP runs) stored on a disk at my university’s high-performance computing center. They are large enough that I don’t want to keep copying them into each repo, but not so large as to require Earthmover services (observational datasets of 10-20 MB; model output at the 1-10 GB scale).

For each repository, I would like to set up a configuration file (yml or similar) that lays out the path to all the data files used in that repository. I assume that’s what a data catalog is?

What I mostly need is a set of pointers to a fixed collection of files that can be imported by my code (e.g. Jupyter notebook) so that it runs on that particular system. Bonus points if there is a way to export these pointers in a way that makes it easy for an external collaborator to replicate the work on their own system, in the same way that an environment.yml file lists package dependencies.

As I said: it’s a very common problem, and there probably is a well-known solution. Grateful for any pointers/ideas.

nbren12 · August 10, 2025, 4:16pm

For something lightweight I often just create a Python file config.py with all the paths and some helpful comments. I’ll import this in my code, which is easier than loading a YAML file.

Once I start working on multiple systems I’ll modify the config.py to use environment variables instead and then put those in my bashrc or a .env file. See this example: cBottle/src/cbottle/config/environment.py on the main branch of NVlabs/cBottle on GitHub.

For more complicated cataloging needs people may suggest intake, but that may be overkill for just getting started.
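
A minimal sketch of what such a config.py could look like, assuming a single environment variable points at the shared data directory on each system. The variable name CLIMATE_DATA_ROOT, the dataset keys, and the relative file paths are all hypothetical placeholders, not taken from the linked cBottle example:

```python
# config.py: an illustrative layout; dataset names and file paths are placeholders.
import os
from pathlib import Path

# Environment variable holding the root of the shared data directory on this
# machine. The name CLIMATE_DATA_ROOT is an assumption made for this sketch.
DATA_ROOT_VAR = "CLIMATE_DATA_ROOT"

# Paths relative to the data root, so only the root changes between systems.
DATASETS = {
    "hadcrut5": "obs/HadCRUT5/hadcrut5_analysis.nc",
    "cmip_tas": "cmip/tas_historical.nc",
}


def data_root() -> Path:
    """Return the data root from the environment, failing with a clear message."""
    root = os.environ.get(DATA_ROOT_VAR)
    if not root:
        raise RuntimeError(
            f"Set {DATA_ROOT_VAR} to the directory holding the shared datasets"
        )
    return Path(root).expanduser()


def dataset_path(name: str) -> Path:
    """Resolve a dataset name to its full path on the current system."""
    try:
        relative = DATASETS[name]
    except KeyError:
        raise KeyError(f"Unknown dataset {name!r}; known: {sorted(DATASETS)}") from None
    return data_root() / relative
```

In a notebook this would be used as `from config import dataset_path` followed by something like `xr.open_dataset(dataset_path("hadcrut5"))`. Because only the root differs between machines, a collaborator would need to set one variable in their bashrc or .env file and mirror the relative layout.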

kthyng · August 11, 2025, 11:34pm

I indeed use intake for this. I wouldn’t consider it overkill since you want to use it in multiple projects and the catalogs are easy to set up and share. No server needed or anything.

CommonClimate · August 18, 2025, 8:42pm

Thank you both for the pointers. I’m game to try intake, as I may need intake-esm for other projects. @kthyng would you happen to have an example where paths are different on different file systems, so I can see how to emulate that?

vladidobro · August 19, 2025, 5:58am

Hi, +1 for intake, I use it for this as well.

It also did look a little too complicated to me when I first approached it and read the docs, and it took a little bit of reading the source code to understand that it is actually extremely simple and exactly what I needed (and nothing more). Very easily extensible, too.

Also, be aware that there are intake v1 and v2 implementations in the same code base, and the docs for the two are confusingly intermingled.
