What's Next — Data management

I’ll start the discussion by describing what I think of as open questions for each of the topics I have opinions on (I mostly work on an HPC system, so I have no opinions on how best to move data from HPC to a cloud bucket). Since these are based on my personal experience I’m most likely missing a lot, so feel free to add to this or correct me if I’m getting anything wrong.

catalogs

(my view on this is heavily influenced by STAC, and I can’t really comment on other catalog formats / implementations)

I would like to use catalogs for three main things: data discovery, finding specific items in a dataset, and looking up file paths / urls to actually access the data (along with certain additional settings).

By data discovery I mean something along the lines of “I’m looking for data with these variables / physical quantities / properties, tell me everything you’ve got”, which would correspond to looking up collections in STAC. This is usually easier to accomplish using a graphical browser.
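A minimal sketch of that kind of lookup, assuming each collection advertises its variables under a `cube:variables` field in the style of the STAC datacube extension. The collection ids, variable names, and the `find_collections` helper are made up for illustration; a real catalog would be read through a STAC client rather than a hard-coded list.

```python
# Collection summaries shaped loosely like STAC collections (hypothetical content).
collections = [
    {
        "id": "ocean-model-run",
        "cube:variables": {
            "sst": {"type": "data", "unit": "degC"},
            "salinity": {"type": "data", "unit": "psu"},
        },
    },
    {
        "id": "argo-floats",
        "cube:variables": {
            "temperature": {"type": "data", "unit": "degC"},
            "salinity": {"type": "data", "unit": "psu"},
        },
    },
    # A collection without variable metadata can never match a variable query.
    {"id": "satellite-scenes"},
]


def find_collections(collections, wanted):
    """Return the ids of collections that provide every requested variable."""
    wanted = set(wanted)
    return [
        collection["id"]
        for collection in collections
        if wanted <= set(collection.get("cube:variables", {}))
    ]


with_salinity = find_collections(collections, ["salinity"])
with_sst_and_salinity = find_collections(collections, ["sst", "salinity"])
```

The point of the sketch is that discovery only works as well as the metadata attached to each collection: the third collection is invisible to this query simply because nothing describes its variables.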

Once I know which dataset I want to use, I might want to narrow it down to just a small subset. This depends on the on-disk format of the data: if it is split into lots of different files (examples would be satellite imagery or in-situ data), I might want to query the catalog for those that match some criteria. If the dataset is in a single store, this would have to be done after opening the data.
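For the many-files case, the query could look roughly like the following. This is a plain-Python sketch over dictionaries shaped like STAC items (a `bbox` plus a `properties.datetime`), not the API of any catalog library; a STAC API would normally evaluate such a search on the server. The item ids, coordinates, and helper names are hypothetical.

```python
from datetime import datetime, timezone


def parse_time(value):
    """Parse an RFC 3339 timestamp such as '2021-06-01T00:00:00Z'."""
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def bbox_intersects(a, b):
    """Check whether two [west, south, east, north] boxes overlap."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def search_items(items, bbox, start, end):
    """Return ids of items that overlap `bbox` and fall within [start, end]."""
    return [
        item["id"]
        for item in items
        if bbox_intersects(item["bbox"], bbox)
        and start <= parse_time(item["properties"]["datetime"]) <= end
    ]


# Hypothetical items, one per file of the dataset.
items = [
    {"id": "scene-a", "bbox": [-5.0, 47.0, -3.0, 49.0],
     "properties": {"datetime": "2021-06-01T00:00:00Z"}},
    {"id": "scene-b", "bbox": [10.0, 40.0, 12.0, 42.0],
     "properties": {"datetime": "2021-06-02T00:00:00Z"}},
    {"id": "scene-c", "bbox": [-4.5, 48.0, -2.5, 50.0],
     "properties": {"datetime": "2022-01-01T00:00:00Z"}},
]

matches = search_items(
    items,
    bbox=[-6.0, 46.0, -2.0, 50.0],
    start=datetime(2021, 1, 1, tzinfo=timezone.utc),
    end=datetime(2021, 12, 31, tzinfo=timezone.utc),
)
```

For a dataset in a single store there is only one item to find, so the equivalent selection has to happen on the opened data instead.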

Finally, hard-coding file paths / urls of datasets in application code is usually very messy, so I’d like to be able to read those from the catalog.
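As a sketch of what reading those from the catalog could look like, assume the item stores the location in an asset `href` and the additional settings in `xarray:open_kwargs` / `xarray:storage_options` fields, as in the STAC xarray-assets extension. The item content, the bucket url, and the `resolve_asset` helper are hypothetical.

```python
# An item shaped like a STAC item with a single data asset (hypothetical content).
item = {
    "id": "model-run-2020",
    "assets": {
        "data": {
            "href": "s3://example-bucket/model-run-2020.zarr",
            "xarray:open_kwargs": {"engine": "zarr", "chunks": {}},
            "xarray:storage_options": {"anon": True},
        }
    },
}


def resolve_asset(item, key="data"):
    """Return (href, open_kwargs, storage_options) for one asset of an item.

    Missing settings fall back to empty dicts, so the caller can always
    unpack the result the same way.
    """
    asset = item["assets"][key]
    open_kwargs = dict(asset.get("xarray:open_kwargs", {}))
    storage_options = dict(asset.get("xarray:storage_options", {}))
    return asset["href"], open_kwargs, storage_options


href, open_kwargs, storage_options = resolve_asset(item)
```

Application code then only needs the catalog location and the item id; the path and the settings for opening the data stay in one place and can change without touching the code.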

Most catalog implementations already support most of these in principle, so the question I have is: how do we represent the datasets we’re typically working with in a catalog to support these different tasks?

Since it is the main focus of STAC there’s a lot of work on this for satellite imagery, but I am not aware of guidance for model or in-situ data (although maybe I didn’t look hard enough). Representing in-situ data in a STAC catalog is only partially solved / not standardized, but in any case I’m hoping to have a collection of resources on how to best do this. I know @TomAugspurger has done some work on this with xstac, but even for that you still need to create a template.

Related to that, how do we best structure the workflow that takes the data files / store and creates the catalog (might be STAC specific, as well)?

I’d be particularly interested to see if it makes sense to take some of the openers / transforms from the pangeo-forge-recipes package (as it looks after the Beam refactor) and combine them with catalog-specific transforms like a CreateXstacItem transform.
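To make the shape of such a workflow concrete, here is a plain-Python sketch of the two stages: one that reads metadata from a file (standing in for an opener) and one that turns that metadata into an item. It uses neither Apache Beam nor xstac; `create_item`, `build_catalog`, the metadata fields, and the file names are all hypothetical, and the item dictionary only follows the general layout of a STAC item.

```python
from datetime import datetime, timezone


def bbox_to_polygon(bbox):
    """Convert [west, south, east, north] into a closed GeoJSON polygon."""
    west, south, east, north = bbox
    ring = [[west, south], [east, south], [east, north], [west, north], [west, south]]
    return {"type": "Polygon", "coordinates": [ring]}


def create_item(path, metadata):
    """Build a dictionary laid out like a STAC item for one data file."""
    return {
        "type": "Feature",
        "stac_version": "1.0.0",
        "id": metadata["id"],
        "geometry": bbox_to_polygon(metadata["bbox"]),
        "bbox": list(metadata["bbox"]),
        "properties": {"datetime": metadata["time"].isoformat()},
        "links": [],
        "assets": {"data": {"href": path}},
    }


def build_catalog(paths, read_metadata):
    """Run the 'open' stage and the 'create item' stage for every file."""
    return [create_item(path, read_metadata(path)) for path in paths]


def fake_read_metadata(path):
    """Stand-in for an opener: a real one would inspect the file itself."""
    name = path.rsplit("/", 1)[-1].split(".")[0]
    return {
        "id": name,
        "bbox": [-10.0, 40.0, 0.0, 50.0],
        "time": datetime(2020, 1, 1, tzinfo=timezone.utc),
    }


catalog_items = build_catalog(
    ["/data/run-001.nc", "/data/run-002.nc"], fake_read_metadata
)
```

In a Beam pipeline each of these functions would become its own transform, which is what makes reusing existing openers attractive: only the item-creation stage would be catalog specific.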

efficient access to archival formats

I believe most of this can be resolved by pointing towards kerchunk, although maybe there are data formats that can’t make use of it (because they are not yet supported, or there’s a technical reason why it can’t be) or datasets where the individual files can’t be stacked to form a bigger dataset?
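For context, the core idea behind kerchunk is a reference file that maps chunk keys either to inline content or to a byte range (`[url, offset, length]`) inside the original archival file. The sketch below illustrates only that mapping with the standard library; it does not call kerchunk, the file and variable names are made up, and it ignores details such as base64-encoded inline values and chunk compression.

```python
import os
import tempfile


def read_chunk(references, key):
    """Resolve one key of a kerchunk-style reference set to raw bytes."""
    entry = references["refs"][key]
    if isinstance(entry, str):
        # Inline content, typically small JSON metadata.
        return entry.encode()
    # Otherwise the entry points at a byte range inside another file.
    url, offset, length = entry
    with open(url, "rb") as f:
        f.seek(offset)
        return f.read(length)


with tempfile.TemporaryDirectory() as tmp:
    # A stand-in "archival" file: a 6-byte header followed by two 7-byte chunks.
    path = os.path.join(tmp, "archive.bin")
    with open(path, "wb") as f:
        f.write(b"HEADER" + b"chunk-0" + b"chunk-1")

    references = {
        "version": 1,
        "refs": {
            ".zgroup": '{"zarr_format": 2}',
            "temperature/0": [path, 6, 7],
            "temperature/1": [path, 13, 7],
        },
    }

    first = read_chunk(references, "temperature/0")
    second = read_chunk(references, "temperature/1")
```

This also hints at where the approach could break down: it relies on every chunk being addressable as a contiguous byte range, and on the per-file references being combinable into one consistent set of keys.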
