What's Next — Data management

Yet another follow-up to the showcase discussion on Dec 6

This topic is focused on data management. The meeting notes so far list three subtopics:

  • cataloging of data
  • transferring data from HPC to the cloud
  • efficient access to archival formats

(feel free to correct me if I misunderstood anything)

Do we agree that these are the topics we want to push forward? Is there anything important that we missed in preparation for, or during, the showcase yesterday?

In any case, I think we should have a separate thread for each of these subtopics to keep the discussion focused.

@rabernat @norlandrhagen @TomAugspurger @jrbourbeau @martindurant @jmunroe @maxrjones @cspencerjones @dcherian @Thomas_Moore

@betolink, @TomNicholas (split off because I can only ping 10 users at once)

I agree with these topics, let’s get going.
(I don’t use an HPC system myself, so I may be less useful there)

Then I’ll start by describing what I think of as the open questions for each of the topics I have opinions on (I mostly work on an HPC system, so I have no opinion on how best to move data from HPC to a cloud bucket). Since these are based on my personal experience, I’m most likely missing a lot, so feel free to add to this or correct me if I’m getting anything wrong.

catalogs

(my view on this is heavily influenced by STAC, and I can’t really comment on other catalog formats / implementations)

I would like to use catalogs for three main things: data discovery, finding specific items in a dataset, and looking up file paths / URLs to actually access the data (along with certain additional settings).

By data discovery I mean something along the lines of “I’m looking for data with these variables / physical quantities / properties, tell me everything you’ve got”, which would correspond to looking up collections in STAC. This is usually easier to accomplish using a graphical browser.
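For the non-graphical route, here’s a minimal sketch of what this could look like with pystac-client (the catalog URL is just a placeholder):

```python
# list everything a catalog has to offer; the URL is a placeholder
from pystac_client import Client

catalog = Client.open("https://example.com/stac/v1")

for collection in catalog.get_collections():
    print(collection.id, "-", collection.description)
```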

Once I know which dataset I want to use, I might want to narrow it down to just a small subset. This depends on the on-disk format of the data: if it is split into lots of different files (examples would be satellite imagery or in-situ data), I might want to query the catalog for the items that match some criteria. If the dataset is in a single store, this would have to be done after opening the data.
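For the many-files case, that query could be a STAC item search. A sketch with pystac-client (the collection id, bounding box, and time range are made up):

```python
from pystac_client import Client

catalog = Client.open("https://example.com/stac/v1")  # placeholder URL

# query the catalog for items matching spatial / temporal criteria
search = catalog.search(
    collections=["some-imagery-collection"],  # hypothetical collection id
    bbox=[5.0, 50.0, 10.0, 55.0],
    datetime="2022-01-01/2022-12-31",
)
items = list(search.items())
print(f"{len(items)} matching items")
```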

Finally, hard-coding file paths / URLs of datasets in application code is usually very messy, so I’d like to be able to read those from the catalog instead.
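For a single-store (e.g. zarr) dataset, both the href and those additional settings can live on the catalog entry. A sketch assuming a collection that uses the xarray-assets extension (the collection id and asset key are placeholders):

```python
import xarray as xr
from pystac_client import Client

catalog = Client.open("https://example.com/stac/v1")  # placeholder URL
collection = catalog.get_collection("some-zarr-dataset")  # hypothetical id

# no hard-coded URL in application code: the href and the extra
# settings both come from the catalog asset
asset = collection.assets["zarr"]  # asset key depends on the catalog
ds = xr.open_zarr(
    asset.href,
    storage_options=asset.extra_fields.get("xarray:storage_options", {}),
)
```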

Most catalog implementations already support most of these in principle, so the question I have is: how do we represent the datasets we’re typically working with in a catalog to support these different tasks?

Since satellite imagery is the main focus of STAC, there’s a lot of work on this for imagery, but I am not aware of similar guidance for model or in-situ data (although maybe I didn’t look hard enough). Representing in-situ data in a STAC catalog in particular seems only partially solved / not standardized, but in any case I’m hoping we can collect resources on how best to do this. I know @TomAugspurger has done some work on this with xstac, but even for that you still need to create a template.
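For reference, this is roughly how I understand the xstac workflow; the dataset path and the template contents are placeholders, and the template itself still has to be written by hand:

```python
import xarray as xr
import xstac

ds = xr.open_zarr("s3://some-bucket/some-dataset.zarr")  # placeholder path

# the hand-written template: STAC fields that xstac cannot infer
template = {
    "type": "Collection",
    "id": "some-dataset",
    "description": "placeholder description",
    "license": "CC-BY-4.0",
    "links": [],
    "extent": {
        "spatial": {"bbox": [[-180, -90, 180, 90]]},
        "temporal": {"interval": [[None, None]]},
    },
}

# xstac fills in the datacube metadata (dimensions, variables) from the data
collection = xstac.xarray_to_stac(
    ds,
    template,
    temporal_dimension="time",
    x_dimension="lon",
    y_dimension="lat",
)
```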

Related to that, how do we best structure the workflow that takes the data files / store and creates the catalog (this might be STAC-specific as well)?

I’d be particularly interested to see if it makes sense to take some of the openers / transforms from the post-Beam-refactor pangeo-forge-recipes package and combine them with catalog-specific transforms like a CreateXstacItem transform.
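To make that concrete, here’s a rough sketch of what such a pipeline might look like; note that CreateXstacItem does not exist in pangeo-forge-recipes, and make_stac_item is a hypothetical user-supplied helper:

```python
import apache_beam as beam
from pangeo_forge_recipes.patterns import pattern_from_file_sequence
from pangeo_forge_recipes.transforms import OpenURLWithFSSpec, OpenWithXarray

pattern = pattern_from_file_sequence(
    ["s3://some-bucket/a.nc", "s3://some-bucket/b.nc"],  # placeholder URLs
    concat_dim="time",
)

# hypothetical catalog-specific transform: turn opened datasets into STAC items
class CreateXstacItem(beam.PTransform):
    def expand(self, pcoll):
        # make_stac_item would wrap xstac plus a template; not a real function
        return pcoll | beam.MapTuple(lambda index, ds: make_stac_item(index, ds))

with beam.Pipeline() as p:
    (
        p
        | beam.Create(pattern.items())
        | OpenURLWithFSSpec()
        | OpenWithXarray(file_type=pattern.file_type)
        | CreateXstacItem()
    )
```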

efficient access to archival formats

I believe most of this can be resolved by pointing towards kerchunk, although maybe there are data formats that can’t make use of it (because they are not yet supported, or for some technical reason) or datasets where the individual files can’t be stacked to form a bigger dataset?
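For the common case (a stack of netCDF4 / HDF5 files), the workflow I have in mind is roughly the following; the paths and the concat dimension are placeholders:

```python
import fsspec
import xarray as xr
from kerchunk.combine import MultiZarrToZarr
from kerchunk.hdf import SingleHdf5ToZarr

urls = ["s3://some-bucket/file1.nc", "s3://some-bucket/file2.nc"]  # placeholders

# scan each file once, recording the byte ranges of its chunks
refs = []
for url in urls:
    with fsspec.open(url, "rb", anon=True) as f:
        refs.append(SingleHdf5ToZarr(f, url).translate())

# stack the per-file references into one virtual dataset
combined = MultiZarrToZarr(refs, concat_dims=["time"]).translate()

# open the virtual dataset as if it were a zarr store
ds = xr.open_dataset(
    "reference://",
    engine="zarr",
    backend_kwargs={
        "consolidated": False,
        "storage_options": {
            "fo": combined,
            "remote_protocol": "s3",
            "remote_options": {"anon": True},
        },
    },
)
```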

There is definitely work to do! It would be nice to know which formats people need that are not yet supported, and which features are still missing for the formats we already handle.

For aggregating files into datasets, variable-length chunks (“var-chunks”) in zarr are the biggest blocker, with careful consideration of trees / hierarchies (or other non-netCDF layouts) second.

Much of the kerchunk work has been motivated by a fairly small number of specific use cases…