Data pipelining and cataloging best practices (using intake-xarray to transform and combine data and metadata?)

Hey all - I would like to streamline the ingestion of various datasets. I have seen a lot of conversation around intake-xarray, so have started looking into that. One problem, however, is that the NetCDF data exists in multiple different files that cannot be trivially combined.

Take for example the following catalog:

plugins:
  source:
    - module: intake_xarray
sources:
  Peninsula_U:
    description: Peninsula U velocity
    driver: netcdf
    args:
      urlpath: "{{ CATALOG_DIR }}/data/parcels-examples/Peninsula_data/peninsulaU.nc"
      chunks: {}
      xarray_kwargs:
        engine: "netcdf4"
  Peninsula_V:
    description: Peninsula V velocity
    driver: netcdf
    args:
      urlpath: "{{ CATALOG_DIR }}/data/parcels-examples/Peninsula_data/peninsulaV.nc"
      chunks: {}
      xarray_kwargs:
        engine: "netcdf4"
  Peninsula_mesh: ...

This can’t simply be fixed with a peninsula* in urlpath, since U and V exist on a staggered grid: their dimensions share the same names (x, y) but refer to different staggered positions, so the files can’t be naively combined. Ideally these datasets should be opened, dimensions renamed according to the model padding, the xr.Datasets merged, and SGRID metadata attached before being returned to the user.
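To make the combine step concrete, here’s a minimal sketch of what I’d like the catalog to do behind the scenes (the array shapes, the renamed dimension names, and the SGRID attributes below are made up for illustration; a real version would open peninsulaU.nc/peninsulaV.nc instead of building arrays in memory):

```python
# Sketch of a custom "combine" step for staggered-grid U/V fields.
# Shapes and SGRID attrs here are illustrative placeholders.
import numpy as np
import xarray as xr

# U lives on cell x-faces, V on cell y-faces, so the raw files share
# dimension names but not sizes -- a naive xr.merge would conflict.
u = xr.Dataset({"U": (("y", "x"), np.zeros((4, 5)))})
v = xr.Dataset({"V": (("y", "x"), np.zeros((5, 4)))})

# Rename to distinct staggered dimensions before merging.
u = u.rename({"x": "x_u", "y": "y_u"})
v = v.rename({"x": "x_v", "y": "y_v"})

ds = xr.merge([u, v])

# Attach minimal SGRID metadata on a grid-topology variable
# (the node_dimensions value depends on the model's actual staggering).
ds["grid"] = xr.DataArray(0, attrs={
    "cf_role": "grid_topology",
    "topology_dimension": 2,
    "node_dimensions": "x_v y_u",  # placeholder
})
ds.attrs["Conventions"] = "SGRID-0.3"
```

The question is whether this sort of per-entry custom code can live inside an intake catalog rather than in a wrapper script.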

I have a similar question for unstructured grid datasets (ideally it would be great to be able to open the datasets converting to a uxarray.UxDataset object, but I don’t know whether intake supports this sort of custom code etc.)

My questions:

  • How can this be done in intake-xarray? (i.e., can you open these multiple datasets, combine them with custom code, and then expose the result as an intake entry?)
  • Is intake-xarray the current best practice for dealing with this sort of stuff? I’ve heard of STAC catalogs etc., but I don’t know if/how that fits into the picture. intake-xarray hasn’t had a new commit in a couple of years, so I wasn’t sure if people had moved on.

For now I’m just going to work with these entries and wrap them in Python code - very interested to learn about best practices for data pipelining in the xarray ecosystem 🙂

Hi @NickHodgskin - this looks like an interesting use case. If I’m reading it correctly, you are looking for a solution that combines file cataloguing with file processing so that a set of files becomes analysis-ready. As far as I am aware, there isn’t actually a standard solution for this at this point in time. Specifically, without looking at the data, I couldn’t say whether “dimensions renamed according to the model padding” is possible with intake-xarray. An even more interesting question is how to expose the results.

is intake-xarray the current best practice for dealing with this sort of stuff?

Yes, it appears intake-xarray is not actively maintained. STAC is more actively maintained than intake, to my knowledge, but which to use may be a subjective matter. The closest approximation to intake-xarray that I’m aware of is stackstac (see its documentation), recognizing there are significant differences. Tagging @gadomski, who is more up-to-date with the STAC ecosystem.


Thanks for the tag, @aimeeb. @NickHodgskin, I don’t use the xarray library, so I can’t answer those questions, but I can help a bit from the STAC side. There are two STAC libraries, odc-stac and stackstac, that create an xarray.Dataset from a list of STAC items. In my experience, odc-stac is more heavily used, so I’d start there; the docs are pretty good.

To one of your specific questions:

I have a similar question for unstructured grid datasets

What do you mean by “unstructured grid”?

As far as I know, intake-xarray has become obsolete with intake>=2, which allows you to define arbitrary readers. I haven’t used intake in years, so I can’t give advice on how to create an intake v2 catalog, but the docs have guides on how to do that, and you can encode custom transformations (like the renaming of dims, attaching of metadata, or the conversion to UxDataset) in the catalog as well (cc @martindurant)

With STAC you can’t really encode transformations; they would all need to live in libraries (like odc-stac and stackstac do). You’d just point directly to the files, possibly with all datasets as assets of the same item if this is a model with a constant spatial extent.

Another option would be to use virtualizarr to create a combined virtual zarr store (or write directly to zarr if your constraints allow it), where you can tune the metadata (dims, attrs) to your liking; the catalog (both STAC and intake) can then just point to that store instead. In that model, the only transformation step that would need to live in code would be the conversion to uxarray.

With my limited knowledge on intake v2 I’d say if you need to be able to search (especially spatio-temporally) within the catalog use STAC, otherwise either should work.

Hi all, thank you for the responses - this has been very useful.

With my limited knowledge on intake v2 I’d say if you need to be able to search (especially spatio-temporally) within the catalog use STAC, otherwise either should work.

@keewis, could you elaborate here a bit? From my understanding, intake would return an xarray dataset with all coordinates etc. loaded in memory, so searching would be efficient? I can understand, though, that re-opening the dataset repeatedly would be comparatively slow.

What do you mean by “unstructured grid”?

@gadomski

Unstructured grids and why we're interested in them

Unstructured grid data is used by some Ocean General Circulation Models and is defined on a triangular/polygonal mesh, as opposed to a structured (i.e., quad-defined) or rectilinear mesh. There are metadata conventions (the UGRID conventions) for representing this in an xarray dataset, and there are projects (xugrid and uxarray) building tooling to work with these. On the structured side, the SGRID conventions have emerged, clarifying a way to define staggered-grid data, though a lot of model output does not have SGRID metadata attached.
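To illustrate what that UGRID metadata looks like in practice, here is a tiny hand-built example (a made-up mesh of two triangles, not real model output; see the UGRID conventions for the full set of attributes):

```python
# Sketch: a minimal UGRID-flavoured xarray dataset.
import numpy as np
import xarray as xr

# Two triangles sharing an edge: 4 nodes, 2 faces.
ds = xr.Dataset(
    {
        # The "dummy" mesh variable carries the topology metadata.
        "mesh": xr.DataArray(0, attrs={
            "cf_role": "mesh_topology",
            "topology_dimension": 2,
            "node_coordinates": "node_x node_y",
            "face_node_connectivity": "face_nodes",
        }),
        # Which nodes make up each triangular face.
        "face_nodes": (("face", "nodes_per_face"),
                       np.array([[0, 1, 2], [1, 3, 2]])),
    },
    coords={
        "node_x": ("node", np.array([0.0, 1.0, 0.0, 1.0])),
        "node_y": ("node", np.array([0.0, 0.0, 1.0, 1.0])),
    },
)
ds.attrs["Conventions"] = "UGRID-1.0"
```

Tools like xugrid and uxarray look for exactly this kind of metadata to know how the mesh is laid out.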

We in Parcels want to support unstructured and structured grids, and also a wide range of models. Having SGRID/UGRID metadata is important for us as it gives us common ground to work from - hence my interest in this data-engineering/processing to make things “analysis ready” before passing to Parcels.

With STAC you can search within the metadata recorded by the catalog, without opening the actual files. As soon as you have opened the dataset there’s indeed no need for the catalog anymore, and also no need to repeatedly reopen files.

This means that STAC is best suited if you have many files with different spatio-temporal extents, while for model data there’s not too much of a difference between the two.


Maybe a bit off-topic but

This can’t simply be fixed with a peninsula* in urlpath, since U and V exist on a staggered grid: their dimensions share the same names (x, y) but refer to different staggered positions, so the files can’t be naively combined. Ideally these datasets should be opened, dimensions renamed according to the model padding, the xr.Datasets merged, and SGRID metadata attached before being returned to the user.

I think this could be achieved by virtualizing the data? As long as you load the dimension coordinates into memory and write them out to, e.g., native icechunk chunks, the renaming should work, and you could also add any sort of SGRID metadata. But again, maybe I’m a bit off topic.
