Proposal: Expanding the xstac Python tool to automate a few more of the hard parts

Hi everyone!

I am a developer/MLOps person getting increasingly into the geospatial space and loving every second. I’ve done a little bit of work on pystac and am having a blast. I am a bit new, so please excuse the question.

I am working on a script that combines some NetCDF data with some STAC data. While looking for tools that do this, I came across this thread, which mentions:

Write a tool to automatically generate STAC Collections from xarray / zarr / netCDF

which I believe is what prompted the creation of xstac. As a developer I have found the code pretty easy to follow and use (thanks @TomAugspurger!), but I am wondering whether it would be a high-value project to extend xstac, or to build a tool on top of it, that automates more of the technical details for researcher/data-science folks.

For example, it would:

  1. Not require you to load files into xarray yourself (would just do this under the hood)
  2. Check for geospatial and temporal dimensions automatically (you could run this tool with a flag that asks if you would like to confirm the dimensions are being interpreted correctly)
  3. If using an existing Collection, validate that dimensions “match” (like temporal, CRS if that extension is used, just a general “we are comparing apples to apples” check)
  4. Perform automatic resolution of small mismatches (like reprojection, etc.)
  5. Using stackstac, output to other formats (so this could be a netCDF → Zarr bridge that also validates data compatibility)

This sounds fun to build, but my concerns are the following:

  1. Would anyone use this? Does it solve a real pain point, or is it just something I’ve made up as a newbie?
  2. How huge is this scope? I feel like it’s something I could build and maintain with about 5-10 hours a week of solo work, getting it “done” over the next 1-2 years.
  3. Is there something way more pressing? I could just groom issue backlogs on STAC tools for a bit longer if I’m jumping the gun here.

To be clear, I am just looking for a gut check/initial thoughts from all of y’all with more experience; no need for a super empirical inquiry :smiley:

  1. Not require you to load files into xarray yourself (would just do this under the hood)

xstac deliberately leaves I/O out of the Python API, simply because there are so many ways to load a dataset. The CLI might provide what you’re looking for, but it’s less flexible. Otherwise, a library or tool building on top of xstac would probably be appropriate.
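Concretely, the pattern today is that you do the I/O yourself with xarray and hand the loaded Dataset to xstac. A minimal sketch is below; the `xarray_to_stac` keyword names, the template fields, and the file/dimension names are from my reading of the library and this example, and may differ between versions and datasets:

```python
# Minimal sketch: load the data yourself, then let xstac fill in the
# datacube/dimension metadata from the Dataset.
import xarray as xr
import xstac

ds = xr.open_dataset("example.nc")  # hypothetical file; any xarray-openable source works

# Bare-bones STAC Collection template; xstac fills in the datacube metadata from `ds`.
template = {
    "type": "Collection",
    "id": "example-collection",
    "description": "Example collection generated from a NetCDF file",
    "license": "proprietary",
    "extent": {
        "spatial": {"bbox": [[-180, -90, 180, 90]]},
        "temporal": {"interval": [[None, None]]},
    },
    "links": [],
}

collection = xstac.xarray_to_stac(
    ds,
    template,
    temporal_dimension="time",  # dimension names are assumptions about the example file
    x_dimension="lon",
    y_dimension="lat",
    reference_system=4326,
)
print(collection.to_dict())
```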

  1. Check for geospatial and temporal dimensions automatically (you could run this tool with a flag that asks if you would like to confirm the dimensions are being interpreted correctly)

The intent (maybe not documented) is to use cf-xarray to detect those dimensions automatically. There isn’t a tool on top of it that validates or confirms with the user that it picked correctly, though.
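For illustration, a confirmation step could just echo what cf-xarray detected back to the user before building the STAC metadata; the file name and the example mappings in the comments are made up:

```python
# cf-xarray maps CF axes/coordinates to the dataset's actual dimension names,
# which is what a "confirm with the user" flag would print back.
import cf_xarray  # noqa: F401  (registers the .cf accessor)
import xarray as xr

ds = xr.open_dataset("example.nc")  # hypothetical file

print(ds.cf.axes)         # e.g. {"X": ["lon"], "Y": ["lat"], "T": ["time"]}
print(ds.cf.coordinates)  # e.g. {"longitude": ["lon"], "latitude": ["lat"], "time": ["time"]}
```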

If using an existing Collection, validate that dimensions “match”

Do you mean ensuring that the items within a collection are consistent with each other, or with the Collection? There should be a bit of that through the item_assets extension on the Collection. And maybe at the database layer too. I’m not sure exactly where this would go, but more validation is a good thing IMO.
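As a rough sketch of one such check, using only core pystac calls (the catalog path and the "asset keys must be declared" policy are assumptions for illustration), an item-vs-Collection comparison could look like:

```python
# Compare each item's asset keys against the Collection's item_assets definitions
# and report anything the Collection doesn't declare.
import pystac

collection = pystac.Collection.from_file("collection.json")  # hypothetical path
defined = set(collection.extra_fields.get("item_assets", {}))

for item in collection.get_items():
    extra = set(item.assets) - defined
    if defined and extra:
        print(f"{item.id}: assets not declared in item_assets: {sorted(extra)}")
```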

netCDF → Zarr bridge

There are a lot of lessons learned from pangeo-forge here. It’s a big task. kerchunk / virtualizarr / icechunk are also relevant, especially if keeping the NetCDF files around is desirable.
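To make the scope concrete: the naive single-machine version of the bridge is just a few lines of xarray (paths and chunk sizes below are made up). The hard part, and what pangeo-forge / kerchunk / virtualizarr address, is doing this robustly at scale (chunking strategy, retries, appends, virtual references instead of copies):

```python
# Naive netCDF -> Zarr conversion on one machine.
import xarray as xr

ds = xr.open_mfdataset("data/*.nc", combine="by_coords")
ds = ds.chunk({"time": 100})       # pick chunks suited to downstream access patterns
ds.to_zarr("data.zarr", mode="w")  # writes the dataset out as a Zarr store
```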


Otherwise, a library or tool building on top of xstac would probably be appropriate.

I have built such a library, stac-recipes, which uses the building blocks defined by pangeo-forge-recipes to open datasets as xarray objects, create a STAC item from each with xstac, and write the resulting items to disk (either as a static catalog, or directly to a pgstac database). For some usage examples, see here.
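For readers unfamiliar with Beam, the overall shape of such a pipeline (open → build item → write) looks roughly like the generic sketch below. This is not stac-recipes’ actual API; `make_item` and `write_item` are placeholders, and the input file names are made up:

```python
# Generic Apache Beam pipeline with the same open -> build item -> write shape.
import apache_beam as beam
import xarray as xr

def make_item(path):
    ds = xr.open_dataset(path)
    # ...call xstac here to turn `ds` into a STAC item...
    return {"id": path, "properties": {}}

def write_item(item):
    # ...write to a static catalog or a pgstac database...
    print(item["id"])

with beam.Pipeline() as pipeline:
    (
        pipeline
        | beam.Create(["a.nc", "b.nc"])  # hypothetical input files
        | beam.Map(make_item)
        | beam.Map(write_item)
    )
```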

This design should be flexible enough to allow transforming any of the intermediate results, so if there are any satellite-imagery-specific steps, they should be pretty easy to implement.

However, the use of pangeo-forge-recipes implies the use of Apache Beam, which is a really interesting piece of technology (for example, it can experimentally figure out the optimal chunk / batch size), but which is also severely hindered by how hard it is to run. So I’d expect uptake of this library to be directly linked to the future of pangeo-forge-recipes (which at the moment doesn’t look too promising, although there is a way forward, as discussed in this issue).
