I’m currently trying to create a STAC catalog for insitu data, for which I couldn’t find any examples I could copy. The kind of data I’m working with (insitu data is pretty heterogeneous, so I might be looking at the simpler datasets) is timeseries-like:
- scalar lat / lon positions evolving over time (if there are indeed positions, some datasets only have coordinates for deployment and retrieval, i.e. the position of the first and last valid measurement)
- data variables that depend on time, and maybe other dimensions like depth (for profilers) or range (radars)
- lots of dataset-level attributes
I didn’t look too deeply into this yet, but so far it seems like the only tricky part is the spatial extent. Naively, I would represent that with a LineString geometry and a bounding box. However, just using the positions from the dataset seems like it could result in geometries that are too detailed and thus too big (so we’d have to simplify those somehow). Simply using the bounding box as the geometry does not work too well either, because we might mistakenly select an item that has no data in the search region. And finally, I’m wondering how to best represent the case where we only have the coordinates of the first (and maybe the last) valid measurement.
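To make the naive approach concrete, here is a minimal sketch of what such a STAC item could look like as plain GeoJSON, with the per-timestep positions forming the LineString and the bbox derived from those same positions (the identifier, positions, and timestamps are all hypothetical):

```python
positions = [(-30.0, 45.0), (-29.8, 45.1), (-29.5, 45.3)]  # (lon, lat) over time

lons = [p[0] for p in positions]
lats = [p[1] for p in positions]

item = {
    "type": "Feature",
    "stac_version": "1.0.0",
    "id": "drifter-0001-2023-01",  # hypothetical identifier
    "geometry": {
        "type": "LineString",
        "coordinates": [list(p) for p in positions],
    },
    # bbox is [min_lon, min_lat, max_lon, max_lat], derived from the track
    "bbox": [min(lons), min(lats), max(lons), max(lats)],
    "properties": {
        # for a time range, STAC uses start/end_datetime with datetime set to null
        "start_datetime": "2023-01-01T00:00:00Z",
        "end_datetime": "2023-01-31T23:59:59Z",
        "datetime": None,
    },
    "links": [],
    "assets": {},
}
```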
Does anyone have any ideas about this? It seems STAC catalogs for satellite altimetry data might be very similar in that regard, so maybe I can take inspiration from that? Are there any examples I could have a look at?
I did also ask this in a more condensed form on the STAC gitter channel, I’ll post here if I get an answer there.
cc @abkfenris, with whom I had a brief discussion during the weekly check-in yesterday.
Thanks @keewis, this is a really interesting and hot topic! Among data providers (I work for Copernicus), STAC is increasingly recognized as a key tool for data ‘Findability’. At the same time, the increased use is naturally leading to an increasing number of extensions, and I would like to understand what an appropriate structure would be that all these (future) extensions fit into. That said, I’m not a STAC expert, but am rather interested in the taxonomy or ‘types’ of data and how to best characterize and cover all of them. So, very keen to follow this.
A question, if I may: when you write ‘insitu’ you are probably aware that there are very different definitions (and spellings: ‘in situ’, ‘in-situ’) out there. Which particular type of data do you have in mind, and what characterizes it?
We’ll want to update the best practices section with the result of this discussion.
@gadomski has thought about this a bit I think.
I think the main things to balance are:
- Very precise geometries slowing down item loading (from disk / database) and serving (over the network)
  - Anecdotally, our io-lulc dataset has pretty high-resolution geometries and still feels quick
- How precise geometries will affect search: having less-precise geometries (that still cover the item entirely) will mean users can be less precise in their searches and still get back matching items. That might be good or bad, depending on what your users want.
In general, we try to make sure that the geometry completely covers the footprints of the assets, but we simplify things so they aren’t too big (where “too big” is subjective).
stactools raster footprint calculation discusses this a bit too: stactools/raster_footprint.py at ac46504e66bf8f0f257e9775a6d3e811e0093c6f · stac-utils/stactools · GitHub
Do you have some representative example data we can look at? I find that it’s often useful to start with the asset (i.e. the data file or files you’re hosting) when building a new STAC item structure.
> Naively, I would represent that with a LineString geometry and a bounding box.
This seems like a good and reasonable place to start.
> However, just using the positions from the dataset seems like it could result in geometries that are too detailed and thus too big (so we’d have to simplify those somehow).
Agreed. In stactools, we’ve created a function that does just this: API Reference — stactools 0.5.3 documentation. We’ve found that the “correct” simplification varies from dataset to dataset.
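For illustration, the simplification in question is essentially the Ramer–Douglas–Peucker algorithm (the same one behind shapely’s simplify). Here is a self-contained pure-Python sketch of it, without the densification and reprojection steps the stactools function adds; the tolerance is the knob that varies from dataset to dataset:

```python
import math


def perpendicular_distance(p, a, b):
    """Distance from point p to the infinite line through a and b."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    if dx == 0 and dy == 0:
        return math.hypot(px - ax, py - ay)
    return abs(dy * px - dx * py + bx * ay - by * ax) / math.hypot(dx, dy)


def simplify(points, tolerance):
    """Ramer-Douglas-Peucker simplification of a list of (lon, lat) tuples."""
    if len(points) < 3:
        return list(points)
    # find the point farthest from the line joining the endpoints
    dmax, index = 0.0, 0
    for i in range(1, len(points) - 1):
        d = perpendicular_distance(points[i], points[0], points[-1])
        if d > dmax:
            dmax, index = d, i
    if dmax <= tolerance:
        # every intermediate point is within tolerance: keep only the endpoints
        return [points[0], points[-1]]
    # otherwise recurse on both halves, dropping the duplicated split point
    left = simplify(points[: index + 1], tolerance)
    right = simplify(points[index:], tolerance)
    return left[:-1] + right
```

For example, `simplify([(0, 0), (1, 0.1), (2, 0)], 0.5)` collapses the nearly-straight track to its endpoints, while a sharp excursion like `(1, 1)` survives the same tolerance.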
> And finally I’m wondering how to best represent the case where we only have the coordinates of the first (and maybe the last) valid measurement.
Again, I think this depends on the type of data you’re representing. In the case where you only have a first and last, I could see a useful geometry being a polygon that’s a buffered rectangle around the line joining the two points. But, per usual, it depends.
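A minimal sketch of that buffered-rectangle idea, using an axis-aligned rectangle padded by a fixed distance in degrees (a crude stand-in for a proper geodesic buffer; it assumes small extents and no antimeridian crossing, and the function name and padding value are made up for illustration):

```python
def buffered_rectangle(p1, p2, buffer_deg):
    """GeoJSON Polygon: rectangle around the segment p1-p2, padded by buffer_deg.

    p1 and p2 are (lon, lat) tuples, e.g. the first and last valid positions.
    """
    min_lon = min(p1[0], p2[0]) - buffer_deg
    max_lon = max(p1[0], p2[0]) + buffer_deg
    min_lat = min(p1[1], p2[1]) - buffer_deg
    max_lat = max(p1[1], p2[1]) + buffer_deg
    ring = [
        [min_lon, min_lat],
        [max_lon, min_lat],
        [max_lon, max_lat],
        [min_lon, max_lat],
        [min_lon, min_lat],  # GeoJSON rings must be closed
    ]
    return {"type": "Polygon", "coordinates": [ring]}


# e.g. deployment at (0, 0), retrieval at (2, 1), half a degree of padding
geometry = buffered_rectangle((0.0, 0.0), (2.0, 1.0), 0.5)
```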
> It seems STAC catalogs for satellite altimetry data might be very similar in that regard, so maybe I can take inspiration from that? Are there any examples I could have a look at?
I don’t know of any, sorry.
thanks, that helps quite a bit already.
You can have a look at the data here (raw data access through an HTTP server), and I also have some non-public data from biologging, which is basically a temperature / pressure log over time with the mentioned start / end positions and some additional metadata. If the STAC catalog for those turns out to work well, I might also look at a catalog for Argo floats.
> I could see a useful geometry being a polygon that’s a buffered rectangle around the line joining the two points
An additional detail is that the positions in between are literally unknown, as the tagged fish might travel long distances. Since the last position might not be available for all datasets, I think we’d either need two geometry fields, or just use the first position as a point geometry. AFAICT, the former is not supported by STAC, and since searching for the last position does not make too much sense in this particular case, I think I will go with the latter.
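For reference, the point-geometry variant is straightforward: STAC items accept a Point geometry with a degenerate bbox (min and max equal in both axes). A sketch with hypothetical values:

```python
first_position = (-20.3, 38.7)  # (lon, lat) of the first valid measurement

item = {
    "type": "Feature",
    "stac_version": "1.0.0",
    "id": "biologging-tag-42",  # hypothetical identifier
    "geometry": {"type": "Point", "coordinates": list(first_position)},
    # degenerate bbox: the point repeated as both corners
    "bbox": [
        first_position[0],
        first_position[1],
        first_position[0],
        first_position[1],
    ],
    "properties": {"datetime": "2023-03-01T12:00:00Z"},
    "links": [],
    "assets": {},
}
```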
> we’ve created a function that does just this
If I understand that correctly, it will extract the shape of the data, densify the result, then reproject and simplify. I don’t think I can directly apply that to the point / line-shaped geometries I have, but combined with the recommendation to allow a bit more fuzzy searches, maybe computing a buffer around the original geometry, then simplifying the result might work?
> Which particular type of data do you have in mind, what characterizes them?
I’m sorry, I don’t know enough to be able to answer this question, but hopefully you can figure that out by looking at the data (and since that’s CMEMS, you might already know about this particular collection of data?).
> You can have a look at the data here (raw data access through a HTTP server), and I also have some non-public data from biologging, which is basically a temperature / pressure log over time with the mentioned start / end positions and some additional metadata.
Thanks. Without too much domain knowledge to go from, I downloaded a couple of files and checked them out. All the ones I found had latitude, longitude, and time coordinates, which should map just fine to STAC. IMO a (possibly simplified) LineString makes sense for the files I looked at.
> use first as a point geometry
Yup, this makes sense to me. Hopefully the other fields on the items (e.g. datetime, extensions, and any domain-specific properties) would make these items searchable for users, in lieu of rich geometry information.
> it will extract the shape of the data, densify the result, then reproject and simplify.
Each part of the process is contained in its own function, so you could just use the simplify function; that’s a pretty thin wrapper around shapely’s simplify, so you could just use that directly as well.
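Putting the buffer-then-simplify idea together with shapely directly could look like this. This is only a sketch: the track coordinates and both distances are made up, the distances are in degrees, and in practice they would need tuning per dataset (and a sensible projection for anything beyond small extents):

```python
from shapely.geometry import LineString, mapping

# raw per-timestep positions (lon, lat)
track = LineString([(0.0, 0.0), (0.5, 0.1), (1.0, 0.0), (1.5, -0.1), (2.0, 0.0)])

# buffer to make searches a bit fuzzy, then simplify to keep the geometry small
fuzzy = track.buffer(0.2)
simplified = fuzzy.simplify(0.05, preserve_topology=True)

geometry = mapping(simplified)  # GeoJSON-ready dict for the STAC item
bbox = list(simplified.bounds)  # (min_lon, min_lat, max_lon, max_lat)
```

Because the buffer distance (0.2) is larger than the simplification tolerance (0.05), the simplified polygon still covers the original track, which is the property the search-fuzziness recommendation above relies on.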
Using the comments above, I have been able to create collections for each data category in monthly from the repository above, and searching that catalog yields the expected results (for most categories). I’ll post a link to the generating script once I’m done cleaning it up.
For now I also assumed that, within the monthly aggregates, a sensor does not move around too much. However, for some sensor types like the thermosalinometers (the TS category; as far as I understand it, those are sensors on a boat) this assumption may not hold even for “monthly”: they move around quite a bit, so aggregates might cover the entire globe, even though individual months only cover smaller regions. As a result, since by default time and space are treated as orthogonal in a search, the current STAC catalog may return items that contain no data whatsoever matching the search criteria.
To solve this, I wonder if it would be possible to create a STAC extension that makes use of the OGC Moving Features standard and its GeoJSON-based encoding extension (MF-JSON)?
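For a sense of what that could look like: in my reading of the MF-JSON Prism encoding, a moving sensor is a GeoJSON Feature with an additional `temporalGeometry` member pairing timestamps with positions. The field names below follow that reading and should be double-checked against the spec; the values are made up:

```python
# rough sketch of an MF-JSON-style moving feature (hypothetical values)
moving_feature = {
    "type": "Feature",
    "temporalGeometry": {
        "type": "MovingPoint",
        # datetimes[i] is the timestamp of coordinates[i]
        "datetimes": [
            "2023-01-01T00:00:00Z",
            "2023-01-02T00:00:00Z",
            "2023-01-03T00:00:00Z",
        ],
        "coordinates": [[-30.0, 45.0], [-29.8, 45.1], [-29.5, 45.3]],
        "interpolation": "Linear",
    },
    "properties": {},
}
```

Such an extension would let a search engine intersect the query’s time range with the matching slice of the trajectory, instead of treating the full spatial extent and full temporal extent as independent.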