My team on the NASA Office of Data Science and Informatics (ODSI) Data Systems Evolution (DSE) program is considering building an HTTP API for statistics generation from Zarr stores, and we could use the community's input.
Such an API could be more broadly envisioned as a generalized framework for dimension reduction summaries of multidimensional datasets (thanks to @hrodmn for this description).
Such an API would enable clients and users to do things like zonal statistics and time series generation without having to manage any compute resources.
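To make "dimension reduction summary" concrete, here is a minimal sketch of the kind of reduction such an API might run server-side: a zonal mean over a bounding box, collapsing the spatial dimensions and leaving one value per time step. It uses plain Python lists so it stays self-contained; a real service would presumably operate on xarray/Zarr arrays instead. The function name and the `(min_lon, min_lat, max_lon, max_lat)` bbox order are assumptions for illustration only.

```python
from statistics import fmean


def zonal_mean_series(cube, lats, lons, bbox):
    """Reduce a (time, lat, lon) cube to one mean value per time step.

    cube is a nested list indexed as cube[t][i][j]; lats and lons hold the
    coordinate of each row and column. bbox is a hypothetical
    (min_lon, min_lat, max_lon, max_lat) tuple.
    """
    min_lon, min_lat, max_lon, max_lat = bbox
    # Select the grid rows and columns whose coordinates fall inside the box.
    rows = [i for i, lat in enumerate(lats) if min_lat <= lat <= max_lat]
    cols = [j for j, lon in enumerate(lons) if min_lon <= lon <= max_lon]
    if not rows or not cols:
        raise ValueError("bbox does not overlap the grid")
    # Collapse the two spatial dimensions; the time dimension survives.
    return [fmean(step[i][j] for i in rows for j in cols) for step in cube]


lats = [0.0, 1.0, 2.0]
lons = [10.0, 11.0, 12.0]
cube = [
    [[1, 2, 3], [4, 5, 6], [7, 8, 9]],   # time step 0
    [[10] * 3, [10] * 3, [10] * 3],      # time step 1
]
series = zonal_mean_series(cube, lats, lons, bbox=(10.0, 0.0, 11.0, 1.0))
```

The client would receive only `series` (one number per time step) rather than the underlying chunks.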
Would you use a Zarr HTTP API for statistics and how? Why or why not?
Are there any existing standards in place we would implement? The OGC EDR API is a data retrieval standard we are considering, but it is not an analytics/aggregation engine. The EDR API is an option, but it would put a lot of burden on clients to load and compute on that data in memory. If another standard exists, even if it’s a new standard, we would like to know about it!
Would this just be an API for xarray, and isn’t that what xpublish is? xpublish is currently configured per-dataset; it is not an API which accepts a parameterized entrypoint URL.
I’ve been working with NOAA SST data and Copernicus Marine Service datasets for personal projects involving coral reef monitoring and oxygen minimum zone visualization. In both cases, fetching and computing statistics required managing the full pipeline locally: ingestion, processing, and aggregation. A parameterized HTTP API for zonal statistics and time series generation would significantly reduce that overhead, especially for users who need quick exploratory analysis without spinning up compute infrastructure. I’d definitely use it, particularly for region-based aggregations over time dimensions.
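As a rough illustration of what "a parameterized entrypoint URL" could mean, the sketch below builds and parses a request in which the Zarr store location is itself a query parameter rather than something configured at deploy time. The `/statistics` route, the parameter names (`url`, `variable`, `bbox`, `statistic`), and the example bucket are all hypothetical, not part of any existing standard or of xpublish.

```python
from dataclasses import dataclass
from urllib.parse import parse_qs, urlencode, urlsplit


@dataclass(frozen=True)
class StatsRequest:
    store_url: str   # entrypoint of the Zarr store, supplied by the client
    variable: str
    bbox: tuple      # (min_lon, min_lat, max_lon, max_lat), an assumed order
    statistic: str


def build_request_url(base, req):
    """Encode a StatsRequest as a query string on a hypothetical route."""
    query = urlencode({
        "url": req.store_url,
        "variable": req.variable,
        "bbox": ",".join(str(v) for v in req.bbox),
        "statistic": req.statistic,
    })
    return f"{base}/statistics?{query}"


def parse_request_url(url):
    """Decode the query string back into a StatsRequest."""
    params = parse_qs(urlsplit(url).query)
    bbox = tuple(float(v) for v in params["bbox"][0].split(","))
    if len(bbox) != 4:
        raise ValueError("bbox must have four comma-separated numbers")
    return StatsRequest(
        store_url=params["url"][0],
        variable=params["variable"][0],
        bbox=bbox,
        statistic=params["statistic"][0],
    )


request = StatsRequest("s3://example-bucket/sst.zarr", "sst",
                       (-10.0, 0.0, 10.0, 5.0), "mean")
encoded = build_request_url("https://api.example.org", request)
decoded = parse_request_url(encoded)
```

The point of the design is that no dataset has to be registered ahead of time: any store the service can reach is addressable through the `url` parameter.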
Speak of the devil, Matt released datatree support in xpublish this week, and if you squint at the internet the right way, it’s just a really big datatree, right?
This will likely nerd-snipe me enough that I’ll knock up a proof of concept in the next few days, but you would create a dataset provider where each dataset is instead the protocol (http, s3, gcs…) that it supports. Then the rest of the URL is treated as the datatree group part of the path, and then whatever plugins you want or create work on top of that.
The APIs are also flexible enough to reshape the path structure if need be, which I’m working on for more conformant OGC support.
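The routing idea above (protocol as the "dataset", remainder of the URL as the group path) can be sketched without touching xpublish at all. The helper below is a hypothetical illustration of the path split only; it is not xpublish's provider API, and the set of supported protocols is an assumption.

```python
# Protocols the hypothetical provider would expose as "datasets".
SUPPORTED_PROTOCOLS = {"http", "https", "s3", "gcs"}


def split_request_path(path):
    """Split '<protocol>/<rest>' into (protocol, group_path, store_url).

    The first path segment picks the protocol; everything after it is
    treated as the group path and also rebuilt into a remote URL.
    """
    parts = [p for p in path.split("/") if p]
    if len(parts) < 2:
        raise ValueError("expected '<protocol>/<path>'")
    protocol, group_path = parts[0], "/".join(parts[1:])
    if protocol not in SUPPORTED_PROTOCOLS:
        raise ValueError(f"unsupported protocol: {protocol}")
    return protocol, group_path, f"{protocol}://{group_path}"


protocol, group_path, store_url = split_request_path(
    "/s3/example-bucket/sst.zarr/daily"
)
```

Plugins would then operate on whatever is opened from `store_url`, in the same way they operate on a pre-registered dataset today.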
Did you consider the openEO API? It should be closer to what you are looking for than the OGC EDR as it has processing/statistics built in and is also based on a concept of data cubes. In CDSE it is already used for that purpose. FWIW, it has also been an OGC community standard for a few weeks now.
WCPS (the OGC Web Coverage Processing Service) is another candidate: "This standard defines a protocol-independent language for the extraction, processing, and analysis of multi-dimensional coverages representing sensor, image, or statistics data."
Unfortunately it basically only has one implementation, and it’s not open source: Rasdaman. The standard is tightly coupled to this implementation, and no real ecosystem exists. This is probably why it never caught on, despite theoretically solving a common problem.
(As a meta point, I think this points to the limitations of the “open standard + proprietary implementation” approach.)
OpenEO is probably a much better choice, although it could be a bit overkill for your use case.
The crux question for this type of application is scale. Is an HTTP REST API the right interface? It’s fine for a single timeseries or point extraction. But what if I want to generate zonal statistics for millions of polygons over a petabyte-scale data cube? For that, the OpenEO concept of a batch job is probably a better fit: the user creates a batch job and then comes back to check on the results later.
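For readers unfamiliar with the batch-job pattern, here is a minimal in-memory sketch: the client submits work, gets an identifier back immediately, and polls for status or fetches the result later. The class and the status names are illustrative only and do not follow the openEO specification; a real backend would persist jobs outside the process.

```python
import uuid
from concurrent.futures import ThreadPoolExecutor


class JobStore:
    """Toy job registry: submit now, come back for the result later."""

    def __init__(self, max_workers=2):
        self._pool = ThreadPoolExecutor(max_workers=max_workers)
        self._jobs = {}  # job_id -> Future; this dict is the backend state

    def submit(self, func, *args):
        """Start func(*args) in the background and return a job id."""
        job_id = uuid.uuid4().hex
        self._jobs[job_id] = self._pool.submit(func, *args)
        return job_id

    def status(self, job_id):
        """Report 'running', 'finished', or 'error' (illustrative names)."""
        future = self._jobs[job_id]
        if not future.done():
            return "running"
        return "error" if future.exception() else "finished"

    def result(self, job_id, timeout=None):
        """Block until the job completes, then return or re-raise."""
        return self._jobs[job_id].result(timeout=timeout)


store = JobStore()
job_id = store.submit(sum, [1, 2, 3])   # returns immediately
total = store.result(job_id, timeout=5)
```

Note that `_jobs` must outlive any single request, which is exactly the state a purely synchronous request/response API does not need to keep.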
@abkfenris we did consider xpublish (and are still considering it?) as an option, but at this time I think having to pre-configure datasets with a deployed instance of the API is not what we are looking for.
@m-mohr I have to revisit the openEO API, thanks for the suggestion.
@rabernat I will take a look at WCPS as well. But I agree with your meta point that scale is an important consideration for this design. We will likely choose a synchronous HTTP response implementation, which comes with scale restrictions.
Xpublish doesn’t require datasets to be pre-configured. We really don’t consider that best practice; we just have that as the easy way for folks to get started and explore. We probably haven’t been as explicit as we should be in recommending that folks don’t use that method in production.
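One way a synchronous service could enforce a scale restriction is to estimate the size of a request before doing any work and reject anything above a threshold. The sketch below is only an assumption about how that might look: the limit, the cell-count estimate, and the exception name are all hypothetical.

```python
# Hypothetical ceiling on array cells read per synchronous request.
MAX_SYNC_CELLS = 10_000_000


class RequestTooLarge(Exception):
    """Raised when a request exceeds the synchronous size limit."""


def check_sync_request(n_time, n_lat, n_lon, limit=MAX_SYNC_CELLS):
    """Return the estimated cell count, or raise if it exceeds the limit."""
    cells = n_time * n_lat * n_lon
    if cells > limit:
        raise RequestTooLarge(
            f"request touches {cells} cells; limit is {limit}"
        )
    return cells


# A year of daily data over a 100 x 100 window fits under the limit.
small = check_sync_request(n_time=365, n_lat=100, n_lon=100)
```

A check like this would run before any chunks are fetched, so oversized requests fail fast instead of timing out.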
I spent nearly as many hours last weekend in a dry suit as I did at home so I didn’t get to hack up an example, and I’m at an IOOS event this week so maybe I can make it happen sometime later next week.
I’m certainly not an xpublish expert yet but it makes sense that xpublish could be adapted to accept dataset entrypoint URLs as a parameter. Don’t build anything on our account! We are still in an exploratory / design phase so having this information alone is helpful.
I consider XPublish to be a framework for service developers, not a standard itself. In that sense, it’s an implementation detail.
I’d break Aimee’s question into two parts:
1. What is the standard which best maps to this use case? (EDR, WCPS, and OpenEO have all been proposed; there may be others. Or you can always create something new, obligatory XKCD reference, etc.)
2. What existing implementations or frameworks exist for the chosen standard?
The eager vs. job-submission distinction is quite important because job submission introduces state into the backend. AFAIK Xpublish is stateless (or at least all the state lives in the Xarray datasets themselves), so it would be hard to extend it to support OpenEO’s concept of batch jobs.