Pangeo OpenEO Backend?

rabernat · August 21, 2020, 2:44pm

openEO develops an open API to connect R, Python, JavaScript and other clients to big Earth observation cloud back-ends in a simple and unified way.

This project has quite a bit of traction, particularly in Europe. It would be great to try to connect Pangeo with OpenEO. I think the path forward on this would be to implement–or contribute to–an OpenEO “backend”, which can process and serve data to users via an API.

The API is specified here:

https://api.openeo.org/
https://processes.openeo.org/ (processing API)

There seems to be a python backend already started here:

This uses some xarray and dask, but I’m not sure it takes full advantage of the Pangeo cloud stack (zarr, dask distributed, xarray, etc.).

I am posting this issue to gather thoughts about how we could work with OpenEO. Is anyone in our community also involved in OpenEO? What would be the best way for us to have an impact?

martindurant · August 25, 2020, 1:52pm

Sounds like an intake-able API.
I didn’t follow, from a cursory reading, how or where processing happens. I wonder at yet another graph-like representation of computations (otoh: people do not serialise xarray/dask pipelines for recall).

m-mohr · October 20, 2020, 10:48am

Hi Ryan,

I just want to quickly let you know that we plan to let students write an openEO Pangeo back-end this winter term. That will likely not be production ready in any form, but would give us a first impression to base further work on. Any use case you would find interesting or anything you are particularly interested in regarding pangeo/openEO?

There are several Python back-ends, but openeo-processes-python only implements the processes themselves, not the HTTP API. I don’t think openeo-processes-python takes full advantage of Pangeo yet, but that could also be explored in the future and any help is highly appreciated.

Best,
Matthias

rabernat · October 20, 2020, 1:22pm

That’s awesome news @m-mohr! We would love to help out on that project however we can, e.g. helping to define best practices or debug code.

A couple of random thoughts:

In Pangeo world, there are already two DAG-based execution engines: Dask and Prefect. (For example, right now in Pangeo, we can write xarray code which generates a Dask graph for delayed execution and then execute it with a distributed scheduler. This seems conceptually close to what OpenEO is aiming for already.) So it would be ideal if an OpenEO backend could simply translate the OpenEO workflow to one of these existing formats and then pass the execution off to an existing, mature task scheduler.
Xarray has become our de-facto universal API for data analysis. Xarray’s API is similar to, but distinct from, the openEO python API. Pangeo users would likely love to take advantage of OpenEO backend processing, but they probably don’t want to learn a new API. Can we somehow generate OpenEO API calls from vanilla Xarray code? This could be hard, since they use different types of abstraction. (The integration point in Xarray is with the NumPy API, which is implemented by many array libraries, e.g. NumPy itself, Dask, cupy, etc.)
Xcube seems like a really cool project. It already provides a REST API and CLI for interacting with xarray datacubes, and it’s part of the ESA ecosystem. Could Xcube be leveraged here, rather than starting from scratch?
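
To make the first point a bit more concrete, here is a minimal sketch of translating an openEO-style process graph into a task graph and executing it. It is only an illustration under stated assumptions: the node layout (process_id, arguments, from_node, result) follows the openEO process graph structure, but reduce_time_mean and scale are hypothetical simplified processes, the collection id is made up, and plain Python lists stand in for xarray/Dask data cubes. Each task has the same spirit as a Dask task (a callable plus its inputs), but this is not Dask's actual graph format.

```python
from statistics import mean


# Hypothetical stand-ins for real process implementations. A real back-end
# would operate on xarray/Dask-backed data cubes instead of plain lists.
def load_collection(id):
    # Fake "collection": three time steps of a two-pixel image.
    return [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]


def reduce_time_mean(data):
    # Mean over the first (temporal) axis, computed per pixel.
    return [mean(pixel_series) for pixel_series in zip(*data)]


def scale(data, factor):
    return [value * factor for value in data]


PROCESSES = {
    "load_collection": load_collection,
    "reduce_time_mean": reduce_time_mean,
    "scale": scale,
}


def to_task_graph(process_graph):
    """Map each node to a (callable, arguments) task and find the result node."""
    tasks = {}
    result_key = None
    for node_id, node in process_graph.items():
        tasks[node_id] = (PROCESSES[node["process_id"]], node["arguments"])
        if node.get("result"):
            result_key = node_id
    return tasks, result_key


def execute(tasks, key, cache=None):
    """Run a task after recursively resolving its {"from_node": ...} inputs."""
    cache = {} if cache is None else cache
    if key not in cache:
        func, arguments = tasks[key]
        resolved = {}
        for name, value in arguments.items():
            if isinstance(value, dict) and "from_node" in value:
                resolved[name] = execute(tasks, value["from_node"], cache)
            else:
                resolved[name] = value
        cache[key] = func(**resolved)
    return cache[key]


process_graph = {
    "load": {"process_id": "load_collection", "arguments": {"id": "DEMO_COLLECTION"}},
    "mean": {"process_id": "reduce_time_mean", "arguments": {"data": {"from_node": "load"}}},
    "scaled": {
        "process_id": "scale",
        "arguments": {"data": {"from_node": "mean"}, "factor": 10},
        "result": True,
    },
}

tasks, result_key = to_task_graph(process_graph)
result = execute(tasks, result_key)
print(result)
```

In a real back-end the execute step is what one would hand off to an existing scheduler rather than write by hand.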

m-mohr · October 26, 2020, 10:23am

Thanks @rabernat, appreciate the offer and your thoughts. We’ll likely get back to it.

It’s likely that we’ll translate into an existing format for a task scheduler. We usually don’t implement that on our own in openEO.
I guess that’s best discussed with the guys implementing the Python client, as I would imagine this being done at the Python client level: xarray is only known in the Python world, so it would not benefit R/JS users much. Maybe there’s room for further alignment. I doubt we can fully align, but maybe we can make it easier for users to learn the new API?
I have no clue, but that’s likely a point the students can investigate.

geynard · July 6, 2022, 1:35pm

Hi everyone,

I was going to open a thread on Pangeo and OpenEO when I found it was already opened almost 2 years back! Awesome.

As @rabernat said, there’s quite a bit of traction in Europe towards OpenEO, but towards Pangeo too. At CNES and other places, we’re trying to see if the two approaches could work together. As I’m really new to OpenEO concepts (I have only read the about page), I don’t have anything to add to what Ryan suggested.

Have there been any advances on the subjects discussed here that people know about?

Maybe @pl.marasco or @annefou have some thoughts if they saw some talks at the recent ESA Living Planet Symposium?

pl.marasco · July 14, 2022, 8:20am

Hi everyone,

Indeed, OpenEO in Europe can, for many reasons, be seen as a trending technology and, IMHO, within a couple of years the European ‘market’ will be flooded by it and, more specifically, by the OpenEO platform.

@geynard, like you I’m a newbie on this topic. I attended a couple of talks at the LPS, but all of them focused more on presenting the platform than on the API; as the LPS was quite dispersed, maybe I missed the more focused one.

Indeed, from the OpenEO developers, I had the feeling that there is interest in having a dialogue with Pangeo developers but I’m not aware of any initiative.

TomAugspurger · July 14, 2022, 11:39am

Just cross linking to An Pangeo/ODC-based backend that runs "out of the box" · Issue #16 · Open-EO/PSC · GitHub, where the Open-EO community is discussing this.

m-mohr · October 19, 2022, 9:24am

Hey there,

we are giving a talk today in the Pangeo showcase at 12pm EDT, which should give a good insight into how openEO compares to Pangeo: “Wednesday October 19th 2022: openEO: What it is and how it relates to Pangeo”. We’d love to see you around!

I had a discussion recently with @annefou and some years(?) ago with @rabernat. We always agreed that we are not in competition, but can greatly benefit from each other as the projects mostly work on different “layers” or “levels” (technology stack vs. API specification). For example, some of the openEO implementations use the Pangeo stack (see the post Tom has linked to). We are always happy to discuss further steps so that the two communities benefit from each other as much as possible and align wherever useful/possible.

Best,
Matthias

geynard · November 4, 2022, 10:50am

Just wanted to share some more (personal and opinionated, sorry) thoughts about the OpenEO vs Pangeo approaches and discuss them here. I’m sorry I missed the showcase (just watched the recorded version), so I probably also missed some discussions.

First, I agree that Pangeo and OpenEO don’t work on the same layer on a first approach, and probably don’t target exactly the same needs. But I think there is sufficient overlap to see where we could work together, or at least explain key differences between the two.

Key features of OpenEO
@m-mohr, I’ll probably need your review here (and in the rest of the post also)!

OpenEO defines a new API, which is language agnostic: we can use it with R, Python or JavaScript.
Its target is to provide first a high-level API that scientists (not computer science experts) can use to define workflows that will be able to run on different data centers (with different back-end implementations). It can be seen as a top-down approach.
One of its goals is still to be able to explore huge datasets.
It is based on a client/server architecture and interactions.
Various back-end implementations exist, mainly based on the Pangeo stack or Spark/Java tooling. The OpenEO API has been built with that in mind.
The end-user doesn’t (really?) have to understand the back-end implementation, computing infrastructure, storage layer or file formats.

Key features of Pangeo
Others here, please correct me if I’m wrong.

The Pangeo environment is a set of Python libraries aiming at providing tools to ease open, reproducible and scalable geoscience.
Its main high-level API is Xarray, which relies on standards such as NumPy arrays.
It’s built on top of open source and widely used Python packages, improving their integration together and their ability to scale. It’s more of a bottom-up approach.
A Pangeo platform is kind of a serverless computing environment: everything happens in the same infrastructure (OK, there are possibilities with Coiled or others to run your client code on a local laptop).
The end-user will probably need a deeper understanding of all Pangeo layers to take full advantage of its possibilities. For example, understanding ARCO data formats is key to using Pangeo to its full potential.

My (biased) opinion on OpenEO vs Pangeo

I think an advantage of Pangeo is the fact that it is built on tools users already know and use on small datasets. You can easily use Pangeo on your laptop with a Conda environment, and then scale on Cloud or HPC based platforms.
On the contrary, with OpenEO, you’ll need to learn a new language.
OpenEO adds another layer between the user and the processing: OpenEO graphs will be translated on a Pangeo back-end using Xarray, for example.
I’m also not sure if you can easily develop locally (I think this is an identified improvement)?
One advantage of OpenEO is that it is language agnostic, and more high level (directly targeted at EO processing), so probably easier to adopt for scientists? You shouldn’t have to worry about distributed computing and data storage with OpenEO.
But it’s also kind of “Yet Another DAGs Generator”, or “Yet Another Datacube technology”. You probably know the comic already.
I think you can also use an Editor to implement workflows with OpenEO, so you’re not even forced to write code.
I’m comfortable with Pangeo, because I’m a software engineer, not a scientist, but I recognize that it often needs a deep understanding of distributed computing and some knowledge about infrastructure and storage format. Debugging Dask is not always fun.
But I also guess that debugging large computations on OpenEO’s various back-ends might not be easy. There seems to be a lot of Map/Reduce logic involved, so probably not for beginners either when working at scale.
And having access to all the layers within Pangeo can help in understanding what you do and optimizing things better.

So in the end, as you said, there are sufficient differences in the design and the target audiences to have the need for both approaches, but also room for improvement on their similarities.

Some questions I have about OpenEO

If I understood correctly, the various back-end implementations don’t necessarily implement all the existing processes. What level of compatibility and workflow sharing can we expect when working with several OpenEO providers? One back-end based on Spark will probably not behave like another based on Pangeo?
How easy is it to optimize processing at scale? Do you need some understanding of implementations and dataset formats/chunking?
What drove the API design logic with processes and MapReduce operations? If I understood correctly, it’s not an OGC standard (but you’re aiming at it?). Is it still based on some OGC standards (I heard things in the talk about OGC APIs)? How were the different processes defined?
Am I correct when I say that OpenEO is a client/server architecture? The client sends requests to a centralized server (one for each OpenEO provider), which then relays the requests to the back-end?

Subjects on which OpenEO and Pangeo can work collaboratively
You already identified most of them!

Provide a re-usable Pangeo-based OpenEO backend, work already in progress. This would then mean that if you already provide a Pangeo platform, you could more easily provide an OpenEO interface. Is there some open source development ongoing somewhere?
Based on this, provide a way to use OpenEO on everyone’s laptop with any datasets. But will it be as simple as installing a Conda environment?
One big challenge: OpenEO and Pangeo have their own APIs, how could we bring them back together? There is already a discussion on this and on using the Array API Standard, thanks to @benbovy.
At least, it would be really nice if we could provide some ways to access the same datasets with both Pangeo and OpenEO. As we are often working with the same technology stack, maybe this would be possible with some OpenEO providers? It would be nice if a user could choose between using the Pangeo API or OpenEO when working on a given data center.
Or maybe we should look for cooperation on lower level libraries like STAC, Zarr or any storage format or infrastructure optimizations?

m-mohr · November 9, 2022, 12:06pm

Hey @geynard,

thanks for this extensive post. I’ll try to comment/reply to it step by step.

Your summary of openEO sounds pretty accurate. The supported programming languages can be extended pretty easily by lightweight new clients in e.g. Julia or Java. One important aspect that you did not explicitly mention is the availability of the openEO processes, which are actually more important for interoperability than the API itself.

Indeed, although not all users use the Pangeo stack so for many there’s a learning curve anyway (think other languages, new students, …). The scaling from small datasets to larger datasets is often non-trivial and could be easier with openEO (assuming the back-end implementation is good enough). Also, working with some data scientists in the past has shown that often their understanding of the technology (e.g., xarray) is very barebone and they barely get things done. But that’s true for all ecosystems and languages. Another example is that if you have a team of e.g. R and Python programmers, it is nice to just let them use the same thing so they can collaborate more easily.

Indeed, that’s an open issue we try to mitigate with the following approaches right now:

Improve remote debugging capabilities
We are implementing a “local back-end” so that you can compute and experiment locally
In openEO Platform, we are implementing a more interactive approach similar to GEE so that you can show results on a map more easily and more rapidly.

Yes, I know it. It’s actually on the pinboard right behind me.
When we started we did not find any comparable technology that lets us work across languages in a reproducible way on large datasets. And I still don’t really see one so I feel like there’s good reasoning to have openEO around unless a well established project gets ported and adopted across languages, e.g. an xarray API for R and Julia with simplified approaches for scaling up.

Yes, for example https://editor.openeo.org/?server=https%3A%2F%2Fopeneo.cloud&discover=1
Although the model building doesn’t make it a lot easier, as you still need to (implicitly?) understand important aspects of (functional) programming and data cube handling. Ideally, it would move in a direction where you can just formulate what you want to know instead of thinking about what reducers, filters etc. are. But that’s a whole new story and out of scope for openEO. It could be built on top of openEO, but it could also be built on top of the Pangeo stack.

It is very complex indeed, but it will be done by software engineers in the background (ideally). Users don’t and usually can’t work on that aspect. It’s hidden from them so that they don’t need to care. Users implicitly tell the implementation how to optimize by choosing the data cube processes. So by telling what you want to achieve the backend can draw conclusions and optimize correspondingly.

Yes, each back-end can choose which processes to implement. They can also implement custom ones and users can create new processes, too. We are aware that this might reduce interoperability a bit, but usually, we see a similar set of processes being implemented and for openEO Platform for example we have a set of core processes defined that a back-end must support: Processes | openEO Platform Documentation
It’s fine if some back-ends implement only a small subset and work in their own niche for example. As long as there are enough back-ends around that are similar so that you can switch between them, we are fine with it. The openEO Hub for example lets you paste a process graph and show all compliant back-ends so you can choose where to run it. Usually, it shows multiple of them.
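
As a rough illustration of that kind of compatibility check (not the openEO Hub’s actual implementation), a minimal sketch could collect every process_id used in a process graph, including nested callback graphs, and compare that set against what each back-end advertises. The back-end names and their process lists below are made up for the example; the node layout follows the openEO process graph structure.

```python
def required_processes(process_graph):
    """Collect all process ids used in a process graph, including callbacks."""
    needed = set()
    for node in process_graph.values():
        needed.add(node["process_id"])
        for value in node.get("arguments", {}).values():
            # Callbacks (e.g. reducers) are nested graphs under "process_graph".
            if isinstance(value, dict) and "process_graph" in value:
                needed |= required_processes(value["process_graph"])
    return needed


def compatible_backends(process_graph, backends):
    """Return the names of back-ends that support every required process."""
    needed = required_processes(process_graph)
    return sorted(name for name, supported in backends.items() if needed <= supported)


process_graph = {
    "load": {"process_id": "load_collection", "arguments": {"id": "DEMO_COLLECTION"}},
    "reduce": {
        "process_id": "reduce_dimension",
        "arguments": {
            "data": {"from_node": "load"},
            "dimension": "t",
            "reducer": {
                "process_graph": {
                    "m": {
                        "process_id": "mean",
                        "arguments": {"data": {"from_parameter": "data"}},
                        "result": True,
                    }
                }
            },
        },
    },
    "save": {
        "process_id": "save_result",
        "arguments": {"data": {"from_node": "reduce"}, "format": "GTiff"},
        "result": True,
    },
}

# Hypothetical back-ends and the processes they advertise.
backends = {
    "backend_a": {"load_collection", "reduce_dimension", "mean", "save_result", "ndvi"},
    "backend_b": {"load_collection", "reduce_dimension", "save_result"},
    "backend_c": {"load_collection", "reduce_dimension", "mean", "median", "save_result"},
}

compatible = compatible_backends(process_graph, backends)
missing = {name: required_processes(process_graph) - supported
           for name, supported in backends.items()}
print(compatible)
print(missing["backend_b"])
```

This only checks process names; it says nothing about whether two back-ends return comparable values for the same process.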

That’s what the process descriptions actually try to reduce as much as possible, so that the results a process generates on Spark-based and Pangeo-based back-ends are actually comparable (with regard to the computed values, not with regard to performance or so). This is an additional burden, of course, but if you start with openEO that’s usually one of your goals anyway. There will be subtle differences that we can’t always neglect, but we ran some experiments where multiple back-ends with different tech stacks actually returned comparable results, and the main differentiator was often just the pre-processing of the datasets.

From a user perspective, dataset formats/chunking is usually not an issue, but sometimes it is (unfortunately) still the case that some understanding of the implementation is required. That will reduce over time with the implementations getting more mature.

Could you elaborate a bit more on what you are heading for?

Yes, right now it is not an OGC standard. We may submit it as an OGC community standard in the future. We tried to build on top of OGC standards though, but when we started we were still stuck with the old OGC standards and the OGC APIs were mostly not there yet. Still, we aligned a lot in the past. So we re-use many of them as much as possible for the basic API architecture, maps, tiles, data discovery etc. What we explicitly do not align with is OGC API - Processes, as it doesn’t have a good way to freely chain processes yet and also, most importantly, OGC doesn’t define processes at all. But that’s actually the most difficult part that openEO solves. The API itself is rather slim compared to the processes.

Yes, except for some nuances that is probably correct to say.

I basically agree with all your points here. I hope we can steer this.

We have an open source implementation at GitHub - Open-EO/openeo-processes-python: A Python representation of (most) openEO processes, but it has aged rapidly and was more a proof of concept with some architectural issues. There’s now an initiative to rewrite this, but unfortunately the company that is writing this has chosen to start closed and release it to open source once it has reached a certain maturity. Once it is open source though, I could see this becoming a good point for collaboration. It basically is a set of (optimized) process implementations based on Pangeo that others could also re-use and Pangeo could contribute to.

That’s the aim, although I think the plan right now is to make it available as a Docker container.

There are a couple of ways, but all of them require major work and I’m not sure what can be achieved.

Implement a Pangeo-based backend that runs locally.
Add an Xarray or Array API-based interface for the openEO Python client. This only helps for Python though as the Array API proposal is very biased towards Python. To be a real “Array API” and not just a “Python Array API” it would need involvement and alignment between multiple ecosystems and languages (e.g. R, Julia, openEO, OGC?, …)

I’m not sure yet what the collaboration could look like on the dataset level.

I hope I captured all the important points here, it was quite a lot to digest.

Best,
Matthias

geynard · November 10, 2022, 7:54am

Thanks a lot @m-mohr for the complete answer, this is very much appreciated! And I agree with all the things you just said here.

I was just wondering if the functional programming nature of the OpenEO API was driven by the foreseen back-end implementations (e.g. Spark), or if it was just a natural choice considering the operations/processes that you identified and the need for scaling up. But I guess this is somehow a combination of several aspects.

I’m waiting for it!

I had something in mind like an Xarray.from_openeo() way of opening datasets, but that assumes that on a given OpenEO provider you could also deploy a Pangeo platform with Dask workers in order to have “local” data access.

Yeah I know, sorry for the long post, and thanks again for the comprehensive answers.

m-mohr · November 10, 2022, 10:29am

Yes, several aspects come into play here:

The project originated in @edzer’s lab. He is of R fame and R is functional…
Functional just makes a lot of sense when processing large data volumes and transferring “working instructions” between two entities. It’s a bit like writing down cooking recipes for the backend.
From the “functional” instructions it’s easier to figure out how to chunk and parallelize. By having something like “Reduce the temporal dimensions” you know you can arbitrarily chunk and parallelize as long as the temporal dimension doesn’t get chunked. Try that with nested for loops or so…
Indeed, also the software stack seemed mostly oriented towards “functional” (but honestly, this wasn’t considered too much in the beginning)
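
A small sketch of the chunking point, using only the Python standard library and a toy cube made of nested lists (the data, chunk size and helper names are made up for illustration): because the reduction runs along time, the cube can be split along the spatial axis in any way, each chunk reduced independently and in parallel, and the pieces simply concatenated.

```python
from concurrent.futures import ThreadPoolExecutor
from statistics import mean

# Toy data cube indexed as cube[t][pixel]: 4 time steps x 6 pixels.
cube = [[float(t * 10 + p) for p in range(6)] for t in range(4)]


def reduce_time(block):
    """Reduce the temporal dimension: one mean per pixel."""
    return [mean(series) for series in zip(*block)]


def split_spatially(cube, chunk_size):
    """Split along the pixel axis only; every chunk keeps all time steps."""
    n_pixels = len(cube[0])
    return [
        [row[start:start + chunk_size] for row in cube]
        for start in range(0, n_pixels, chunk_size)
    ]


# Reference result computed on the whole cube at once.
whole = reduce_time(cube)

# Same reduction, chunk by chunk, in parallel.
chunks = split_spatially(cube, chunk_size=2)
with ThreadPoolExecutor() as pool:
    parts = list(pool.map(reduce_time, chunks))
chunked = [value for part in parts for value in part]

print(whole)
print(chunked == whole)
```

Splitting along time instead would not be a simple concatenation: each chunk would only see part of every series, so an extra combine step would be needed.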

Me, too

I see, but the assumption won’t hold, I think. The opposite might be the better idea, i.e. to have something like openeo.result.to_xarray(). Proposed in Easily access result through xarray (to_xarray) · Issue #340 · Open-EO/openeo-python-client · GitHub
