This project has quite a bit of traction, particularly in Europe. It would be great to try to connect Pangeo with OpenEO. I think the path forward here would be to implement, or contribute to, an OpenEO “backend”, which can process and serve data to users via an API.
Sounds like an intake-able API.
I didn’t follow, from a cursory reading, how or where processing happens. I wonder at yet another graph-like representation of computations (otoh: people do not serialise xarray/dask pipelines for recall).
I just want to quickly let you know that we plan to let students write an openEO Pangeo back-end this winter term. That will likely not be production ready in any form, but would give us a first impression to base further work on. Any use case you would find interesting or anything you are particularly interested in regarding pangeo/openEO?
There are several Python back-ends, but openeo-processes-python only implements the processes themselves, not the HTTP API. I don’t think openeo-processes-python takes full advantage of Pangeo yet, but that could be explored in the future, and any help is highly appreciated.
That’s awesome news @m-mohr! We would love to help out on that project however we can, e.g. helping to define best practices or debug code.
A couple of random thoughts:
In Pangeo world, there are already two DAG-based execution engines: Dask and Prefect. (For example, right now in Pangeo, we can write xarray code which generates a Dask graph for delayed execution and then execute it with a distributed scheduler. This seems conceptually close to what OpenEO is aiming for already.) So it would be ideal if an OpenEO backend could simply translate the OpenEO workflow to one of these existing formats and then pass the execution off to an existing, mature task scheduler.
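To make the translation idea concrete, here is a toy sketch (in plain Python, no Dask) of what an openEO-to-scheduler translator does: walk a process graph, resolve `from_node` references, and execute each node once. The process names, graph shape, and implementations here are simplified illustrations, not the real openEO specification; a real backend would emit a Dask graph instead of calling functions eagerly.

```python
# Toy sketch: execute a simplified openEO-style process graph by resolving
# node references in dependency order -- the same idea a backend would use
# when emitting a Dask graph. Process names and arguments are made up.

process_graph = {
    "load": {"process_id": "load_collection", "arguments": {"id": "toy-collection"}},
    "scale": {"process_id": "multiply",
              "arguments": {"data": {"from_node": "load"}, "factor": 2}},
    "reduce": {"process_id": "sum",
               "arguments": {"data": {"from_node": "scale"}}, "result": True},
}

# Hypothetical process implementations over plain lists; `res` resolves a
# {"from_node": ...} reference to the referenced node's result.
IMPLEMENTATIONS = {
    "load_collection": lambda args, res: [1, 2, 3, 4],
    "multiply": lambda args, res: [x * args["factor"] for x in res(args["data"])],
    "sum": lambda args, res: sum(res(args["data"])),
}

def execute(graph):
    cache = {}  # each node is computed at most once

    def resolve(ref):
        node_id = ref["from_node"]
        if node_id not in cache:
            node = graph[node_id]
            cache[node_id] = IMPLEMENTATIONS[node["process_id"]](node["arguments"], resolve)
        return cache[node_id]

    result_node = next(k for k, v in graph.items() if v.get("result"))
    return resolve({"from_node": result_node})

print(execute(process_graph))  # -> 20
```

Swapping the eager lambdas for `dask.delayed` wrappers would hand the whole graph to a distributed scheduler, which is the point being made above.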
Xarray has become our de-facto universal API for data analysis. Xarray’s API is similar to, but distinct from, the openEO python API. Pangeo users would likely love to take advantage of OpenEO backend processing, but they probably don’t want to learn a new API. Can we somehow generate OpenEO API calls from vanilla Xarray code? This could be hard, since they use different types of abstraction. (The integration point in Xarray is with the NumPy API, which is implemented by many array libraries, e.g. NumPy itself, Dask, cupy, etc.)
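One way to explore the “generate OpenEO calls from vanilla code” idea is operator recording: a stand-in array object that builds an expression tree instead of computing values, which could later be serialized into openEO process-graph nodes. The class below is entirely hypothetical; the real integration point would be xarray’s NumPy/duck-array protocols.

```python
# Minimal sketch of the "record the array-style calls" idea: instead of
# computing, each operation returns a node in an expression tree that a
# translator could later turn into openEO processes. Hypothetical code.

class Recorded:
    def __init__(self, op, *inputs):
        self.op = op          # e.g. "load", "add", "mul"
        self.inputs = inputs  # child nodes or plain scalars

    def __add__(self, other):
        return Recorded("add", self, other)

    def __mul__(self, other):
        return Recorded("mul", self, other)

    def describe(self):
        # Render the captured tree as a readable expression string.
        if self.op == "load":
            return self.inputs[0]
        args = ", ".join(
            i.describe() if isinstance(i, Recorded) else repr(i)
            for i in self.inputs
        )
        return f"{self.op}({args})"

# The user writes vanilla-looking array code ...
red = Recorded("load", "B04")
nir = Recorded("load", "B08")
expr = (nir + red) * 0.5

# ... and we recover the computation as a tree instead of a value.
print(expr.describe())  # -> mul(add(B08, B04), 0.5)
```

This is essentially how Dask’s own lazy collections work, which is why the abstraction mismatch mentioned above matters: openEO records named geospatial processes, while the NumPy protocol records low-level array operations.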
Xcube seems like a really cool project. It already provides a REST API and CLI for interacting with xarray datacubes, and it’s part of the ESA ecosystem. Could Xcube be leveraged here, rather than starting from scratch?
Thanks @rabernat, appreciate the offer and your thoughts. We’ll likely get back to it.
It’s likely that we’ll translate into an existing format for a task scheduler. We usually don’t implement that on our own in openEO.
I guess that’s best discussed with the people implementing the Python client, as I would imagine this being done at the Python client level: xarray is only known in the Python world and would not benefit R/JS users so much. Maybe there’s room for further alignment. I doubt we can fully align, but maybe we can make it easier for users to learn the new API?
I have no clue, but that’s likely a point the students can investigate.
I was going to open a thread on Pangeo and OpenEO when I found it was already opened almost 2 years back! Awesome.
As @rabernat said, there’s quite a bit of traction in Europe towards OpenEO, but towards Pangeo too. At CNES and other places, we’re trying to see if the two approaches could work together. As I’m really new to OpenEO concepts (I’ve only read the about page), I don’t have anything to add to what Ryan suggested.
Have there been any advances on the subjects discussed here that people know about?
Maybe @PhenoloBoy or @annefou have some thoughts if they saw some talks at the recent ESA Living Planet Symposium?
Indeed, OpenEO in Europe can, for many reasons, be seen as a trending technology and, IMHO, within a couple of years the European ‘market’ will be flooded by it and, more specifically, by the openEO Platform.
@geynard, like you I’m indeed a newbie on this topic. I attended a couple of talks at the LPS, but all of them were more focused on presenting the platform than on the API; as the LPS was quite dispersive, maybe I missed the more focused ones.
Indeed, from the OpenEO developers, I had the feeling that there is interest in having a dialogue with Pangeo developers but I’m not aware of any initiative.
I had a discussion recently with @annefou and some years(?) ago with @rabernat. We always agreed that we are not in competition, but can greatly benefit from each other as the projects mostly work on different “layers” or “levels” (technology stack vs. API specification). For example, some of the openEO implementations use the Pangeo stack (see the post Tom has linked to). We are always happy to discuss further steps so that the two communities benefit from each other as much as possible and align wherever useful/possible.
Just wanted to share some more (personal and opinionated, sorry!) thoughts about the OpenEO vs Pangeo approaches and discuss them here. I’m sorry I missed the showcase (I just watched the recorded version), so I probably also missed some discussions.
First, I agree that Pangeo and OpenEO don’t work on the same layer on a first approach, and probably don’t target exactly the same needs. But I think there is sufficient overlap to see where we could work together, or at least explain key differences between the two.
Key features of OpenEO
@m-mohr, I’ll probably need your review here (and in the rest of the post also)!
Its target is first to provide a high-level API that scientists (not computer science experts) can use to define workflows able to run on different data centers (with different back-end implementations). It can be seen as a top-down approach.
One of its goals is still to be able to explore huge datasets.
It is based on client/server architectures and interactions.
Various back-end implementations exist, mainly based on the Pangeo stack or Spark/Java tooling. The OpenEO API has been built with that in mind.
The end user doesn’t (really?) have to understand the back-end implementation, computing infrastructure, storage layer, or file formats.
Key features of Pangeo
Please, others here, correct me if I’m wrong.
The Pangeo environment is a set of Python libraries aiming to provide tools that ease open, reproducible, and scalable geoscience.
Its main high-level API is Xarray, which relies on standards such as NumPy arrays.
It’s built on top of open-source and widely used Python packages, improving their integration together and their ability to scale. It’s more of a bottom-up approach.
A Pangeo platform is kind of a server-less computing environment: everything happens in the same infrastructure (OK, there are possibilities with Coiled or others to run your client code on a local laptop).
The end user will probably need a deeper understanding of all the Pangeo layers to take full advantage of its possibilities. For example, understanding ARCO data formats is key to using Pangeo to its full potential.
My (biased) opinion on OpenEO vs Pangeo
I think an advantage of Pangeo is that it is built on tools users already know and use on small datasets. You can easily use Pangeo on your laptop with a Conda environment, and then scale on Cloud- or HPC-based platforms.
On the contrary, with OpenEO, you’ll need to learn a new language/API.
OpenEO also adds another layer between the user and the processing: OpenEO graphs will be translated on a Pangeo back-end into Xarray operations, for example.
I’m also not sure if you can easily develop locally (I think this is an identified improvement)?
One advantage of OpenEO is that it is language agnostic, and more high level (directly targeted at EO processing), so probably easier to adopt for scientists? You shouldn’t have to worry about distributed computing and data storage with OpenEO.
I think you can also use an Editor to implement workflows with OpenEO, so you’re not even forced to write code.
I’m comfortable with Pangeo, because I’m a software engineer, not a scientist, but I recognize that it often needs a deep understanding of distributed computing and some knowledge about infrastructure and storage format. Debugging Dask is not always fun.
But I also guess that debugging large computations on the various OpenEO back-ends might not be easy. There seems to be a lot of Map/Reduce logic involved, so probably not for beginners either when working at scale.
And having access to all the layers within Pangeo can help in understanding what you do and optimizing things better.
So in the end, as you said, there are sufficient differences in the design and the target audiences to have the need for both approaches, but also room for improvement on their similarities.
Some questions I have about OpenEO
If I understood correctly, the various back-end implementations don’t necessarily implement all the existing processes. What level of compatibility and workflow sharing can we expect when working with several OpenEO providers? One back-end based on Spark will probably not behave like another based on Pangeo?
How easy is it to optimize processing at scale? Do you need some understanding of the implementations and dataset formats/chunking?
What drove the API design logic with processes and MapReduce operations? If I understood correctly, it’s not an OGC standard (but you’re aiming at that?); is it still based on some OGC standards (I heard things in the talk about OGC APIs)? How were the different processes defined?
Am I correct in saying that OpenEO is a client/server architecture? The client sends requests to a centralized server (one for each OpenEO provider), which then relays the request to the back-end?
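For readers unfamiliar with what such a request actually contains: the client serializes the workflow as a JSON process graph and sends it to the server. The snippet below is a rough, simplified sketch of that format; the collection name, argument shapes, and the omission of the `reducer` sub-graph (which the real `reduce_dimension` process requires) are simplifications, so check the openEO API specification before relying on them.

```python
import json

# Rough sketch of the kind of JSON process graph an openEO client sends.
# Node keys are arbitrary ids; {"from_node": ...} wires the outputs.
# Simplified: the real reduce_dimension also takes a "reducer" sub-graph.
process_graph = {
    "load1": {
        "process_id": "load_collection",
        "arguments": {
            "id": "SENTINEL2_L2A",  # illustrative collection name
            "temporal_extent": ["2021-06-01", "2021-06-30"],
        },
    },
    "reduce1": {
        "process_id": "reduce_dimension",
        "arguments": {"data": {"from_node": "load1"}, "dimension": "t"},
        "result": True,  # marks the node whose output is returned
    },
}

payload = json.dumps(process_graph, indent=2)
print(payload)
```

The server never receives code, only this declarative description, which is what makes the client language-agnostic and lets the back-end choose its own execution strategy.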
Subjects on which OpenEO and Pangeo can work collaboratively
You already identified most of them!
Provide a re-usable Pangeo-based OpenEO backend; work is already in progress. This would mean that if you already provide a Pangeo platform, you could more easily provide an OpenEO interface. Is there some open-source development ongoing somewhere?
Based on this, provide a way to use OpenEO on everyone’s laptop with any dataset. But will it be as simple as installing a Conda environment?
At the least, it would be really nice if we could provide some way to access the same datasets with both Pangeo and OpenEO. As we are often working with the same technology stack, maybe this would be possible with some OpenEO providers? It would be nice if a user could choose between the Pangeo API and OpenEO when working on a given data center.
Or maybe we should look for cooperation on lower level libraries like STAC, Zarr or any storage format or infrastructure optimizations?
Thanks for this extensive post. I’ll try to comment/reply to it step by step.
Your summary of openEO sounds pretty accurate. The supported programming languages can be extended pretty easily by lightweight new clients in e.g. Julia or Java. One important aspect that you did not explicitly mention is the availability of the openEO processes, which are actually more important for interoperability than the API itself.
Indeed, although not all users use the Pangeo stack, so for many there’s a learning curve anyway (think other languages, new students, …). The scaling from small datasets to larger datasets is often non-trivial and could be easier with openEO (assuming the back-end implementation is good enough). Also, working with some data scientists in the past has shown that often their understanding of the technology (e.g., xarray) is very bare-bones and they barely get things done. But that’s true for all ecosystems and languages. Another example: if you have a team of, say, R and Python programmers, it is nice to just let them use the same thing so they can collaborate more easily.
Indeed, that’s an open issue we try to mitigate with the following approaches right now:
Improve remote debugging capabilities
We are implementing a “local back-end” so that you can compute and experiment locally
In openEO Platform, we are implementing a more interactive approach similar to GEE so that you can show results on a map more easily and more rapidly.
Yes, I know it. It’s actually on the pinboard right behind me.
When we started we did not find any comparable technology that lets us work across languages in a reproducible way on large datasets. And I still don’t really see one so I feel like there’s good reasoning to have openEO around unless a well established project gets ported and adopted across languages, e.g. an xarray API for R and Julia with simplified approaches for scaling up.
Yes, for example https://editor.openeo.org/?server=https%3A%2F%2Fopeneo.cloud&discover=1
Although the model building doesn’t make it a lot easier, as you still need to (implicitly?) understand important aspects of (functional) programming and data cube handling. Ideally, it would move more in a direction where you can just formulate what you want to know instead of thinking about what reducers, filters, etc. are. But that’s a whole new story and out of scope for openEO. It could be built on top of openEO, but it could also be built on top of the Pangeo stack.
It is very complex indeed, but it will be done by software engineers in the background (ideally). Users don’t and usually can’t work on that aspect. It’s hidden from them so that they don’t need to care. Users implicitly tell the implementation how to optimize by choosing the data cube processes. So by telling what you want to achieve the backend can draw conclusions and optimize correspondingly.
Yes, each back-end can choose which processes to implement. They can also implement custom ones and users can create new processes, too. We are aware that this might reduce interoperability a bit, but usually, we see a similar set of processes being implemented and for openEO Platform for example we have a set of core processes defined that a back-end must support: Processes | openEO Platform Documentation
It’s fine if some back-ends implement only a small subset and work in their own niche for example. As long as there are enough back-ends around that are similar so that you can switch between them, we are fine with it. The openEO Hub for example lets you paste a process graph and show all compliant back-ends so you can choose where to run it. Usually, it shows multiple of them.
That’s what the process descriptions actually try to reduce as much as possible, so that the result a process generates on Spark-based and Pangeo-based back-ends is actually comparable (with regards to the computed values, not with regards to performance or so). This is an additional burden, of course, but if you start with openEO, usually that’s one of your goals anyway. There will be subtle differences that we can’t always neglect, but we ran some experiments where multiple back-ends with different tech stacks actually returned comparable results, and the main differentiator was often just the pre-processing of the datasets.
From a user perspective, dataset formats/chunking is usually not an issue, but sometimes it is (unfortunately) still the case that some understanding of the implementation is required. That will reduce over time with the implementations getting more mature.
Could you elaborate a bit more on what you are heading for?
Yes, right now it is not an OGC standard. We may submit it as an OGC community standard in the future. We tried to build on top of OGC standards, though; when we started, we were still stuck with the old OGC standards, and the OGC APIs were mostly not there yet. Still, we aligned a lot in the past, so we re-use many of them as much as possible for the basic API architecture, maps, tiles, data discovery, etc. What we explicitly do not align with is OGC API - Processes, as it doesn’t yet have a good way to freely chain processes and, most importantly, OGC doesn’t define processes at all. But that’s actually the most difficult part that openEO solves. The API itself is rather slim compared to the processes.
Yes, except for some nuances that is probably correct to say.
I basically agree with all your points here. I hope we can steer this.
We have an open source implementation at GitHub - Open-EO/openeo-processes-python: A Python representation of (most) openEO processes, but it has aged rapidly and was more a proof of concept with some architectural issues. There’s now an initiative to rewrite this, but unfortunately the company writing it has chosen to start closed and release it as open source once it has reached a certain maturity. Once it is open source, though, I could see this becoming a good point for collaboration. It is basically a set of (optimized) process implementations based on Pangeo that others could also re-use and that Pangeo could contribute to.
That’s the aim, although I think the plan right now is to make it available as a Docker container.
There are a couple of ways, but all of them require major work and I’m not sure what can be achieved.
Implement a Pangeo-based backend that runs locally.
Add an Xarray or Array API-based interface to the openEO Python client. This only helps for Python, though, as the Array API proposal is very biased towards Python. To be a real “Array API” and not just a “Python Array API”, it would need involvement and alignment between multiple ecosystems and languages (e.g. R, Julia, openEO, OGC?, …).
I’m not sure yet what the collaboration could look like at the dataset level.
I hope I captured all the important points here; it was quite a lot to digest!
Thanks a lot @m-mohr for the complete answer, this is very much appreciated! And I agree with all the things you just said here.
I was just wondering if the functional-programming nature of the OpenEO API was driven by the foreseen back-end implementations (e.g. Spark), or if it was just a natural choice given the operations/processes that you identified and the need for scaling up. But I guess it is somehow a combination of several aspects.
I’m waiting for it!
I had something in mind like an Xarray.from_openeo() way of opening datasets, but that assumes that on a given OpenEO provider you could also deploy a Pangeo platform with Dask workers in order to have “local” data access.
Yeah I know, sorry for the long post , and thanks again for the comprehensive answers.
The project originated in @edzer’s lab. He is of R fame and R is functional…
Functional just makes a lot of sense when processing large data volumes and transferring “working instructions” between two entities. It’s a bit like writing down cooking recipes for the backend.
From the “functional” instructions it’s easier to figure out how to chunk and parallelize. By having something like “reduce the temporal dimension”, you know you can arbitrarily chunk and parallelize as long as the temporal dimension itself doesn’t get chunked. Try that with nested for loops…
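The parallelization argument above can be shown with a tiny made-up example: when a reducer runs along time only, each spatial pixel’s time series is independent, so spatial chunks can be computed in any order (or on any worker). The data cube here is just a dict of pixel coordinates to time series, purely for illustration.

```python
# Tiny illustration of why "reduce the temporal dimension" parallelizes
# cleanly: no pixel needs data from any other pixel, so spatial chunks
# are independent units of work. Shapes and values are made up.

# 2x2 spatial grid, 3 time steps per pixel.
cube = {
    (0, 0): [1, 5, 3],
    (0, 1): [2, 2, 2],
    (1, 0): [9, 0, 4],
    (1, 1): [7, 8, 6],
}

def reduce_time(chunk, reducer=max):
    # Reduce along time only, one pixel at a time.
    return {pixel: reducer(series) for pixel, series in chunk.items()}

# Split into two spatial chunks and reduce them independently; the merge
# order doesn't matter, which is exactly what a scheduler exploits.
chunk_a = {p: cube[p] for p in [(0, 0), (0, 1)]}
chunk_b = {p: cube[p] for p in [(1, 0), (1, 1)]}
result = {**reduce_time(chunk_a), **reduce_time(chunk_b)}
print(result)  # {(0, 0): 5, (0, 1): 2, (1, 0): 9, (1, 1): 8}
```

With imperative nested loops over the whole cube, the scheduler cannot see this independence, which is the point made above.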
Indeed; also, the software stack seemed mostly oriented towards “functional” (but honestly, this wasn’t considered too much in the beginning).