I’m starting this new topic to get insights from the community on work we’d like to start at CNES together with international engineers and scientists. As the (joke) title says, the goal is to work on and contribute to open source software, services and standards (as well as internal services) so that any institution can deploy an equivalent of Google Earth Engine and/or Microsoft Planetary Computer.
External work, building blocks and gaps we have identified so far:
Enable interactive and programmatic visualisation inside Jupyter notebooks, e.g. a code cell on the left and a GIS viewer or similar on the right, à la GEE. Using lazy Xarray with JupyterGIS or the Datashader toolset should make this possible. Bonus points for 3D datasets. We have already started some work with QuantStack on JupyterGIS, @davidbrochart.
Going from notebooks to dashboards with EO datasets and complex analysis (dashboarding in geoscience).
EO data to Xarray, STAC to Xarray. I know there are some existing tools, but they often lack some functionality or are not well maintained, like rioxarray, geoutils, odc-geo… We started discussing this with @remi-braun and the EOReader library.
Work on satellite data file formats: GeoZarr, COG…
Enable easy building of a STAC catalog from an arbitrary bucket or directory of EO data.
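To make the last point more concrete, here is a minimal sketch of what "a STAC catalog from a directory" could look like using only the Python standard library. Everything in it is an assumption for illustration: the function names, the list of file suffixes, the use of the file modification time as `datetime`, and the choice to leave `geometry` null (which the STAC Item spec allows) because reading the real footprint would need a raster library. A real tool would extract footprint, CRS and acquisition date from the data itself.

```python
import json
from datetime import datetime, timezone
from pathlib import Path

STAC_VERSION = "1.0.0"
# Assumption: these suffixes are what counts as "EO data" in the directory.
DATA_SUFFIXES = {".tif", ".tiff", ".jp2", ".nc"}


def build_item(path: Path, root: Path) -> dict:
    """Describe one file as a minimal STAC Item without a footprint."""
    rel = path.relative_to(root)
    # Assumption: modification time stands in for the acquisition date.
    mtime = datetime.fromtimestamp(path.stat().st_mtime, tz=timezone.utc)
    return {
        "type": "Feature",
        "stac_version": STAC_VERSION,
        # Build the id from the relative path so nested files stay distinct.
        "id": "-".join(rel.with_suffix("").parts),
        "geometry": None,
        "properties": {"datetime": mtime.strftime("%Y-%m-%dT%H:%M:%SZ")},
        "links": [
            {"rel": "root", "href": "./catalog.json", "type": "application/json"},
            {"rel": "parent", "href": "./catalog.json", "type": "application/json"},
        ],
        "assets": {"data": {"href": "./" + rel.as_posix(), "roles": ["data"]}},
    }


def build_catalog(root, catalog_id: str = "local-eo-catalog") -> dict:
    """Write catalog.json plus one <item id>.json per data file under root."""
    root = Path(root)
    data_files = [
        p for p in sorted(root.rglob("*"))
        if p.is_file() and p.suffix.lower() in DATA_SUFFIXES
    ]
    catalog = {
        "type": "Catalog",
        "stac_version": STAC_VERSION,
        "id": catalog_id,
        "description": "Catalog generated from a directory of EO files",
        "links": [
            {"rel": "root", "href": "./catalog.json", "type": "application/json"}
        ],
    }
    for path in data_files:
        item = build_item(path, root)
        item_name = f"{item['id']}.json"
        (root / item_name).write_text(json.dumps(item, indent=2))
        catalog["links"].append(
            {"rel": "item", "href": f"./{item_name}", "type": "application/geo+json"}
        )
    (root / "catalog.json").write_text(json.dumps(catalog, indent=2))
    return catalog
```

Calling `build_catalog("/path/to/data")` writes the JSON files next to the data. The same idea would apply to a bucket by swapping the directory walk for an object listing.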
Internally, we also have things to achieve:
Open our S3-enabled data lake to the outside world, which is more of an infrastructure and security question.
Make our services (like Geodes catalog) more standard (real STAC).
Provide a Binder-like service with full access to our data and catalog and with big enough resources: Binder on HPC, or on the cloud at CNES.
I know there is a lot of work going on around all of this, and I’m curious to get some feedback from members of this community on any of these points. It would be really valuable for us to identify precisely where we could help the most!
Thanks for opening this topic!
There is a loooot that has been done, but I think there is a lot left to do on the complete maturity path.
On the dask / xarray / STAC / EO data topic, I’d say the “usual” use cases are well covered: taking already orthorectified free data such as Sentinel-2 and opening it in xarray or creating a STAC item is really easy to do, even in lazy mode.
However, things are less mature with edge use cases and lazy loading (i.e. dealing with proprietary data that needs, for example, RPC orthorectification).
There is no open source library that aims to port all GDAL utilities in an xarray/lazy way. What we currently have on the shelf is:
an almost GDAL-complete library that is not designed for lazy xarray use (rasterio)
a slightly less GDAL-complete library that works with xarray but not with dask and doesn’t really intend to (rioxarray)
a use-case-dedicated library that only ports its own needs to lazy xarray (odc-geo, which has very good features such as a dask-compatible COG writer)
a promising but early stage development library (geoutils)
Thanks for this answer! However, I am not sure we are talking about exactly the same things here. Or maybe I just didn’t correctly understand what you meant. Sorry in advance.
I would first state that I think modern geospatial libraries should rely on xarray (for rasters) and be able to leverage dask where possible to do lazy operations. Libraries that do so are the ones I would call mature enough and well integrated into the emerging geospatial ecosystem. Of course this statement is debatable, but xarray and dask are used widely enough now not to be just a trend with no future.
As you pointed out, on the road to GDAL-completeness, the GDAL bindings in Python are top-notch and by their nature can hardly be beaten. However I never found any proper tutorial and documentation working with ogr + xarray + dask.
In my opinion (which can of course change if proper documentation is linked), there is still a fair amount of work to do to get a user-friendly mix of GDAL + xarray + dask (and I include an easy conda/pip install in that, wink to GDAL) that any geospatial developer can use easily.
All this has to be read from a user perspective: I am sorry, but cloud native distributed geospatial Python is still very, very complex to handle and has many gotchas, dead ends or duplicated implementations.
However I never found any proper tutorial and documentation working with ogr + xarray + dask
That’s right (GDAL, not OGR, though). I’m hoping to change that, but I’m still at a very early stage as a Python developer. I’m really appealing to the broader community, which I think sadly understands what GDAL is through one influential downstream library. The need for and appropriateness of that downstream role has passed. Still, we can craft much smarter inputs to rasterio/dask graphs than we currently get. The vrt:// connection goes a long way and, as Rich Signell points out, we’ll have entire pipelines expressible as short strings. That will improve what rasterio can do immensely, but it will still be stuck behind the 2D bands model (except in some special cases where casting down is beneficial, like with vrt:// transpose).
I know this isn’t helpful to you today, but if it triggers interest in anyone (hopefully), I think it’s worth sharing. I’m trying to combat what I see as a sometimes slightly mistaken mantra.
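For readers who have not met the vrt:// connection mentioned above, here is a small sketch of the "pipeline as a short string" idea. `vrt_connection` is a hypothetical helper written for this post, not part of GDAL or rasterio, and the URL is a placeholder. `bands` and `ovr` are options I believe GDAL documents for vrt://, but which options are available depends on the GDAL version, so treat the specific keys as assumptions to check against the VRT driver docs.

```python
def _format_value(value) -> str:
    """Render an option value; lists and tuples become comma-separated."""
    if isinstance(value, (list, tuple)):
        return ",".join(str(v) for v in value)
    return str(value)


def vrt_connection(source: str, **options) -> str:
    """Build a GDAL-style vrt:// connection string (hypothetical helper).

    `source` is any dataset path GDAL can open; `options` are passed
    through as key=value pairs without validation.
    """
    if not options:
        return f"vrt://{source}"
    query = "&".join(f"{key}={_format_value(val)}" for key, val in options.items())
    return f"vrt://{source}?{query}"


# Placeholder URL: select bands 4, 3, 2 and read from overview level 2.
uri = vrt_connection(
    "/vsicurl/https://example.com/scene.tif", bands=[4, 3, 2], ovr=2
)
print(uri)
# → vrt:///vsicurl/https://example.com/scene.tif?bands=4,3,2&ovr=2
```

With a recent enough GDAL build, a string like this can be handed directly to whatever opens datasets (for example `rasterio.open(uri)`), which is what makes it a convenient input to a lazy graph.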
Thanks for the replies so far. Also pinging @maxrjones because I’d love some feedback on how to help with GeoZarr and file formats (and more generally Development Seed’s view on all that), and @TomAugspurger for his experience at Microsoft.
Great topic and discussion! I hope to get a bit of time soon to share some more general thoughts, but just on the GeoZarr question there will be a special LPS preparation meeting Tuesday May 27 at 5 PM CEST if you’re available. The connection information is available on the community calendar.