Discrete Global Grid Systems (DGGS) use with Pangeo

I will be going to BiDS and am excited to hack on vector data cubes.

Those who will participate in sprint@BIDS23 (to hack on DGGS-Xarray) are asked to register at the following two websites.

@allixender unfortunately, we won’t host a satellite event within Turing for logistical reasons, but we’re keen to explore the materials from the upcoming Pangeo and partner-initiative workshops at BiDS and OSGeo community events.

I’ll be there as well and really keen to find out what a ‘vector data cube’ might look like! :wink:

I’ve spent some time reviewing these threads as well as the sprint repo. With the sprint less than a month away, it would be great to align on a more concrete set of goals and start thinking about how to divide up the work effectively.

A lot of the focus in this thread is on the DGGRID package, a command-line tool written in C++. The assumption seems to be that this will be the foundation for an integration between DGGS and Xarray, via a potential Python wrapper for this library. I have studied DGGRID, read the code, and read the manual. I’m a bit concerned that there is a very steep road ahead to make use of this package within the Pangeo ecosystem or from Python in general. :flushed:

First, let me say that the science and software engineering behind DGGRID seem very strong. The manual is detailed and covers many deep issues. It seems to excel at its stated use case.

However, the core problem is that DGGRID is (in its own words) “a command-line application designed to generate and manipulate icosahedral discrete global grids”. It does not appear to have been designed to be used as a library. (There is no API documentation for DGGRID.) Crucially, the way DGGRID communicates is via file IO. (This is how the tests are written, for example: they verify that the files produced are the correct ones.) It has no clear way to pass the relevant data structures through its routines in memory, as we can do with, e.g., NumPy array buffers or Arrow dataframes. So the only way to “communicate” with DGGRID is to read and write files (and this is the approach taken in dggrid4py). The filesystem is an extremely slow interface, so such a software design will not scale to the massive datasets we like to process in Pangeo. Finally, there does not seem to be any notion of parallelism (multi-threading, multi-processing) or streaming within DGGRID.

So the challenge here is not merely wrapping a C++ library in order to call it from Python; it is also refactoring a standalone file-based application into a modular library. The crux of the problem will be which in-memory data structures are used to exchange data between DGGRID and Python applications. Making such major changes will be impossible without the cooperation and buy-in of the library developers. And it will be a lot of difficult work in C++, not a language that most folks in the Pangeo community are very fluent in.

Finally, DGGRID is licensed under the AGPL, meaning that it is essentially unusable for many of the commercial applications and users within the Pangeo community. Relicensing under Apache would be a prerequisite for wider adoption.

So overall I’m concerned that focusing the sprint on this specific task will lead to frustration and a lack of progress.


For these reasons, I would suggest that a more productive path would be to focus on the general problem of how to integrate more complex grid types (H3, S2, various other meshes, vector geometries) with Xarray from an API design point of view. What these all have in common is that the two spatial dimensions (lat / lon) must be represented with a 1D coordinate variable with specialized indexing operations. The xvec API provides a great example of how to approach this challenge.
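
As a rough illustration of that shared pattern (plain Python, not the xvec or Xarray API), the sketch below keeps values along a single 1D axis of cell ids and resolves a lat/lon query to a position on that axis. `toy_cell_id` is a hypothetical stand-in (a simple 10-degree equal-angle grid) for a real H3, S2 or HEALPix encoder, and `CellIndex` is an illustrative name, not an existing class.

```python
import bisect


def toy_cell_id(lat, lon, n_lon=36, n_lat=18):
    """Hypothetical stand-in for a real DGGS encoder.

    Uses a plain 10-degree lat/lon grid, flattened row by row
    into one integer id per cell.
    """
    row = min(int((lat + 90.0) / 180.0 * n_lat), n_lat - 1)
    col = int((lon % 360.0) / 360.0 * n_lon) % n_lon
    return row * n_lon + col


class CellIndex:
    """Maps lat/lon queries to positions along a 1D cell axis."""

    def __init__(self, cell_ids):
        # Keep the ids sorted so that each lookup is a binary search.
        self._order = sorted(range(len(cell_ids)), key=cell_ids.__getitem__)
        self._sorted_ids = [cell_ids[i] for i in self._order]

    def sel(self, lat, lon):
        """Return the position (in the original order) of the cell containing the point."""
        target = toy_cell_id(lat, lon)
        k = bisect.bisect_left(self._sorted_ids, target)
        if k == len(self._sorted_ids) or self._sorted_ids[k] != target:
            raise KeyError(f"no data for cell {target}")
        return self._order[k]


# Values live on one 1D "cell" dimension, not on a lat x lon grid.
cell_ids = [
    toy_cell_id(48.9, 2.3),
    toy_cell_id(40.7, -74.0),
    toy_cell_id(-33.9, 151.2),
]
values = [11.5, 9.0, 21.3]

index = CellIndex(cell_ids)
# A nearby point falls into the same cell as the first entry.
paris_value = values[index.sel(48.5, 2.9)]
```

In a real integration the lookup would presumably sit behind a custom Xarray index so that `.sel()` can accept lat/lon directly; that part is not shown here.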

I hope you see where I’m coming from with these comments. My intention is not to be discouraging. Rather, my hope is to have a very productive sprint with some big progress on important problems. Some advance planning and feasibility analysis seems like a necessary step toward that.

@rabernat I agree on the challenges of DGGRID. I tried to create C++ bindings to use it in Julia, with little success so far. However, it produces the most equal-area cells, which is a major point in building a DGGS. And it allows multiple cell shapes. File access is good enough for case studies but not for production runs, of course.
We can also discuss how to sort the cells in a 1D index in general, which is crucial for chunking and loading time. Furthermore, there are also 2D indices (e.g. the axial coordinates Q2DI in DGGRID), which seem very interesting when it comes to shaping DGGS data into tensors for storage and neural networks.
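
To make the ordering question concrete, here is a minimal sketch of one common space-filling order: a Z-order (Morton) curve over integer 2D cell coordinates. It is a generic illustration and not DGGRID's own indexing; `morton_encode`, `morton_decode` and the 4x4 toy grid are assumptions made for the example.

```python
def morton_encode(i, j, bits=16):
    """Interleave the bits of (i, j) into a single Z-order key."""
    key = 0
    for b in range(bits):
        key |= ((i >> b) & 1) << (2 * b)      # even bit positions come from i
        key |= ((j >> b) & 1) << (2 * b + 1)  # odd bit positions come from j
    return key


def morton_decode(key, bits=16):
    """Recover (i, j) from a Z-order key."""
    i = j = 0
    for b in range(bits):
        i |= ((key >> (2 * b)) & 1) << b
        j |= ((key >> (2 * b + 1)) & 1) << b
    return i, j


# Sort the cells of a 4x4 toy grid along the curve.
cells = [(i, j) for i in range(4) for j in range(4)]
ordered = sorted(cells, key=lambda c: morton_encode(*c))

# Each aligned run of 4 consecutive keys covers one 2x2 block,
# so a chunk of 4 cells in the 1D array stays spatially compact.
first_chunk = ordered[:4]
```

With such an ordering, cells that are neighbours in space tend to end up in the same chunk, which is the property that matters for chunking and loading time.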

Thanks for your reply @dloos! I agree that the algorithms available in DGGRID are state of the art.

My concern is more about the nature of the work involved. Do we have the right C++ expertise (and support from the DGGRID core developers) at this sprint to make the sort of major changes that are needed to turn this into a high-performance library with bindings to multiple languages? I certainly do not! :upside_down_face:

However, I would be happy to discuss the data structure problem. Arrow data structures are becoming a standard for this sort of thing. Could they be leveraged here?

We recently submitted a manuscript along with published datasets using dggrid + hexwatershed: Icosahedral Snyder Equal Area DGGs-based Flow Routing Datasets for the Amazon Basin. This is the first DGGs-based flow routing dataset as far as I know.
The discussion should be open soon.
The recent updates in dggrid really simplify a lot of the workflow for our model as well.

Hi @rabernat , all,

thanks for the valuable feedback. Some information from my side: I agree on the C++ challenge, and I do acknowledge the license situation. I mean, H3 is a C library, and S2 is a C++ library, Spherely sits on S2, and so do many other projects. But you know that :slight_smile:

I have been working with the main developer, Prof. Sahr, over the course of this year. I have funded him a bit here and there where possible to move DGGRID in the desired direction. The latest version, 8.0b, supports the space-filling-curve 1D index types Z3 and ZORDER (and, for longer already, unique stable sequence numbers for cells, a.k.a. seqnum). There is oxygen in the project, and we recently agreed to build a Sphinx/Doxygen Read the Docs project to improve accessibility. There is a great PDF manual for users, but indeed no real API documentation. This project needs more community adoption before it can become a reliable foundation for dependent projects in the Pangeo/Xarray ecosystem. I will likely try to support this development further.

For the code sprint at hand, it indeed might not be a tangible goal. However, as @dloos said, if we settle too early for low-hanging fruit like H3, we will not develop equal-area DGGS (S2 generic Python bindings are still challenging to include in the build process, too).

@tinaok is now getting really interesting results with HEALPix.

But to get a better shared understanding of how indexed DGGS cells work with Xarray, we can find something more tangible for the code sprint. Also, 1D indexing with a space-filling curve, moving towards Zarr and Arrow/Parquet, would be amazing.

For these reasons, I would suggest that a more productive path would be to focus on the general problem of how to integrate more complex grid types (H3, S2, various other meshes, vector geometries) with Xarray from an API design point of view.

Since I’ll be at the sprint, I’ll be happy to share my experience working on Python bindings for S2, as well as on Xarray indexes, both in Xarray itself and in packages like xvec.

I think that a good and reasonable goal for the sprint would be to come up with an xarray-dggs package that would provide a very basic set of features (similarly to xvec):

  • A few Xarray custom indexes that could be built from lat/lon data (or directly from DGGS cell indices) and that would enable data selection using .sel()
    • For HEALPix, the healpy Python package seems to already have the functionality needed, as shown here and from @tinaok’s results. I have not checked if the Python bindings are vectorized, though.
    • For S2, the pys2index library provides Python bindings for point data, which I think already fits most use cases for DGGS. There’s also the more ambitious spherely library, but it is not ready yet (still quite some work) and it is meant more for cases where we need to handle more complex geospatial features like lines, polygons, etc. I created and maintain both of these libraries, so I can definitely help with adding more functionality if needed. I can also say that since all the boilerplate is already implemented there (thanks to pybind11, scikit-build-core, etc.), some features could be really straightforward to add with just a few lines of C++ code!
    • For H3, I guess we could reuse the “official” Python bindings? Last time I checked there were only a few vectorized functions, though.
  • Xarray Dataset and/or DataArray accessors for DGGS-specific API
    • Set new DGGS Xarray indexes from lat/lon coordinates
    • Get DGGS cell indices as a new coordinate
  • An Xarray I/O backend (e.g., loading data where DGGS cell indices were saved in order to speed up rebuilding the Xarray indexes)

I think that DGGS grids have enough in common to expose the functionality for all of them in a common xarray-dggs package, maybe with optional dependencies for each backend (healpy, pys2index, h3-python, etc.).
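
One possible way to wire up such optional dependencies, sketched here with the standard library only: a small registry that checks which backend modules are importable and fails with a clear message otherwise. The `BACKENDS` mapping and the function names are hypothetical, not part of any existing package; only the module names (`healpy`, `pys2index`, `h3`) refer to the real libraries mentioned above.

```python
import importlib
import importlib.util

# Hypothetical mapping from grid name to the module that would implement it.
BACKENDS = {"healpix": "healpy", "s2": "pys2index", "h3": "h3"}


def available_grids(backends=BACKENDS):
    """Return the grid names whose backend module is installed."""
    return sorted(
        name
        for name, module in backends.items()
        if importlib.util.find_spec(module) is not None
    )


def require_backend(grid, backends=BACKENDS):
    """Import and return the backend module for a grid, or fail clearly."""
    if grid not in backends:
        raise ValueError(f"unknown grid {grid!r}; known grids: {sorted(backends)}")
    module = backends[grid]
    if importlib.util.find_spec(module) is None:
        raise ImportError(
            f"grid {grid!r} needs the optional dependency {module!r}"
        )
    return importlib.import_module(module)
```

The common package would then import nothing heavy at load time, and users would only install the backends for the grids they actually use.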

I like your plan Benoit!

I’m not sure we want or need a dedicated backend (e.g. a new file format) just for DGGS. But one thing I’ve been thinking about is to add pluggable decoders to Xarray. That would allow us to use any standard container (e.g. NetCDF, Zarr) to store the data, and then have it decoded to the right in-memory data structure on reading, the way we do today, e.g. with datetimes. This was also discussed in an xvec issue.
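
To make the idea a bit more concrete, here is a rough sketch in plain Python (this is not an existing Xarray API): values are stored as plain integers plus an attribute naming their encoding, and a registry maps that attribute to a decoding function applied on read. The registry, the `encoding` attribute and the `dggs_cell` decoder are all hypothetical names.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical registry: encoding name -> decoding function.
DECODERS = {}


def register_decoder(name):
    """Decorator that registers a decoder under an encoding name."""
    def wrap(func):
        DECODERS[name] = func
        return func
    return wrap


@register_decoder("seconds_since_epoch")
def decode_time(raw, attrs):
    # Analogous to how stored integers become datetimes today.
    epoch = datetime(1970, 1, 1, tzinfo=timezone.utc)
    return [epoch + timedelta(seconds=s) for s in raw]


@register_decoder("dggs_cell")
def decode_cells(raw, attrs):
    # Attach the grid metadata to each stored integer cell id.
    return [(attrs["grid"], attrs["level"], cell) for cell in raw]


def decode(raw, attrs):
    """Apply the registered decoder, or return the raw values unchanged."""
    func = DECODERS.get(attrs.get("encoding"))
    return func(raw, attrs) if func else list(raw)


# What a variable might look like as stored in a standard container.
stored = {
    "data": [468, 496],
    "attrs": {"encoding": "dggs_cell", "grid": "toy", "level": 3},
}
cells = decode(stored["data"], stored["attrs"])
```

The stored file stays a standard container with integer arrays and attributes; only the in-memory representation changes depending on which decoders are installed.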

Agreed, pluggable decoders to Xarray would be better for this. One more Xarray extension mechanism, I like it :slight_smile: .

The sprint on DGGS at the BiDS was very productive! Great discussions with @keewis, @allixender, @rabernat, @tinaok, @strobpr, @dloos, @acocac, @annefou and others I’m missing!

It led to the creation of the benbovy/xdggs repository on GitHub, an Xarray extension for DGGS (to be moved soon to a better place).

There is a design document about the Xarray extension here: https://github.com/benbovy/xdggs/blob/main/design_doc.md

Any feedback would be very much appreciated! I’ve created a GitHub discussion for that purpose: “Feedback on the design document” (benbovy/xdggs, Discussion #22).

@benbovy: write a quick blogpost!

We intended to, indeed. It’s a complex topic, with many different perspectives of experts and practitioners. Stay tuned.

Happy to make the Pangeo Blog available for this topic. Just let me know if you’d like to add a post there.

Do we want to write the blog article collaboratively, e.g. using a shared Google Doc before publishing? I’d like to add a paragraph explaining why indexing is so important, making a DGGS more than just a tessellation.

Absolutely! This aspect is way too often neglected!

I’d be happy to chime in.
