Geographic Index

The spatial data used is stored in different formats: NetCDF, Zarr, GeoTIFF. The data are indexed in relational databases to make it easy to search for the spatial and temporal location of interest in a given dataset. This approach requires implementing a database, data ingestion, SQL queries, and so on.
In the new architectures we imagine around the Pangeo software stack (Dask, Xarray), this type of design is not the most suitable. So we considered designing a new library that implements a geographic index with an embedded database. This prototype is built around the GeoHash algorithm. The computed indexes are stored in an embedded key/value database.

Geohash

The full description of this algorithm is given on its Wikipedia page.
In short, the idea is to transform the longitude and latitude of a position into a single number, in which the bits representing the two coordinates are interleaved and encoded as a 64-bit integer.

[Figure: interleaving of the longitude and latitude bits into a single GeoHash integer]

To truncate the value to a given precision, we remove the unnecessary bits. Each resulting value divides the earth into bounding boxes, larger or smaller depending on the chosen accuracy. This value can be encoded in base 32 to represent the code as a string of characters. The next figure shows the boxes obtained for a precision of 2 characters.
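
To make this concrete, here is a minimal pure-Python sketch of the standard encoding described above (an illustration only, not the prototype library's implementation):

BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"  # the GeoHash base-32 alphabet

def encode(lon: float, lat: float, precision: int = 5) -> str:
    """Interleave longitude/latitude bits, then map every 5 bits to base 32."""
    lon_interval, lat_interval = [-180.0, 180.0], [-90.0, 90.0]
    bits, code = [], []
    even = True  # the standard algorithm starts with a longitude bit
    while len(code) < precision:
        interval, value = (lon_interval, lon) if even else (lat_interval, lat)
        mid = (interval[0] + interval[1]) / 2
        if value >= mid:
            bits.append(1)
            interval[0] = mid
        else:
            bits.append(0)
            interval[1] = mid
        even = not even
        if len(bits) == 5:  # every 5 bits yield one base-32 character
            code.append(BASE32[int("".join(map(str, bits)), 2)])
            bits = []
    return "".join(code)

print(encode(-5.6, 42.6, 5))  # "ezs42", the classic example cell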

The library implements various functions to find the GeoHash codes neighboring a given code and to generate a GeoHash grid for a given precision.
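
For illustration, a hypothetical sketch of the neighbor search (reusing BASE32 and encode() from the snippet above; this is not the library's actual API): decode a code back to its bounding box, then re-encode the centers of the surrounding cells.

def decode_box(code: str):
    """Return ((lon_min, lon_max), (lat_min, lat_max)) of a GeoHash cell."""
    lon, lat = [-180.0, 180.0], [-90.0, 90.0]
    even = True
    for char in code:
        for bit in f"{BASE32.index(char):05b}":
            interval = lon if even else lat
            # bit 1 keeps the upper half of the interval, bit 0 the lower half
            interval[0 if bit == "1" else 1] = (interval[0] + interval[1]) / 2
            even = not even
    return (lon[0], lon[1]), (lat[0], lat[1])

def neighbors(code: str):
    """The up-to-8 GeoHash codes surrounding a given code."""
    (lon0, lon1), (lat0, lat1) = decode_box(code)
    lon_c, lat_c = (lon0 + lon1) / 2, (lat0 + lat1) / 2
    dlon, dlat = lon1 - lon0, lat1 - lat0
    result = set()
    for i in (-1, 0, 1):
        for j in (-1, 0, 1):
            if i == j == 0:
                continue
            lon = (lon_c + i * dlon + 180.0) % 360.0 - 180.0  # wrap at the antimeridian
            lat = lat_c + j * dlat
            if -90.0 <= lat <= 90.0:  # drop cells beyond the poles
                result.add(encode(lon, lat, len(code)))
    return sorted(result)

print(neighbors("ezs4"))  # the cells adjacent to "ezs4"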

The figure below shows an indexed SWOT track.

Storage

For the moment, the technical solution chosen to store the generated index is UnQLite, but we can imagine using RocksDB in the future. The database remains a small file because only elementary information is stored for each GeoHash code generated at the desired accuracy. In our example, we have indexed satellite passes stored in NetCDF or Zarr files. For NetCDF files, we store, for each GeoHash code, the file concerned together with the start and end indexes covering the area corresponding to that code. For Zarr files, we store only the indexes associated with the GeoHash code. The chosen architecture allows storing any value that can be pickled.

Example of use

import collections
from typing import Dict, List, Tuple

import numpy as np


def get_geohash_from_file(paths: List[str], precision: int
    ) -> Dict[bytes, List[Tuple[str, Tuple[int, int]]]]:
    """Build an associative dictionary between GeoHash codes and the file
    name and first/last indexes of the data selected for that code."""
    # `geohash` is the prototype library; `read_positions_satellite` is a
    # user helper returning the longitude/latitude positions of a pass.
    hashs = collections.defaultdict(list)
    for path in paths:
        points = read_positions_satellite(path)
        # Calculate geohash codes for each position read
        idents = geohash.string.encode(points, precision)
        # Calculate indices where consecutive geohash codes differ
        indexes = np.where(idents[:-1] != idents[1:])[0] + 1
        # First/last index of each box, including the final segment up to
        # the end of the pass
        starts = np.insert(indexes, 0, 0)
        ends = np.append(indexes, len(idents))
        # Finally create a map between geohash code and first/last index
        for code, coordinates in zip(idents[starts], zip(starts, ends)):
            hashs[code].append((path, coordinates))
    return hashs

# UNQLITE is assumed to be the path of the database file holding the index
index = geohash.index.init_geohash(
    geohash.storage.UnQlite(UNQLITE, mode="w"),
    precision=3,
    synchronizer=None)
data = get_geohash_from_file(path_to_nc_file, index.precision)
index.update(data)

In terms of search performance, a query for one bounding box takes a few milliseconds. This depends directly on the performance of the chosen storage engine.

len(index.query_box(
    geohash.Box(geohash.Point(-135, 0),
                geohash.Point(-90, 45))))

What do you think of this approach? Good? Bad?


This is a fascinating idea! Thanks so much for sharing. It will take me a while to digest, but I think the concept is great.

This is interesting. I had a look at this a while back after @pbranson introduced me to space-filling curves. I think GeoHash, and also c-squares (http://www.cmar.csiro.au/csquares/spec1-1.htm), aroused great interest at the time. Personally, I think building the caching and key generation in as an xarray accessor would be amazing, allowing fast indexing of unsorted data sets such as Argo floats.
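
As a rough sketch of that accessor idea (hypothetical names only: it reuses the toy encode() from the first post and assumes a dataset with "longitude" and "latitude" variables along an "obs" dimension):

import xarray as xr

@xr.register_dataset_accessor("geohash")
class GeohashAccessor:
    """Toy accessor: compute GeoHash codes and group observations by cell."""

    def __init__(self, ds: xr.Dataset):
        self._ds = ds

    def codes(self, precision: int = 5, dim: str = "obs") -> xr.DataArray:
        """GeoHash code of each observation along ``dim``."""
        values = [
            encode(float(lon), float(lat), precision)
            for lon, lat in zip(self._ds["longitude"].values,
                                self._ds["latitude"].values)
        ]
        return xr.DataArray(values, dims=dim, name="geohash")

    def groupby_cell(self, precision: int = 5, dim: str = "obs"):
        """Group unsorted observations (e.g. Argo profiles) by GeoHash cell."""
        return self._ds.groupby(self.codes(precision, dim))

# ds.geohash.groupby_cell(precision=3) would then yield one group per cell.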

Thanks for posting this! This post was the push I needed to join the community and join the discussion. I have used geohashes for spatial indexes in many applications and generally been very pleased with them. I have generally used them to build indexes for vector data but I could see some use cases for other datatypes as well. I have a few comments and questions based on my experience.

Questions:

  1. It is likely my ignorance of SWOT data, but can you share a bit more about what your data look like? Do you have a bunch of small files where each one covers a small section of the earth? Or is this a few large files covering large parts of the globe?

  2. What types of operations and workloads are you trying to optimize? For example, are you looking to retrieve a single value for some particular point on earth? Are you looking for all files that correspond to a particular region on earth to perform some sort of large scale operation? Some other application?

Comments:

  1. One of the trickiest parts of using geohashes for indexing is deciding what length of geohash you want to use. In my experience, this largely depends on the scale of your data and the queries/operations you want to perform. There generally isn't a "one size fits all" approach that works everywhere. This often needs to be tuned for each workload and data set. Luckily, there aren't that many options, and a little "back of the envelope" calculation (see the sketch after this list) will often get you pretty close to a good choice.

  2. Related to my first comment but one minor issue with geohashes is that their size is not uniform. The size of the geohash varies with latitude. This can sometimes be important for very specific applications like bounded nearest neighbor searches. I recently wrote a short blog post about geohash size variation you might find useful or interesting. https://bhaugen.com/blog/geohash-sizes/
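
As a rough illustration of that back-of-the-envelope calculation (cell sizes per precision in degrees, and in km at the equator using ~111 km per degree; the longitudinal extent shrinks with the cosine of the latitude, which is the non-uniformity mentioned in the second comment):

import math

for precision in range(1, 9):
    bits = 5 * precision
    lon_bits = math.ceil(bits / 2)  # encoding starts with a longitude bit
    lat_bits = bits // 2
    width_deg = 360.0 / 2 ** lon_bits
    height_deg = 180.0 / 2 ** lat_bits
    print(f"{precision}: {width_deg:10.6f} x {height_deg:10.6f} deg "
          f"(~{width_deg * 111:8.2f} km x {height_deg * 111:8.2f} km at the equator)")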

In general, I am very much in favor of geohash-based indexing and would encourage exploration in this direction.


The SWOT data disseminated to users will be NetCDF format files covering half of the mission orbit (see the figure of the track above). Higher resolution products will be available, covering only a small geographical area.

These data will probably not be used on the CNES cluster to run studies. A format compatible with Dask, such as Zarr, should be used. The technical solution has not yet been decided, but there is a 95% chance that Zarr will be used.

The index will be used to calculate, for example, crossover points for this mission or with other classical altimetry missions. The index will allow us to select exactly what data is common between two tracks in a given spatial/time window. Typically +/- 10 days.
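
A hypothetical sketch of that pre-selection, reusing the get_geohash_from_file() helper from the first post (files_track_a and files_track_b are assumed lists of files, one per track):

# Candidate crossover cells are the GeoHash codes seen in both tracks
track_a = get_geohash_from_file(files_track_a, precision=3)
track_b = get_geohash_from_file(files_track_b, precision=3)

common_cells = set(track_a) & set(track_b)
candidates = {code: (track_a[code], track_b[code]) for code in common_cells}
# Only the file/index ranges in `candidates` need to be read and consolidated
# before computing the exact crossover points and applying the time window.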

But we can use it to co-locate SWOT data with ARGO or other spatial data like S3.

The variable sizes of the GeoHash boxes are not a problem for us at the moment, as we want to use them to divide our search problem. Once the data has been selected, we have to consolidate it in any case before processing it.

Another solution I want to examine is the technical solution implemented at Uber: https://eng.uber.com/h3/


It sounds to me like using geohashes to create an index would be a reasonable approach to finding crossover points from other datasets.

We also looked at h3 but eventually settled on using geohashes. H3 has some nice properties that are useful for some applications. For example, I believe the hexagons are uniform in size and don't get smaller near the poles. The primary reason we chose geohashes was the ability to truncate the hash to have an index at a lower resolution without recomputing the hash. For example, if you have the hash "ezs42", every point inside that hash also falls inside of "ezs4".

H3 cells are only "approximately contained within a parent cell". This means that if you want to change the resolution of your index, you need to recompute the index for every item. The property is less important if you anticipate only using your index at a single resolution.
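
For example, with the toy encode() sketch from earlier in the thread, the truncation property looks like this:

# Any point whose precision-5 code is "ezs42" has precision-4 code "ezs4",
# so a coarser index is just a prefix lookup, with no re-hashing required.
lon, lat = -5.60, 42.60
full = encode(lon, lat, precision=5)    # "ezs42"
coarse = encode(lon, lat, precision=4)  # "ezs4"
assert full.startswith(coarse)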


Thanks for this @fbriol !

I recently used h3 on the nadir altimetric data in order to bin data geographically and extract tidal responses from the topex/poseidon altimetric database (with the help of pytide of course :wink: ). This was part of a TIPE project for three students. The implementation was very smooth.

What library are you actually relying upon with geohash?
I missed this when going through your post.
My 2 cents about this is that the size of the community behind each of these tools should weigh in your ultimate choice.

Also do you plan any interactions between the geohashing and pangeo-pyinterp?


The goal is to have a geographical index on a set of files that can easily be used with Dask or another HPC calculation solution.

Creating the index is part of the problem. The other problem is storing this index.

The purpose of this prototype study (https://github.com/fbriol/pangeo-geohash) is to explore these two issues. The GeoHash codes are easy to generate, and it's swift. Storage in a key-value database is the right solution. Concurrent updating is difficult at the moment, but I think this is not a problem. For now, I have tried two storage engines: UnQLite and LevelDB. LevelDB has too many limitations on access to be usable. Facebook has developed RocksDB to fix these and other problems. For now, UnQLite is the right solution; RocksDB, maybe, for the future.

As for indexes, H3 is maybe an alternative that could easily be implemented in a library, alongside GeoHash codes. I'll try to do some tests on this library next week.

Is there a relationship between the pyinterp library and a GeoHash library? I'm not sure.

Also, I don't know the communities around these technologies at all.

After some thought, I wonder whether I should integrate the geohash code into the pyinterp library. In the end, indexing the data is fine, but then you have to process it, in most cases to select the closest points. In pyinterp, I have the RTree structure, which allows me to do this. What's more, it leaves only one library to maintain. What do you think about it?

I think it does make sense @fbriol.

My ideas are not so clear on pyinterp however: does it support chunked xarray DataArrays at the moment?

Yes, you can build interpolators from xarray, but the arrays cannot be chunked because they must be contiguous in memory.

I am under the impression there may then be a double motivation for adding geohash to pyinterp for workloads with large data arrays.
Couldn't you indeed group data geographically with geohash and leverage the groups to distribute an interpolation workload?
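
A rough sketch of that idea (the task body is a placeholder; `hashs` is the dictionary built by get_geohash_from_file() above):

import dask

@dask.delayed
def interpolate_cell(code, entries):
    # Placeholder: read the (file, index-range) pairs for this cell and run
    # the actual interpolation on the corresponding points.
    return code, len(entries)

# One delayed task per GeoHash cell, so Dask can spread the cells over workers
tasks = [interpolate_cell(code, entries) for code, entries in hashs.items()]
results = dask.compute(*tasks)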

Yes. Also, in most of our use cases, we will search for the closest points to a given position to perform an interpolation.

I've just started working on indexing geographical data using a similar approach. But rather than using geohashes, I reuse the s2geometry library, which is also used in popular databases like ArangoDB and MongoDB and which presents several advantages over geohashes (it maps points onto connected Hilbert curves instead of a Z-curve, so locality is better preserved).

I had a brief look at H3, but I think it currently lacks some built-in features for querying indexes like nearest neighbor lookup, etc.

s2geometry has Python bindings created with SWIG, but those are not really vectorized, so I created some custom numpy-compatible wrappers in https://github.com/benbovy/pys2index (very much work in progress).

Some quick and naive benchmarks show that it is quite efficient! For example, the query times compared to scikit-learn's BallTree with the haversine metric:

[Figure: query time benchmark, pys2index.S2PointIndex vs scikit-learn BallTree (haversine)]

Index build time is also significantly faster for large data.

This will eventually be included in the xoak (https://github.com/ESM-VFC/xoak) library (very much work in progress too, with @willirath), which provides all the convenient API to use those indexes for xarray advanced indexing (currently as xarray accessors but it should just provide xarray-compatible indexes once xarray supports flexible indexes).


@fbriol, FWIW, I updated the benchmarks by including pyinterp.RTree (BTW, pyinterp looks very nice!):

I should mention that pys2index.S2PointIndex is a direct wrapper of s2geometry's S2PointIndex, which I think uses a B-Tree in memory, so it is still not very well suited for indexing massive amounts of data in distributed environments.

It does not rely on an embedded key-value database of any kind, even though I think that would be a very exciting idea to explore (among others? Dask's Actors?), using either GeoHash, H3, or s2geometry.

That said, it's not yet very clear to me when exactly (which use case? which data?) those alternatives would start showing real benefits over common, in-memory (tree-based) solutions...

Note that SpatialPandas already provides Hilbert-curve based spatial indexing using Dask, vectorized with Numba. In benchmarks I seem to remember seeing that it was much faster than geohashing, but I'm not sure where that data ended up. See https://nbviewer.jupyter.org/github/holoviz/spatialpandas/blob/master/examples/Overview.ipynb for an overview.