The spatial data we use are stored in different formats: NetCDF, Zarr, GeoTIFF. The data are indexed in relational databases so that the spatial and temporal extent of a given dataset can be searched easily. This approach requires setting up a database, ingesting the data, writing SQL queries, and so on.
In the new architectures that we envision around the Pangeo software stack (Dask, Xarray), this type of design is not the most suitable. So we started designing a new library that implements a geographic index on top of an embedded database. This prototype is built around the GeoHash algorithm. The computed indexes are stored in an embedded key/value database.
Geohash
The full description of this algorithm can be found on its Wikipedia page (https://en.wikipedia.org/wiki/Geohash).
To make it short, the idea is to transform the coordinates of a position, defined by its longitude and latitude, into a single number by interleaving the bits of the two coordinates into one 64-bit integer.
To truncate the value to a given precision, the unnecessary bits are removed. Each value obtained divides the Earth into bounding boxes, larger or smaller depending on the chosen precision. This value can be encoded in base 32 to represent the code as a string of characters. The next figure shows the boxes obtained for a precision of 2 characters.
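To make the principle concrete, here is a minimal pure-Python sketch of the encoding step (bit interleaving followed by base-32 encoding). It is only an illustration of the algorithm, not the library's implementation, which works on 64-bit integers and NumPy arrays.

def encode(lon: float, lat: float, precision: int = 12) -> str:
    """Encode a longitude/latitude pair into a GeoHash string (sketch)."""
    base32 = "0123456789bcdefghjkmnpqrstuvwxyz"
    lon_interval = [-180.0, 180.0]
    lat_interval = [-90.0, 90.0]
    code = []
    bits, char, even = 0, 0, True
    while len(code) < precision:
        # Alternate between longitude (even bits) and latitude (odd bits)
        interval, value = (lon_interval, lon) if even else (lat_interval, lat)
        mid = (interval[0] + interval[1]) * 0.5
        char <<= 1
        if value >= mid:
            char |= 1
            interval[0] = mid
        else:
            interval[1] = mid
        even = not even
        bits += 1
        if bits == 5:
            # Every 5 bits produce one base-32 character
            code.append(base32[char])
            bits, char = 0, 0
    return "".join(code)

# Two characters identify one of the 1024 level-2 boxes of the figure
print(encode(-70.0, 40.0, precision=2))  # -> "dr"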
The library implements different functions, for example to find the GeoHash codes neighboring a given code or to generate a GeoHash grid for a given precision.
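For illustration, such calls could look like the lines below. The function names neighbors and bounding_boxes are assumptions made for this example and may not match the prototype's actual API; only geohash.Box and geohash.Point appear in the code later in this post.

# Hypothetical names, for illustration only: the prototype's functions
# may be called differently.
codes = geohash.string.neighbors(b"b7z")   # the 8 cells adjacent to "b7z"
grid = geohash.string.bounding_boxes(      # all cells of precision 2
    geohash.Box(geohash.Point(-180, -90), geohash.Point(180, 90)),
    precision=2)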
The figure below shows an indexed SWOT track.
Storage
For the moment, the technical solution chosen to store the generated index is UnQLite, but we can imagine using RocksDB in the future. The database remains a small file, because only elementary information is stored for each GeoHash code generated at the desired precision. In our example, we have indexed satellite passes stored in NetCDF or Zarr files. In the case of NetCDF files, we store, for each GeoHash code, the file concerned together with the start and end indexes covering the area corresponding to that code. For Zarr files, we store only the indexes associated with the GeoHash code. The chosen architecture allows storing any value that can be pickled.
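As a sketch of the kind of records involved, assuming the unqlite Python binding and pickle (the file names and codes below are made up for the example), the stored values could look like this. It only illustrates the shape of the records, not how the prototype's geohash.storage.UnQlite layer is implemented.

import pickle
from unqlite import UnQLite

db = UnQLite("index.unqlite")  # hypothetical database file
# NetCDF case: the file name plus the first/last indexes covering the cell
db["b7z"] = pickle.dumps([("pass_001.nc", (1200, 1534))])
# Zarr case: only the indexes are needed
db["b7y"] = pickle.dumps([(9876, 10234)])
db.commit()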
Example of use
import collections
from typing import Dict, List, Tuple

import numpy as np

import geohash


def get_geohash_from_file(
    paths: List[str], precision: int
) -> Dict[bytes, List[Tuple[str, Tuple[int, int]]]]:
    """Creation of an associative dictionary between GeoHash codes and the
    file name, start and last indexes of the data selected for that code."""
    hashs = collections.defaultdict(list)
    for path in paths:
        # lon/lat positions read from the file (helper defined elsewhere)
        points = read_positions_satellite(path)
        # Calculate geohash codes for each position read
        idents = geohash.string.encode(points, precision)
        # Calculate the indices where the geohash codes change
        indexes = np.where(idents[:-1] != idents[1:])[0] + 1
        # First and last index of each box (the last box runs to the end
        # of the file)
        starts = np.insert(indexes, 0, 0)
        ends = np.append(indexes, idents.size)
        # Finally create a map between geohash code and first/last index
        for code, coordinates in zip(idents[starts], zip(starts, ends)):
            hashs[code].append((path, coordinates))
    return hashs
# Create the index, using UnQLite as the storage backend
# (UNQLITE is the path of the database file)
index = geohash.index.init_geohash(
    geohash.storage.UnQlite(UNQLITE, mode="w"),
    precision=3,
    synchronizer=None)
# Index the NetCDF files and feed the result into the index
data = get_geohash_from_file(path_to_nc_file, index.precision)
index.update(data)
In terms of search performance, a query on one bounding box takes a few milliseconds. This depends directly on the performance of the chosen storage engine.
len(index.query_box(
    geohash.Box(geohash.Point(-135, 0),
                geohash.Point(-90, 45))))
What do you think of this approach? Good? Bad?