Large GeoSpatial Benchmarks: First Pass

mrocklin · October 22, 2024, 3:08pm

Last month we asked for TiB-scale geo workloads to form a benchmark suite, and we got a strong response. Since then we’ve built these out into a public suite.

This post goes over what’s implemented and some early results.

mrocklin · October 22, 2024, 3:18pm

Small fixes already coming out of this:

github.com/pydata/xarray

Reduce graph size through writing indexes directly into graph for ``map_blocks``

pydata:main ← phofl:graph-size

opened 03:09PM - 22 Oct 24 UTC

phofl

+33 -10

- [ ] Closes #xxxx
- [x] Tests added

When looking at the ``map_blocks`` variant of the benchmark [here](https://github.com/coiled/benchmarks/blob/main/tests/geospatial/test_climatology.py), I noticed that we had a 30 MiB graph for the medium variant. 10 MiB of that was just the PandasIndexes being added repeatedly as an argument for ``map_blocks``. Writing them directly to the graph will de-duplicate the value, so the object is stored once instead of many times. We can then reference the key in the function arguments. Tokenizing adds some overhead, so there is a drawback to this. Happy to open an issue if required.

Is this something that you all would consider merging? cc @dcherian
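To make the de-duplication idea in the PR description concrete, here is a minimal pure-Python sketch. It is not the xarray or Dask implementation: the dict-based graph, the `evaluate` helper, the `"index-token"` key, and the list standing in for a PandasIndex are all assumptions made for illustration. It compares embedding a large object in every task against storing it once under its own key and referencing that key.

```python
import operator
import pickle

# Hypothetical stand-in for a large index object that every block needs.
index = list(range(10_000))
n_blocks = 50

# Variant A: the index is embedded as a literal argument in every task.
embedded = {
    f"block-{i}": (operator.getitem, index, i) for i in range(n_blocks)
}

# Variant B: the index is written into the graph once under its own key
# (hard-coded here; a real system would likely derive it from the contents),
# and each task refers to that key instead of carrying the object.
deduplicated = {"index-token": index}
deduplicated.update(
    {f"block-{i}": (operator.getitem, "index-token", i) for i in range(n_blocks)}
)


def evaluate(graph, key):
    """Tiny illustrative evaluator: resolve string arguments that are graph keys."""
    task = graph[key]
    if isinstance(task, tuple) and task and callable(task[0]):
        func, *args = task
        resolved = [
            evaluate(graph, a) if isinstance(a, str) and a in graph else a
            for a in args
        ]
        return func(*resolved)
    return task  # plain data stored directly in the graph


def serialized_size(graph):
    """Sum of per-entry pickle sizes, a rough proxy for graph size."""
    return sum(len(pickle.dumps(value)) for value in graph.values())


size_embedded = serialized_size(embedded)
size_deduplicated = serialized_size(deduplicated)
print(f"embedded: {size_embedded} bytes, de-duplicated: {size_deduplicated} bytes")
```

Both graphs compute the same results, but the second serializes the large object only once. The sketch skips the cost the PR mentions: deriving a key from the object's contents (tokenizing) is extra work that a hard-coded key avoids.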
