Last month we asked for TiB-scale geo workloads to form a benchmark suite, and we got a strong response. Since then we’ve built these out into a public suite.
This post goes over what’s implemented and the early results.
Summary: We implement several large-scale geo benchmarks. Most break. Fun! This article describes those benchmarks, what they attempt, how they break, and the technical work necessary to make them ...
Small fixes already coming out of this:
pydata:main ← phofl:graph-size (opened 03:09PM - 22 Oct 24 UTC)
- [ ] Closes #xxxx
- [x] Tests added
When looking at the ``map_blocks`` variant of the benchmark [here](https://github.com/coiled/benchmarks/blob/main/tests/geospatial/test_climatology.py), I noticed that we had a 30 MiB graph for the medium variant. 10 MiB of that was just the PandasIndexes being added repeatedly as an argument for ``map_blocks``. Writing them directly to the graph de-duplicates the value, so the object is stored once instead of many times. We can then reference its key in the function arguments.
The tokenize step adds some overhead, so there is a drawback to this.
Happy to open an issue if required.
Is this something that you all would consider merging?
cc @dcherian
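
To make the de-duplication idea in the PR description concrete, here is a minimal sketch in plain Python. It is not the PR's code and does not use Dask or Xarray: the graph is a plain dict, `apply_block`, `FUNCS`, `serialized_size`, and `compute` are hypothetical helpers, and the list `index` stands in for a large index object. It also assumes that each graph entry is serialized on its own, which is why an embedded object gets written out once per task.

```python
import hashlib
import pickle

# Hypothetical stand-in for a large index object passed to every block.
index = list(range(10_000))

def apply_block(block_id, idx):
    # Placeholder for the per-block function; not taken from the PR.
    return block_id + len(idx)

# Callables are looked up by name so that every task pickles cleanly.
FUNCS = {"apply_block": apply_block}
N_BLOCKS = 100

# Variant 1: the index is embedded as a literal argument in every task.
inline_graph = {
    ("block", i): ("apply_block", i, index) for i in range(N_BLOCKS)
}

# Variant 2: the index is stored once under a content-derived key.
# Computing the key means hashing the whole object, which is the overhead.
index_key = "index-" + hashlib.sha256(pickle.dumps(index)).hexdigest()
keyed_graph = {index_key: index}
keyed_graph.update(
    {("block", i): ("apply_block", i, index_key) for i in range(N_BLOCKS)}
)

def serialized_size(graph):
    # Assumption: each graph entry is serialized on its own, so a shared
    # object is written out once per task that embeds it.
    return sum(len(pickle.dumps(value)) for value in graph.values())

def compute(graph, key):
    name, *args = graph[key]
    # Replace any string argument that names a graph key with its value.
    resolved = [
        graph[a] if isinstance(a, str) and a in graph else a for a in args
    ]
    return FUNCS[name](*resolved)

inline_size = serialized_size(inline_graph)
keyed_size = serialized_size(keyed_graph)
print(f"inline: {inline_size} bytes, keyed: {keyed_size} bytes")
```

Both graphs produce the same result for any block, but the keyed variant serializes the index once rather than once per task. The hashing step that derives `index_key` is the toy analogue of the tokenize overhead mentioned in the PR.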