Large Scale Geospatial Benchmarks

Hi All,

We’re looking to build out a collection of large-scale, end-to-end geospatial benchmarks to ensure that tools like Xarray, Dask, etc. operate smoothly up to the 100-TB scale. @mrocklin and I wrote down a few characteristics we think make a good benchmark based on our previous experience using TPC-H benchmarks to improve Dask DataFrame.
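To give a concrete flavor of the kind of workload we have in mind, here's a minimal sketch of an end-to-end Xarray + Dask run; the dataset shape, chunking, and reduction are illustrative placeholders, not anything from the actual benchmark suite.

```python
# Hedged sketch of one possible end-to-end geospatial benchmark step.
# A real benchmark would open a large cloud-hosted Zarr/NetCDF archive;
# here a synthetic "temperature" cube stands in for it.
import numpy as np
import xarray as xr
import dask.array as da

# Synthetic daily global field for 2020, chunked along time and space.
data = da.random.random((365, 1800, 3600), chunks=(30, 900, 900))
ds = xr.Dataset(
    {"t2m": (("time", "lat", "lon"), data)},
    coords={
        "time": np.arange("2020-01-01", "2020-12-31", dtype="datetime64[D]"),
        "lat": np.linspace(-90, 90, 1800),
        "lon": np.linspace(-180, 180, 3600),
    },
)

# End-to-end step: reduce to a monthly climatology, then persist the result.
monthly = ds.t2m.groupby("time.month").mean("time")
monthly.to_dataset(name="t2m_monthly").to_zarr("benchmark_output.zarr", mode="w")
```

Scaled up to the 100-TB range, a pattern like this stresses chunk scheduling, groupby reductions, and I/O throughput all at once, which is roughly the end-to-end coverage we're after.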

If folks here have thoughts on geospatial benchmarks they think would be a good fit, we’d love to collaborate. Please leave a comment in the GitHub discussion linked above.


Just wanted to say, I love the initiative.

I’ve been working on some pretty hefty time series workflows with geospatial data that fall squarely within what you’re targeting here, and I’d love to contribute to these benchmarks.
