Large-scale data processing benchmarks for Xarray-Beam

I released a new “Xarray-Beam” project on GitHub last week: https://github.com/google/xarray-beam (Distributed Xarray with Apache Beam).

The idea is to facilitate a different model for large-scale distributed analytics in the Cloud, building on Apache Beam as an alternative to Dask. I’m still working on documenting it, but hopefully the README gives a reasonable overview of the idea.

One thing I’d love to include is a handful of end-to-end examples of large-scale data processing that run out of the box on Google Cloud Dataflow. These should capture the flavor of important data-processing workflows for weather/climate data.

I’m currently thinking of two demos, based on Pangeo’s publicly available ERA5 surface dataset (17 data variables adding up to 25 TB in total); rough pipeline sketches for both follow the list:

  1. Rechunking, from “stack of images” to “time-series” format. (Xarray-Beam leverages Rechunker internally to figure out the optimal chunking scheme.)
  2. Calculating climatological averages over time, per hour of the day and per calendar month.
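
To give a flavor of what these pipelines look like, here’s a rough sketch of demo 1 using the basic Xarray-Beam building blocks (DatasetToChunks, Rechunk, ChunksToZarr). The bucket paths and chunk sizes are placeholders, not the exact benchmark settings:

```python
# Sketch of demo 1: rechunk from "stack of images" to "time-series" chunks.
# Bucket paths and chunk sizes below are illustrative placeholders.
import apache_beam as beam
import xarray
import xarray_beam as xbeam

source_ds = xarray.open_zarr(
    'gs://my-bucket/era5-surface.zarr',  # placeholder path
    chunks=None,  # let Xarray-Beam handle chunking, rather than Dask
)
source_chunks = {'time': 31, 'latitude': -1, 'longitude': -1}  # whole images
target_chunks = {'time': -1, 'latitude': 5, 'longitude': 5}  # full time-series

with beam.Pipeline() as pipeline:  # pass Dataflow options for a real run
    (
        pipeline
        | xbeam.DatasetToChunks(source_ds, source_chunks)
        # Rechunk wraps the Rechunker algorithm to pick intermediate chunks.
        | xbeam.Rechunk(
            source_ds.sizes, source_chunks, target_chunks, itemsize=4
        )
        | xbeam.ChunksToZarr(
            'gs://my-bucket/era5-rechunked.zarr',  # placeholder path
            zarr_chunks=target_chunks,
        )
    )
```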
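
And a sketch of demo 2, which re-keys each timestep by (month, hour) and then averages with Mean.PerKey. The re-keying helper here is my own simplified stand-in for what the full example does; paths are again placeholders:

```python
# Sketch of demo 2: climatological means per calendar month and hour of day.
import apache_beam as beam
import xarray
import xarray_beam as xbeam


def rekey_by_month_and_hour(key, chunk):
    # Swap the 'time' offset for (month, hour) offsets, so that Mean.PerKey
    # averages together all timesteps sharing the same month and hour.
    month = chunk.time.dt.month.item()
    hour = chunk.time.dt.hour.item()
    new_key = key.with_offsets(time=None, month=month - 1, hour=hour)
    new_chunk = chunk.squeeze('time', drop=True).expand_dims(
        month=[month], hour=[hour]
    )
    return new_key, new_chunk


source_ds = xarray.open_zarr(
    'gs://my-bucket/era5-surface.zarr', chunks=None  # placeholder path
)

with beam.Pipeline() as pipeline:
    (
        pipeline
        | xbeam.DatasetToChunks(source_ds, {'time': 1})  # one timestep per chunk
        | beam.MapTuple(rekey_by_month_and_hour)
        | xbeam.Mean.PerKey()
        | xbeam.ChunksToZarr('gs://my-bucket/era5-climatology.zarr')  # placeholder
    )
```

One nice property of writing it this way: Mean.PerKey is built on a Beam combiner, so the per-key averages can be aggregated incrementally rather than collecting all matching chunks in one place.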

I like ERA5 surface because it’s relatively high-resolution (and thus makes pretty pictures) and is also quite relatable – most humans have some understanding of surface weather!

My goal is to show how Xarray-Beam could be useful for solving problems the Pangeo community cares about. To that end, I would appreciate feedback and suggestions. For example, are there alternative benchmarking tasks and/or datasets that I should be considering instead? I would be particularly interested in cases where we could compare performance to Dask or another distributed computing engine.


There’s the Sentinel-2 public dataset on AWS, as another obvious use case. Or Landsat.

Google mentioned Apache Beam to me as a possibility, so I’d be interested in time/cost comparisons between it and other workflows.

@shoyer do you have any sense of pricing/cost for similar workflows executed via a “traditional” Dask cluster versus Beam? I think this could be a really useful set of metrics to feed into your comparisons with Dask or other distributed computing engines.

I’m on paternity leave for the next 2-3 weeks and would be happy to sprint on this with you if you have time.

@darothen I would love to get some comparisons to Dask for the same workload! I don’t have a clear answer for you on cost, but my expectation is that it should be in roughly the same ballpark, depending on lots of little details. I have two examples (climatology calculation and rechunking) on this ERA5 dataset worked out for Xarray-Beam that might be interesting to port to Dask: https://github.com/google/xarray-beam/tree/main/examples
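
For the sake of comparison, here’s a rough sketch of what the climatology example might look like with plain Xarray + Dask (the cluster setup and paths are placeholders, not something I’ve benchmarked):

```python
# Sketch of the climatology computed with xarray + dask.distributed.
# Cluster setup and store paths are placeholders.
import xarray
from dask.distributed import Client

client = Client()  # or a Dask cluster on GCP, e.g., via dask-gateway

ds = xarray.open_zarr(
    'gs://my-bucket/era5-surface.zarr',  # placeholder path
    consolidated=True,
)

# Average over all years, per (month, hour-of-day) combination.
climatology = ds.groupby('time.month').map(
    lambda month_ds: month_ds.groupby('time.hour').mean('time')
)
climatology.to_zarr('gs://my-bucket/era5-climatology.zarr')  # placeholder
```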

My general impression is that the Dask scheduler is much more “clever” than the Beam scheduler (or rather, than the schedulers behind Beam runners like Cloud Dataflow). Dask uses a lower-level representation of workflows (individual tasks), whereas Beam keeps things at a higher level (e.g., GroupByKey and Map operations). This means Dask has more opportunities for clever automatic optimizations, but also more opportunities for things to go wrong.