Large-scale data processing benchmarks for Xarray-Beam

I released a new “Xarray-Beam” project on GitHub last week: google/xarray-beam (Distributed Xarray with Apache Beam).

The idea is to facilitate a different model for large-scale distributed analytics in the Cloud, building on Apache Beam as an alternative to Dask. I’m still working on documenting it, but hopefully the README gives a reasonable overview of the idea.

One thing I’d love to include is a handful of end-to-end examples of large-scale data processing that run out of the box on Google Cloud Dataflow. These should capture the flavor of important data-processing workflows for working with weather/climate data.

I’m currently thinking of two demos, based on Pangeo’s publicly available ERA5-surface dataset (17 data variables adding up to 25 TB total):

  1. Rechunking, from “stack of images” to “time-series” format. (Xarray-Beam leverages Rechunker internally to figure out the optimal chunking scheme; a rough sketch of this pipeline follows the list.)
  2. Calculating climatological averages over time, per hour of the day and per calendar month.
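To make demo 1 concrete, here’s a hypothetical sketch of what the rechunking pipeline could look like. The paths and chunk sizes below are placeholders, and the exact API may shift while the project is young; the real scripts live in the repo’s examples/ directory.

```python
# Hypothetical sketch: rechunk ERA5-surface from "stack of images" to
# "time-series" chunks with Xarray-Beam. Paths are placeholders.
import apache_beam as beam
import xarray
import xarray_beam as xbeam

SOURCE_PATH = 'gs://my-bucket/era5-surface.zarr'      # placeholder
TARGET_PATH = 'gs://my-bucket/era5-time-series.zarr'  # placeholder

source_ds = xarray.open_zarr(SOURCE_PATH, chunks=None)
source_chunks = {'time': 31, 'latitude': 721, 'longitude': 1440}  # whole images
target_chunks = {'time': -1, 'latitude': 5, 'longitude': 5}       # full time series

# Lazy template with the same metadata as the source, but no real data.
template = xarray.zeros_like(source_ds.chunk())

with beam.Pipeline() as p:
    (
        p
        | xbeam.DatasetToChunks(source_ds, source_chunks)
        | xbeam.Rechunk(dict(source_ds.sizes), source_chunks, target_chunks,
                        itemsize=4)  # 4 bytes per float32 element
        | xbeam.ChunksToZarr(TARGET_PATH, template, target_chunks)
    )
```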

I like ERA5 surface because it’s relatively high-resolution (and thus makes pretty pictures) and is also quite relatable – most humans have some understanding of surface weather!

My goal is to show how Xarray-Beam could be useful for solving problems the Pangeo community cares about. Towards that end, I would appreciate feedback and suggestions. For example, are there alternative benchmarking tasks and/or datasets that I should be considering instead? I would be particularly interested in cases where we could compare performance to Dask or another distributed computing engine.

There’s the Sentinel-2 public dataset on AWS, as another obvious use case. Or Landsat.

Google mentioned Apache Beam to me as a possibility, so I’d be interested in time/cost comparisons between it and other workflows.

@shoyer do you have any sense of pricing/cost for similar workflows that might be executed via a “traditional” Dask cluster versus Beam? I think this might be a really useful set of metrics that could feed into your comparisons with Dask or other distributed computing engines.

I’m on paternity leave for the next 2-3 weeks and would be happy to sprint on this with you if you have time.

@darothen I would love to get some comparisons to Dask for the same workload! I don’t have a clear answer for you on cost, but my expectation is that it should be in roughly the same ballpark, depending on lots of little details. I have two examples (climatology calculation and rechunking) on this ERA5 dataset worked out for Xarray-Beam that might be interesting to port to Dask: see the examples directory in google/xarray-beam on GitHub.
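For reference, here’s a hypothetical sketch of what the climatology workload might look like in plain xarray + Dask; the paths and chunking are placeholders, and the real comparison would of course need matching chunk schemes on both sides.

```python
# Hypothetical Dask/xarray version of the climatology demo, for comparison.
import xarray

SOURCE_PATH = 'gs://my-bucket/era5-surface.zarr'  # placeholder

ds = xarray.open_zarr(SOURCE_PATH, chunks={'time': 31})

# Climatological mean per (calendar month, hour of day), averaged over years.
climatology = ds.groupby('time.month').map(
    lambda month_ds: month_ds.groupby('time.hour').mean('time')
)
climatology.to_zarr('gs://my-bucket/era5-climatology.zarr')  # placeholder
```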

My general impression is that the Dask scheduler is much more “clever” than the Beam scheduler (or rather, the schedulers behind Beam runners like Cloud Dataflow). Dask uses a lower-level representation of workflows (individual tasks), whereas Beam keeps things at a higher level (e.g., GroupByKey and Map). This means Dask has more opportunities for clever automatic optimizations, but also more opportunities for things to go wrong.
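To illustrate the difference in abstraction level, here’s a toy (non-Xarray) Beam pipeline: the runner only ever sees whole transforms like GroupByKey and Map, never the individual per-chunk tasks they expand into.

```python
import apache_beam as beam

def mean(key, values):
    values = list(values)  # grouped values arrive as an iterable
    return key, sum(values) / len(values)

with beam.Pipeline() as p:  # DirectRunner by default
    (
        p
        | beam.Create([('tmax', 30.1), ('tmax', 28.4), ('precip', 1.2)])
        | beam.GroupByKey()    # high-level shuffle primitive
        | beam.MapTuple(mean)  # element-wise transform
        | beam.Map(print)
    )
```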

There is growing interest in implementing a Dask Runner for Beam. This would make it a lot easier for people with existing Dask infrastructure to try Beam.

To kickstart the discussion of implementing a Dask Beam runner, I propose we meet during the week of June 13-17. I have created a When2Meet poll here: Dask Beam Runner Discussion - When2meet. If you are interested in attending, please fill in your availability. Hope to see many people there! :rocket:

I can’t make that week (conferences), but I would like to be involved in future meetings.

Thanks to all who replied! We have scheduled the call for Wednesday, June 15, at 1:30 pm ET. The Zoom link is Launch Meeting - Zoom.

Looking forward to the discussion!

Deepak, we will take notes and share them.