Large-scale data processing benchmarks for Xarray-Beam

I released a new “Xarray-Beam” project on GitHub last week: https://github.com/google/xarray-beam (Distributed Xarray with Apache Beam).

The idea is to facilitate a different model for large-scale distributed analytics in the Cloud, building on Apache Beam as an alternative to Dask. I’m still working on documenting it, but hopefully the README gives a reasonable overview of the idea.

One thing I’d love to include is a handful of end-to-end examples of large-scale data processing that run out of the box on Google Cloud Dataflow. These should capture the flavor of important data-processing workflows for weather/climate data.

I’m currently thinking of two demos, based on Pangeo’s publicly available ERA5 surface dataset (17 data variables adding up to 25 TB in total); rough pipeline sketches for both follow the list:

  1. Rechunking, from “stack of images” to “time-series” format. (Xarray-Beam leverages Rechunker internally to figure out the optimal chunking scheme.)
  2. Calculating climatological averages over time, per hour of the day and per calendar month.
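
To give a flavor of what these pipelines look like, here’s a rough sketch of demo 1 using the basic Xarray-Beam building blocks (DatasetToChunks, Rechunk, ChunksToZarr). The bucket paths and chunk sizes are placeholders, not the exact benchmark settings:

```python
# Sketch of demo 1: rechunk from "stack of images" to "time-series" chunks.
# Bucket paths and chunk sizes below are illustrative placeholders.
import apache_beam as beam
import xarray
import xarray_beam as xbeam

source_ds = xarray.open_zarr(
    'gs://my-bucket/era5-surface.zarr',  # placeholder path
    chunks=None,  # let Xarray-Beam handle chunking, rather than Dask
)
source_chunks = {'time': 31, 'latitude': -1, 'longitude': -1}  # whole images
target_chunks = {'time': -1, 'latitude': 5, 'longitude': 5}  # full time-series

with beam.Pipeline() as pipeline:  # pass Dataflow options for a real run
    (
        pipeline
        | xbeam.DatasetToChunks(source_ds, source_chunks)
        # Rechunk wraps the Rechunker algorithm to pick intermediate chunks.
        | xbeam.Rechunk(
            source_ds.sizes, source_chunks, target_chunks, itemsize=4
        )
        | xbeam.ChunksToZarr(
            'gs://my-bucket/era5-rechunked.zarr',  # placeholder path
            zarr_chunks=target_chunks,
        )
    )
```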
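
And a sketch of demo 2, which re-keys each timestep by (month, hour) and then averages with Mean.PerKey. The re-keying helper here is my own simplified stand-in for what the full example does; paths are again placeholders:

```python
# Sketch of demo 2: climatological means per calendar month and hour of day.
import apache_beam as beam
import xarray
import xarray_beam as xbeam


def rekey_by_month_and_hour(key, chunk):
    # Swap the 'time' offset for (month, hour) offsets, so that Mean.PerKey
    # averages together all timesteps sharing the same month and hour.
    month = chunk.time.dt.month.item()
    hour = chunk.time.dt.hour.item()
    new_key = key.with_offsets(time=None, month=month - 1, hour=hour)
    new_chunk = chunk.squeeze('time', drop=True).expand_dims(
        month=[month], hour=[hour]
    )
    return new_key, new_chunk


source_ds = xarray.open_zarr(
    'gs://my-bucket/era5-surface.zarr', chunks=None  # placeholder path
)

with beam.Pipeline() as pipeline:
    (
        pipeline
        | xbeam.DatasetToChunks(source_ds, {'time': 1})  # one timestep per chunk
        | beam.MapTuple(rekey_by_month_and_hour)
        | xbeam.Mean.PerKey()
        | xbeam.ChunksToZarr('gs://my-bucket/era5-climatology.zarr')  # placeholder
    )
```

One nice property of writing it this way: Mean.PerKey is built on a Beam combiner, so the per-key averages can be aggregated incrementally rather than collecting all matching chunks in one place.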

I like ERA5 surface because it’s relatively high-resolution (and thus makes pretty pictures) and is also quite relatable – most humans have some understanding of surface weather!

My goal is to show how Xarray-Beam could be useful for solving problems the Pangeo community cares about. To that end, I would appreciate feedback and suggestions. For example, are there alternative benchmarking tasks and/or datasets that I should be considering instead? I would be particularly interested in cases where we could compare performance to Dask or another distributed computing engine.


There’s the Sentinel-2 public dataset on AWS, as another obvious use case. Or Landsat.

Google mentioned Apache Beam to me as a possibility, so I’d be interested in time/cost comparisons between it and other workflows.

@shoyer do you have any sense of pricing/cost for similar workflows executed via a “traditional” Dask cluster versus Beam? I think this could be a really useful set of metrics to feed into your comparisons with Dask or other distributed computing engines.

I’m on paternity leave for the next 2-3 weeks and would be happy to sprint on this with you if you have time.

@darothen I would love to get some comparisons to Dask for the same workload! I don’t have a clear answer for you on cost, but my expectation is that it should be in roughly the same ballpark, depending on lots of little details. I have two examples (climatology calculation and rechunking) on this ERA5 dataset worked out for Xarray-Beam that might be interesting to port to Dask: https://github.com/google/xarray-beam/tree/main/examples
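
For the sake of comparison, here’s a rough sketch of what the climatology example might look like with plain Xarray + Dask (the cluster setup and paths are placeholders, not something I’ve benchmarked):

```python
# Sketch of the climatology computed with xarray + dask.distributed.
# Cluster setup and store paths are placeholders.
import xarray
from dask.distributed import Client

client = Client()  # or a Dask cluster on GCP, e.g., via dask-gateway

ds = xarray.open_zarr(
    'gs://my-bucket/era5-surface.zarr',  # placeholder path
    consolidated=True,
)

# Average over all years, per (month, hour-of-day) combination.
climatology = ds.groupby('time.month').map(
    lambda month_ds: month_ds.groupby('time.hour').mean('time')
)
climatology.to_zarr('gs://my-bucket/era5-climatology.zarr')  # placeholder
```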

My general impression is that the Dask scheduler is much more “clever” than the Beam scheduler (or rather, than the schedulers behind Beam runners like Cloud Dataflow). Dask uses a lower-level representation of workflows (individual tasks), whereas Beam keeps things at a higher level (e.g., GroupByKey and Map operations). This means Dask has more opportunities for clever automatic optimizations, but also more opportunities for things to go wrong.