I released a new “Xarray-Beam” project on GitHub last week: [google/xarray-beam](https://github.com/google/xarray-beam), distributed Xarray with Apache Beam.
The idea is to facilitate a different model for large-scale distributed analytics in the Cloud, building on Apache Beam as an alternative to Dask. I’m still working on documenting it, but hopefully the README gives a reasonable overview of the idea.
One thing I’d love to include is a handful of end-to-end examples of large-scale data processing that run out of the box on Google Cloud Dataflow. These should capture the flavor of important data-processing workflows for weather/climate data.
I’m currently thinking of two demos, based on Pangeo’s publicly available ERA5-surface dataset (17 data variables, 25 TB in total):
- Rechunking, from “stack of images” to “time-series” format. (Xarray-Beam leverages Rechunker internally to figure out the optimal chunking scheme.)
- Calculating climatological averages over time, per hour of the day and per calendar month.
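To make the second demo concrete, here is a minimal, standard-library-only sketch of the climatology computation on a single synthetic time series. The real pipeline would apply this per grid cell across the full 25 TB dataset in parallel; the synthetic “temperature” series and all names below are purely illustrative, not Xarray-Beam’s API.

```python
# Sketch of demo 2: climatological averages per (calendar month, hour of
# day). Plain-Python stand-in for what the distributed pipeline would
# compute; the synthetic temperature series is hypothetical.
from collections import defaultdict
from datetime import datetime, timedelta
from statistics import fmean

# Synthetic hourly time series for one grid point (2020, a leap year).
start = datetime(2020, 1, 1)
times = [start + timedelta(hours=h) for h in range(366 * 24)]
temps = [15.0 + 10.0 * ((t.month - 6.5) / 5.5) + 0.3 * (t.hour - 12)
         for t in times]

# Group by (month, hour) -- the climatology keys -- then average.
groups = defaultdict(list)
for t, temp in zip(times, temps):
    groups[(t.month, t.hour)].append(temp)

climatology = {key: fmean(vals) for key, vals in groups.items()}
print(len(climatology))  # 12 months x 24 hours = 288 keys
```

In a distributed setting, the same computation maps naturally onto a group-by-key followed by a per-key mean, which is exactly the shape of operation Beam is designed to scale.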
I like ERA5 surface because it’s relatively high-resolution (and thus makes pretty pictures) and is also quite relatable – most humans have some understanding of surface weather!
My goal is to show how Xarray-Beam could be useful for solving problems the Pangeo community cares about, so I would appreciate feedback and suggestions. For example, are there alternative benchmarking tasks and/or datasets I should be considering instead? I would be particularly interested in cases where we could compare performance against Dask or another distributed computing engine.