Cloud Example: 3hr Precip Frequency Distribution

I know many people are curious about the Pangeo cloud-based environment and what a real hackathon project might look like there. We are working on a comprehensive contributor guide, which will cover how to structure your project, best practices for working with data in the cloud, repo templates, etc. However, a few details regarding the data catalog still need to be worked out, and we aren’t quite ready to release this guide yet.

In the meantime, to whet your appetite, I have created a bare-bones demo notebook of a semi-realistic workflow.

The calculation was inspired by @apendergrass’s work on precipitation statistics (e.g. this paper or this website).

This example includes:

  • Searching the data catalog for all available models (technically source_ids) with 3-hourly precip data for both the historical and ssp585 experiments. (Only four at this point.)
  • Calculating the zonal-mean precipitation histograms with the xhistogram package, using dask to parallelize and speed up the calculation.
  • Visualizing the changes under a global warming scenario.
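
For readers who want to see the shape of the histogram step without opening the notebook, here is a minimal sketch. It is not the notebook's code: it uses plain NumPy as a stand-in for xhistogram and dask, it runs on small synthetic arrays rather than CMIP6 output, and the function name zonal_histogram, the array sizes, and the bin edges are all made up for illustration. The idea is simply to pool the time and longitude axes and count values per latitude band, then compare the two experiments.

```python
import numpy as np


def zonal_histogram(precip, bins):
    """Count precip values per latitude band, pooling time and longitude.

    precip: array shaped (time, lat, lon); bins: 1-D array of bin edges.
    Returns integer counts shaped (lat, len(bins) - 1).
    Hypothetical helper for illustration; the notebook uses xhistogram.
    """
    n_lat = precip.shape[1]
    counts = np.empty((n_lat, len(bins) - 1), dtype=np.int64)
    for j in range(n_lat):
        # np.histogram flattens its input, so time and lon are pooled here.
        counts[j], _ = np.histogram(precip[:, j, :], bins=bins)
    return counts


# Synthetic stand-in data (NOT model output): 8 time steps, 4 lats, 16 lons.
rng = np.random.default_rng(0)
shape = (8, 4, 16)
historical = rng.gamma(shape=0.5, scale=2.0, size=shape)
future = rng.gamma(shape=0.5, scale=2.4, size=shape)

# Log-spaced bin edges (arbitrary choice); values outside them are not counted.
bins = np.logspace(-3, 3, 25)

hist_counts = zonal_histogram(historical, bins)
future_counts = zonal_histogram(future, bins)

# Convert counts to frequencies per latitude band, then take the difference.
n_samples = shape[0] * shape[2]
change = future_counts / n_samples - hist_counts / n_samples

print(change.shape)
```

With real data, the per-latitude loop is what a call along the lines of xhistogram's histogram(da, bins=[bins], dim=["time", "lon"]) would replace, and dask would handle the chunked, parallel evaluation. Treat that call as my reading of the package's interface rather than a quote from the notebook.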

The results for one model look something like this:

I don’t have enough expertise on this topic to know whether this is a scientifically interesting calculation, but it makes a decent demo. In particular, it shows how easy it is to work with very high-frequency 3-hourly data in the cloud environment. The whole calculation takes just a couple of minutes.

This example is available as a binder, so you can try it yourself.

I hope this demo helps clarify the sort of workflow we will be using for the hackathon projects.


Very helpful notebook! It has helped me run CMIP6 analyses on my own computer.

I’m new to Dask and Pangeo and am working on setting these up with the Pangeo Cloud computing resources. Could you explain the difference between the approach used here, which sets up a cluster with KubeCluster and cluster.adapt(), and the approach in the Pangeo demo notebooks, which use Gateway and cluster.scale() instead?