EarthCube Annual Meeting call for abstracts (due Apr. 15)

We also submitted a notebook abstract, posting it here for reference:

Multi-Cloud workflows with Pangeo and Dask Gateway

Tom Augspurger (Anaconda), Martin Durant (Anaconda), Ryan Abernathey (Columbia University / Lamont Doherty Earth Observatory)

As more analysis-ready datasets are provided on the cloud, we need to consider how researchers access data. To maximize performance and minimize costs, we move the analysis to the data. This notebook demonstrates a Pangeo deployment connected to multiple Dask Gateways to enable analysis, regardless of where the data is stored.

Public clouds are partitioned into regions: geographic locations, each with a cluster of data centers. A dataset like the National Water Model Short-Range Forecast is provided in a single region of some cloud provider (e.g. AWS’s us-east-1).

To analyze that dataset efficiently, we run the analysis in the same region where the dataset is stored. This matters most for very large datasets, where making local “dark replicas” is slow and expensive.

In this notebook we demonstrate a few open source tools for computing “close” to cloud data. We use Intake as a data catalog to discover the datasets we have available and load them as xarray Datasets. With xarray, we write the transformations, filtering, and reductions that compose our analysis. To process the large amounts of data in parallel, we use Dask.
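A minimal sketch of that workflow, assuming a hypothetical catalog URL, entry name, and variable names (the actual notebook uses its own catalog):

```python
import intake

# Open an Intake catalog describing the analysis-ready datasets we have available.
# The catalog URL, entry name, and variable names below are illustrative placeholders.
cat = intake.open_catalog("https://example.com/catalogs/master.yaml")
ds = cat["national_water_model_short_range"].to_dask()  # an xarray.Dataset backed by Dask arrays

# Express the analysis with xarray; nothing is computed yet.
daily_max = ds["streamflow"].resample(time="1D").max()
summary = daily_max.mean(dim="feature_id")

# Dask evaluates the whole task graph in parallel when we ask for the result.
summary = summary.compute()
```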

Behind the scenes, we’ve configured this Pangeo deployment with multiple Dask Gateways, each providing a secure, multi-tenant server for managing Dask clusters. Each Gateway is provisioned with the necessary permissions to access the data.
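Connecting to one of these Gateways from the notebook looks roughly like the sketch below; the Gateway address and authentication method are assumptions, not the actual deployment values.

```python
from dask_gateway import Gateway

# Connect to the Gateway deployed in the same region as the target dataset.
# The address is a placeholder; auth here assumes JupyterHub API tokens.
gateway = Gateway(
    address="https://us-east-1.gateway.example.org",
    auth="jupyterhub",
)

cluster = gateway.new_cluster()  # launch a Dask cluster behind that Gateway
cluster.scale(20)                # request 20 workers in that region
client = cluster.get_client()    # subsequent Dask computations run next to the data
```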

By placing compute (the Dask workers) in the same region as the dataset, we get the best performance: the worker machines are physically close to the machines storing the data and have the highest bandwidth to them. We minimize cost by avoiding egress fees: charges billed to the data provider when data leaves a cloud region.

We hope this notebook demonstrates a setup for efficiently analyzing large analysis-ready datasets on the cloud, regardless of where the dataset lives.


We submitted a notebook abstract as well. Here’s our submission:

Intake / Pangeo Catalog: Making It Easier To Consume Earth’s Climate and Weather Data

Anderson Banihirwe (National Center for Atmospheric Research), Charles Blackmon-Luca (Columbia University / Lamont Doherty Earth Observatory), Ryan Abernathey (Columbia University / Lamont Doherty Earth Observatory), Joe Hamman (National Center for Atmospheric Research)

Computer simulations of the Earth’s climate and weather generate huge amounts of data. These data are often persisted on HPC systems or in the cloud across multiple data assets in a variety of formats (netCDF, Zarr, etc.). Finding, investigating, and loading these data assets into compute-ready data containers costs time and effort. A data user needs to know which datasets are available and the attributes describing each one before loading a specific dataset and analyzing it.

In this notebook, we demonstrate the integration of data discovery tools such as intake and intake-esm (an intake plugin) with data stored in cloud-optimized formats (Zarr). We highlight (1) how these tools provide transparent access to local and remote catalogs and data, and (2) the API for exploring arbitrary metadata associated with the data and loading datasets into data array containers.
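As a rough illustration of that API (the catalog URL and search facets below are examples of our own choosing, not necessarily the ones used in the notebook):

```python
import intake  # intake-esm registers the open_esm_datastore driver

# Open an ESM collection: a JSON description plus a table of data assets.
col = intake.open_esm_datastore(
    "https://storage.googleapis.com/cmip6/pangeo-cmip6.json"
)

# Explore the metadata: each row of the underlying DataFrame describes one asset.
print(col.df.columns.tolist())

# Search on arbitrary metadata facets...
subset = col.search(experiment_id="historical", variable_id="tas", table_id="Amon")

# ...and load the matching Zarr stores into a dictionary of xarray Datasets.
dsets = subset.to_dataset_dict(zarr_kwargs={"consolidated": True})
```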

We also showcase the Pangeo catalog, an open source project to enumerate and organize cloud-optimized climate data stored across a variety of providers, and a place where several intake-esm collections are now publicly available. We use one of these public collections as an example to show how an end user would explore and interact with the data, and conclude with a short overview of the catalog’s online presence.
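For reference, browsing the catalog from Python looks roughly like this sketch; the master-catalog URL and sub-catalog name are our best guesses and may not match the catalog’s current layout.

```python
import intake

# Open the top-level Pangeo catalog, which nests the individual collections.
cat = intake.open_catalog(
    "https://raw.githubusercontent.com/pangeo-data/pangeo-datastore"
    "/master/intake-catalogs/master.yaml"
)

print(list(cat))        # sub-catalogs (names here are illustrative)
sub = cat["climate"]    # open one sub-catalog
print(list(sub))        # the collections it contains
```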
