EarthCube Annual Meeting call for abstracts (due Apr. 15)

We also submitted a notebook abstract, posting it here for reference:

Multi-Cloud workflows with Pangeo and Dask Gateway

Tom Augspurger (Anaconda), Martin Durant (Anaconda), Ryan Abernathey (Columbia University / Lamont Doherty Earth Observatory)

As more analysis-ready datasets are provided on the cloud, we need to consider how researchers access data. To maximize performance and minimize costs, we move the analysis to the data. This notebook demonstrates a Pangeo deployment connected to multiple Dask Gateways to enable analysis, regardless of where the data is stored.

Public clouds are partitioned into regions: geographic locations, each with a cluster of data centers. A dataset like the National Water Model Short-Range Forecast is provided in a single region of some cloud provider (e.g. AWS’s us-east-1).

To analyze that dataset efficiently, we run the analysis in the same region where the dataset is stored. This matters most for very large datasets, where making local “dark replicas” is slow and expensive.

In this notebook we demonstrate a few open source tools for computing “close” to cloud data. We use Intake as a data catalog to discover the datasets we have available and load them as xarray Datasets. With xarray, we write the transformations, filtering, and reductions that compose our analysis. To process the large amounts of data in parallel, we use Dask.
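A minimal sketch of that workflow, assuming a hypothetical catalog URL, entry name, and variable names (the actual notebook uses its own catalog):

```python
import intake

# Open an Intake catalog describing the analysis-ready datasets we have available.
# The catalog URL, entry name, and variable names below are illustrative placeholders.
cat = intake.open_catalog("https://example.com/catalogs/master.yaml")
ds = cat["national_water_model_short_range"].to_dask()  # an xarray.Dataset backed by Dask arrays

# Express the analysis with xarray; nothing is computed yet.
daily_max = ds["streamflow"].resample(time="1D").max()
summary = daily_max.mean(dim="feature_id")

# Dask evaluates the whole task graph in parallel when we ask for the result.
summary = summary.compute()
```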

Behind the scenes, we’ve configured this Pangeo deployment with multiple Dask Gateways, each providing a secure, multi-tenant server for managing Dask clusters. Each Gateway is provisioned with the necessary permissions to access the data.
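Connecting to one of these Gateways from the notebook looks roughly like the sketch below; the Gateway address and authentication method are assumptions, not the actual deployment values.

```python
from dask_gateway import Gateway

# Connect to the Gateway deployed in the same region as the target dataset.
# The address is a placeholder; auth here assumes JupyterHub API tokens.
gateway = Gateway(
    address="https://us-east-1.gateway.example.org",
    auth="jupyterhub",
)

cluster = gateway.new_cluster()  # launch a Dask cluster behind that Gateway
cluster.scale(20)                # request 20 workers in that region
client = cluster.get_client()    # subsequent Dask computations run next to the data
```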

By placing compute (the Dask workers) in the same region as the dataset, we get the best performance: the worker machines are physically close to the machines storing the data and have the highest bandwidth to them. We minimize cost by avoiding egress fees: charges billed to the data provider when data leaves a cloud region.

We hope this notebook demonstrates a setup for efficiently analyzing large analysis-ready datasets on the cloud, regardless of where the dataset lives.


We submitted a notebook abstract as well. Here’s our submission:

Intake / Pangeo Catalog: Making It Easier To Consume Earth’s Climate and Weather Data

Anderson Banihirwe (National Center for Atmospheric Research), Charles Blackmon-Luca (Columbia University / Lamont Doherty Earth Observatory), Ryan Abernathey (Columbia University / Lamont Doherty Earth Observatory), Joe Hamman (National Center for Atmospheric Research)

Computer simulations of the Earth’s climate and weather generate huge amounts of data. These data are often persisted on HPC systems or in the cloud across multiple data assets in a variety of formats (netCDF, Zarr, etc.). Finding, investigating, and loading these data assets into compute-ready data containers costs time and effort. A data user needs to know which datasets are available and the attributes describing each one before loading a specific dataset and analyzing it.

In this notebook, we demonstrate the integration of data discovery tools such as intake and intake-esm (an intake plugin) with data stored in cloud-optimized formats (Zarr). We highlight (1) how these tools provide transparent access to local and remote catalogs and data, and (2) the API for exploring arbitrary metadata associated with the data and loading datasets into data array containers.
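As a rough illustration of that API (the catalog URL and search facets below are examples of our own choosing, not necessarily the ones used in the notebook):

```python
import intake  # intake-esm registers the open_esm_datastore driver

# Open an ESM collection: a JSON description plus a table of data assets.
col = intake.open_esm_datastore(
    "https://storage.googleapis.com/cmip6/pangeo-cmip6.json"
)

# Explore the metadata: each row of the underlying DataFrame describes one asset.
print(col.df.columns.tolist())

# Search on arbitrary metadata facets...
subset = col.search(experiment_id="historical", variable_id="tas", table_id="Amon")

# ...and load the matching Zarr stores into a dictionary of xarray Datasets.
dsets = subset.to_dataset_dict(zarr_kwargs={"consolidated": True})
```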

We also showcase the Pangeo catalog, an open source project to enumerate and organize cloud-optimized climate data stored across a variety of providers, and a place where several intake-esm collections are now publicly available. We use one of these public collections as an example to show how an end user would explore and interact with the data, and conclude with a short overview of the catalog’s online presence.
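For reference, browsing the catalog from Python looks roughly like this sketch; the master-catalog URL and sub-catalog name are our best guesses and may not match the catalog’s current layout.

```python
import intake

# Open the top-level Pangeo catalog, which nests the individual collections.
cat = intake.open_catalog(
    "https://raw.githubusercontent.com/pangeo-data/pangeo-datastore"
    "/master/intake-catalogs/master.yaml"
)

print(list(cat))        # sub-catalogs (names here are illustrative)
sub = cat["climate"]    # open one sub-catalog
print(list(sub))        # the collections it contains
```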
