Using ocean.pangeo.io for the CMIP6 Hackathon

All hackathon participants are invited to use ocean.pangeo.io for their project work. This is a JupyterHub environment running in Google Cloud with direct access to the CMIP6 cloud data.

Logging In

Anyone is technically able to access ocean.pangeo.io. You just need a few free accounts. Your login is based on your ORCID:

ORCID provides a persistent digital identifier that distinguishes you from every other researcher and, through integration in key research workflows such as manuscript and grant submission, supports automated linkages between you and your professional activities.

We use your ORCID as your username to identify you on the cluster. You will also need a Globus account that has been linked to your ORCID account. Globus is the actual identity provider for the cluster.

When you first point your browser to ocean.pangeo.io, you will be greeted by a login page.

Click the “Sign in with Globus” button. You will then be taken to the Globus sign-in page.

You should click “Sign in with ORCID iD” and finish going through the subsequent steps to link your ORCID and Globus accounts. Eventually, you will be redirected back to ocean.pangeo.io, where you will land on the server options page.

On this page, you choose the options (software and hardware environment) for your own private virtual machine in the cloud. Choose the “CMIP6 Hackathon Participants” option and click “Spawn.” Now your server will begin to boot up. This might be very fast (~15 seconds) or, depending on the cluster load, could take a few minutes. While the server is booting, you’ll see a progress page.

If you want more details about what is happening, click “Event Log.” If this times out or produces an error, just refresh the page and try again.

Once this finishes, you will be placed into a JupyterLab session. For more information about the JupyterLab environment, consult the JupyterLab Docs.

To shut down your server, use the “Hub Control Panel” from the “File” menu.

Your session will automatically shut down after a period of inactivity.

Your Home Directory

The cloud environment differs from what many HPC users expect in some ways. You are not on a shared server; you are on your own private server. Your username is jovyan, and your home directory is /home/jovyan. This is the same for all users.

Your home directory is intended only for notebooks, analysis scripts, and small (< 1GB) datasets. It is not an appropriate place to store large datasets. No one else can see or access the files in your home directory.

The easiest way to move files in and out of your home directory is via the JupyterLab web interface. Drag a file into the file browser to upload, and right-click to download back out. You can also open a terminal via the JupyterLab launcher and use this to ssh / scp / ftp to remote systems. However, you can’t ssh in!
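
For example, to pull a file from a remote machine into your home directory (the hostname and paths here are hypothetical):

$ scp username@remote.host.edu:/path/to/data.nc ~/data.nc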

The recommended way to move code in and out is via git. You should clone your project repo from the ocean.pangeo.io terminal and use git pull / git push to update and push changes.
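
A minimal sketch of that workflow, with hypothetical organization and repository names:

$ git clone https://github.com/<your_org>/<your_project>.git   # bring your project onto ocean
$ cd <your_project>
$ git pull   # fetch and merge changes from GitHub
$ git push   # publish your local commits back to GitHub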

Managing SSH Keys

If you have two-factor authentication enabled on your GitHub account, you will probably want to place an SSH key in your home directory to facilitate easy pushes. Read the docs on Connecting to GitHub with SSH for more info. We recommend creating a new key just for this purpose and protecting it with a passphrase. You then add the public key to your GitHub profile at https://github.com/settings/keys.
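
If you need to generate such a key, you can do so from the ocean terminal (the output filename below is just an example):

$ ssh-keygen -t rsa -b 4096 -C "your_email@example.com" -f ~/.ssh/id_rsa_ocean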

To get the key to work on ocean, place it in the /home/jovyan/.ssh/ directory. Then run

$ ssh-agent bash                     # start a new shell with an SSH agent running
$ ssh-add ~/.ssh/<name_of_rsa_key>   # register your key; you will be prompted for its passphrase

Managing Packages

The software environment on ocean.pangeo.io is configured via this GitHub repository:

https://github.com/pangeo-data/pangeo-cloud-federation/tree/staging/deployments/ocean/image/binder

A large number of ocean, weather, and climate-related packages have already been installed. If you wish to suggest an update to this configuration, you can make a pull request to modify the file

https://github.com/pangeo-data/pangeo-cloud-federation/blob/staging/deployments/ocean/image/binder/environment.yml

You can also install extra packages in your own environment using pip or conda (see the example after this list). However, there are some limitations to this:

  • Any installations you make will disappear when your server shuts down. (This helps prevent you from permanently breaking your environment.)
  • The new packages are not installed on Dask workers spawned from your server. (This can possibly be fixed by passing special arguments to KubeCluster.)
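
For example, to install a package for the current session (the package name here is hypothetical):

$ pip install some-extra-package
# or, equivalently, with conda
$ conda install -y some-extra-package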

Working with Datasets

One challenge when working on the cloud is that, unlike with HPC, we don’t have access to a big, shared filesystem. Instead, we have access to a big, shared object store. On Google Cloud, this is called Google Cloud Storage (GCS). While in principle these do the same thing (allow many people to access the same data with high bandwidth), there are some big practical differences. For climate science, the most important difference is that traditional netCDF files can’t be read efficiently straight from GCS the way they can from a filesystem.

To solve this problem, on the cloud we are using a new storage format called Zarr. We have written about this extensively on the Pangeo website. The existing Zarr datasets, including CMIP6, are described on the Pangeo Cloud Datastore website. Zarr datasets can be read by Xarray and look effectively identical to netCDF datasets. With this small tweak, we achieve excellent performance with GCS.
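
As a minimal sketch, a Zarr store on GCS can be opened with Xarray plus gcsfs; the store path below is a placeholder, not a real dataset:

import xarray as xr
import gcsfs

# anonymous access to a public bucket; the path is hypothetical
fs = gcsfs.GCSFileSystem(token='anon')
mapper = fs.get_mapper('pangeo-cmip6/path/to/some/store.zarr')
ds = xr.open_zarr(mapper, consolidated=True)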

CMIP6 Data

A listing of all the CMIP6 data in Google Cloud is available at

https://storage.googleapis.com/pangeo-cmip6/pangeo-cmip6-zarr-consolidated-stores.csv

You can open this catalog directly as a pandas DataFrame by running:

import pandas as pd
cat_df = pd.read_csv('https://storage.googleapis.com/pangeo-cmip6/pangeo-cmip6-zarr-consolidated-stores.csv')
cat_df.head()
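
You can then filter the catalog with ordinary pandas operations. For example, assuming the CSV includes experiment_id and variable_id columns:

# select historical-experiment dissolved-oxygen stores
subset = cat_df[(cat_df.experiment_id == 'historical') & (cat_df.variable_id == 'o2')]
subset.head()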

There is also an interactive catalog browser at https://pangeo-data.github.io/pangeo-datastore/cmip6_pangeo.html.

Here is an example of using intake-esm to search and load the data:

import intake

# open ESMCol catalog
cat_url = "https://raw.githubusercontent.com/NCAR/intake-esm-datastore/master/catalogs/pangeo-cmip6.json"
col = intake.open_esm_datastore(cat_url)

# search and display data
cat = col.search(experiment_id=['historical', 'ssp585'], table_id='Oyr', variable_id='o2',
                 grid_label='gn')
cat.df

# load to a dictionary of xarray Datasets - may take a minute
dset_dict = cat.to_dataset_dict(zarr_kwargs={'consolidated': True})
list(dset_dict.keys())

# look at a particular dataset
ds = dset_dict['CMIP.CCCma.CanESM5.historical.Oyr.gn']
ds

Other Datasets

If you wish to access data that are not already in Zarr on GCS, you have several options:

  • Access the data via OPeNDAP, e.g. via a THREDDS or ERDDAP server. This is the recommended option for datasets from established providers like NOAA and NASA. OPeNDAP links can be opened directly with Xarray (see the first sketch after this list).
  • Download the netCDF files into your home directory. This is only appropriate for small data (e.g. < 1GB). Such data can’t be accessed from Dask workers.
  • Create a Zarr copy of the dataset and place it in your own cloud-storage bucket (see the second sketch after this list). Google will give you a $300 free credit for signing up for a cloud account. This is enough to store 1 TB of data for one year in GCS.
  • Add the dataset to a Pangeo-owned GCS bucket and catalog. This can be proposed via an issue at the following GitHub repository.
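
As a sketch of the OPeNDAP route, a link can be passed straight to Xarray; the URL below is a placeholder:

import xarray as xr

# any OPeNDAP endpoint works here; this URL is hypothetical
ds = xr.open_dataset('https://some.thredds.server/thredds/dodsC/some/dataset')

And a sketch of making a Zarr copy in your own bucket, with hypothetical project, bucket, and file names:

import xarray as xr
import gcsfs

# authenticate against your own GCP project (all names here are placeholders)
fs = gcsfs.GCSFileSystem(project='my-gcp-project')
mapper = fs.get_mapper('my-bucket/my-dataset.zarr')

ds = xr.open_dataset('my_local_file.nc')   # the dataset you want to copy
ds.to_zarr(mapper, consolidated=True)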

Dask Clusters

On ocean.pangeo.io, you have the ability to create a Dask cluster on demand using Dask Kubernetes.

This can be done from Python code:

from dask_kubernetes import KubeCluster

# create cluster
cluster = KubeCluster() # use default configuration
cluster.adapt(minimum=1, maximum=20) # adaptively scale cluster

# connect to it
from dask.distributed import Client
client = Client(cluster)
client

Note: it can take the worker nodes up to 10 minutes to start up, depending on cluster load.

It can also be done interactively via the Dask labextension. First, click on the little Dask logo on the far right. Then click on the “+ NEW” button.

This will create a new cluster, which will show up in the panel below. Then click the green “SCALE” button to choose the size of your cluster.

20 workers is a good upper limit for most CMIP6 analysis.

Finally, connect to your cluster by clicking the < > button. This will inject code into your notebook like

from dask.distributed import Client

client = Client("tcp://10.32.13.17:41285")
client

(But the address and port number will be different.)

Try to use adaptive mode whenever possible. If you don’t use adaptive mode, make sure you manually shut down / scale down your cluster so as not to consume credits wastefully.
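
For example, from Python:

# release all workers when you are finished
cluster.scale(0)

# or tear down the cluster entirely
cluster.close()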
