Hackathon Operating Guide

rabernat · October 15, 2019, 7:05pm

This post will serve as the official operating guide for hackathon participants.

Table of contents:

Team structure
Git / GitHub Setup
Cloud vs Cheyenne, how to choose?
How to browse the data catalogs and load data
Guidelines for usage of dask
Pointers to other useful libraries
Final deliverables

Project Team Structure

A team is a group of people collaborating on one of the projects listed in CMIP6 Project Proposals. We imagine that project teams will have between 2 and 10 people. Teams may be split across multiple sites or all at one site.

Each team will have a de facto team leader, probably whoever originally proposed the project. This person will have a few special roles, such as being the owner of the project GitHub repository. The leader will probably want to create a Slack channel in the cmip6hackers.slack.com workspace to facilitate communication. Teams are strongly encouraged to be inclusive and supportive of participants with different levels of skills and experience. Some participants will have more programming skills while others may have more scientific domain knowledge.

Your team should think carefully about how to divide up the project work efficiently. How can you work in parallel, rather than in serial, so that each person / subgroup can contribute? Many projects will look something like this:

1. Search and load a set of datasets from the data catalog
2. Subset the data by time / region
3. Perform some fancy data analysis methods which transform / reduce the data
4. Make some figures to visualize the results

In this scenario, one sub-group could start working on step 3 (developing and fine-tuning the analysis methods) immediately using just a small subset of data, while another group works on steps 1 and 2.
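One way to make this split concrete is to agree on function interfaces early and let the analysis sub-group develop against synthetic data. The sketch below is only an illustration: every function name is hypothetical, the "analysis" is a placeholder (anomalies from the mean), and plain Python lists stand in for real datasets.

```python
import math
import random


def load_datasets():
    """Step 1 (data sub-group): search the catalog and load data. Stubbed for now."""
    raise NotImplementedError("to be filled in by the data sub-group")


def subset(series, start, stop):
    """Step 2: subset by time. Here this is just a slice of monthly values."""
    return series[start:stop]


def analyze(series):
    """Step 3: placeholder analysis that returns anomalies relative to the mean."""
    mean = sum(series) / len(series)
    return [value - mean for value in series]


def synthetic_series(n_months=24, seed=0):
    """Fake seasonal cycle plus noise, so step 3 can start before steps 1-2 exist."""
    rng = random.Random(seed)
    return [math.sin(2 * math.pi * m / 12) + rng.gauss(0, 0.1) for m in range(n_months)]


# The analysis sub-group can run this today; later, swap in load_datasets().
anomalies = analyze(subset(synthetic_series(), 0, 12))
```

Once the data sub-group has a working `load_datasets`, only the last line needs to change.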

Git / GitHub Setup

All project code should be stored in a git repository and shared via GitHub. All team members should make sure they have a GitHub account before arriving at the hackathon. (Students should make sure to take advantage of the GitHub for education packs.) Participants who are unfamiliar with git should review the Software Carpentry Git novice tutorial.

We recommend that all projects that plan to analyze CMIP6 data follow our project template.

This template contains the data catalogs for both Cheyenne and Google Cloud and some examples to get you started. (If you are working primarily on tool development rather than CMIP6 data analysis, it might not make sense for you to follow the template; use your best judgement.)

The project leader should use the template to create a new GitHub repo under their own GitHub account.

There are two ways for the team to collaborate via git:

1. The leader may wish to grant access to collaborators so other team members can push directly to the repository.
2. Or the team may decide to have each team member fork the leader’s repo and make updates via pull requests.

Which one you choose depends on your team’s experience level and comfort with git. If no one on your team has ever used pull requests before, probably go with option 1.

Regardless of which option you choose, you are encouraged to commit to your repo often. You should make a commit every time you make an increment of progress. This will help your team stay in sync.

Beyond git, you are encouraged to use shared online docs (e.g. Google Docs, HackMD, etc.) for writing notes and sharing slides.

Cloud vs. Cheyenne: How to Choose

A central idea for this hackathon is to use a large, shared computing resource where the CMIP6 data are already downloaded, rather than your personal computer. Your team will need to choose which environment you want to use: NCAR’s Cheyenne Supercomputer or Pangeo’s Google-Cloud-based cluster ocean.pangeo.io.

Your choice will likely depend on which data you want to access. At the highest level, the cloud environment is more flexible, but the Cheyenne environment has more data. We recommend using the online CMIP6 data browser (follow links below) to check whether your desired data are present in the environment of your choice.

	Google Cloud	Cheyenne
CMIP6 data	~100TB subset of CMIP6	Larger subset of CMIP6
Other data	Pangeo Cloud Datastore + OpenDAP	NCAR Research Data Archive
Access	ORCID	CISL Account
Queue	No queue, immediate launching of compute nodes	Have to wait in the Cheyenne queue
Connectivity	Environment fully connected to the internet	No internet connectivity

Before beginning to hack, you should make sure everyone on your team has logged in and has a fully functional environment on your platform of choice.

Using ocean.pangeo.io (Google Cloud Environment)

Please refer to the post "Using ocean.pangeo.io for the CMIP6 Hackathon" regarding accessing the Google Cloud environment.

Using Cheyenne

Please refer to this post regarding how to access Cheyenne for the hackathon:

How to Browse Data Catalogs and Load Data

The data files on each platform are cataloged in a simple CSV file, following the ESM Collection Specification. The template repo contains links to these files in the catalogs/ directory. For completeness, we list them here as well:

Glade: /glade/collections/cmip/catalog/intake-esm-datastore/catalogs/glade-cmip6.csv.gz
Cloud: https://storage.googleapis.com/cmip6/cmip6-zarr-consolidated-stores.csv

You can explore these files directly if you wish. For example, this example notebook uses pandas to read the CSV and filter it to find the desired datasets.
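A minimal sketch of that pandas approach follows. The helper names are hypothetical, and the column names (`source_id`, `experiment_id`, `table_id`, `variable_id`) are an assumption based on the CMIP6 vocabulary; check `catalog.columns` on the real file before relying on them. The tiny in-memory table only stands in for the catalog so the example runs without network access.

```python
import pandas as pd

CLOUD_CATALOG = "https://storage.googleapis.com/cmip6/cmip6-zarr-consolidated-stores.csv"


def load_catalog(path=CLOUD_CATALOG):
    """Read a catalog CSV into a DataFrame (a .csv.gz path is decompressed automatically)."""
    return pd.read_csv(path)


def find_datasets(catalog, **criteria):
    """Return the catalog rows whose columns equal every requested value."""
    mask = pd.Series(True, index=catalog.index)
    for column, value in criteria.items():
        mask &= catalog[column] == value
    return catalog[mask]


# Stand-in for load_catalog(); the column names here are assumed, not verified.
sample = pd.DataFrame(
    {
        "source_id": ["CESM2", "CESM2", "GFDL-CM4"],
        "experiment_id": ["historical", "ssp585", "historical"],
        "table_id": ["Amon", "Amon", "Amon"],
        "variable_id": ["tas", "tas", "pr"],
    }
)
matches = find_datasets(sample, experiment_id="historical", variable_id="tas")
```

On a hackathon platform, `find_datasets(load_catalog(), ...)` would apply the same filter to the full catalog.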

An alternative approach which offers more convenience, but also more abstraction, is to use intake-esm to help with searching the data catalogs and loading the data. Intake-esm can automatically merge compatible data files (such as different variables from the same model) into a single xarray dataset. There is an example of this in the project template repo.

Link to post or slides from Anderson.

Using Dask

Dask is available on both cloud and Cheyenne to help parallelize large calculations. All environments support the standard multi-threaded Dask scheduler, and by default, the datasets will open as Dask-backed xarray datasets. Users can create Dask distributed schedulers which can take advantage of many compute nodes. Cloud users should use Dask Kubernetes, while Cheyenne users should use Dask Jobqueue.

Some general guidelines for using Dask:

Familiarize yourself with Dask best practices.
Don’t use Dask! Or more specifically, only use a distributed cluster if you really need it, i.e. if your calculations are running out of memory or are taking an unacceptably long time to complete. For example, analysis using only monthly-mean surface values shouldn’t need a distributed cluster.
Start small; work on a small subset of your problem to debug before scaling up to a very large dataset.
If you use a distributed cluster, use adaptive mode rather than a fixed-size cluster; this will help share resources more effectively.
Use the Dask dashboard heavily to monitor the activity of your cluster.
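The "start small" and "adaptive mode" guidelines above can be sketched as follows. This is an illustration under stated assumptions: the array is synthetic, the function names are hypothetical, the worker limits are arbitrary, and `LocalCluster` is only a single-machine stand-in for the cluster object you would get from Dask Kubernetes or Dask Jobqueue on the hackathon platforms.

```python
import dask.array as da


def time_mean(field):
    """Lazily average a (time, lat, lon) dask array over its time axis."""
    return field.mean(axis=0)


def compute_on_adaptive_cluster(lazy_result, minimum=1, maximum=4):
    """Compute on a distributed cluster that scales between the given worker counts."""
    # Imported here so the single-machine path below does not need dask.distributed.
    from dask.distributed import Client, LocalCluster

    # LocalCluster is a stand-in; swap in your platform's cluster class.
    with LocalCluster(n_workers=0, threads_per_worker=1, processes=False) as cluster:
        cluster.adapt(minimum=minimum, maximum=maximum)  # adaptive, not fixed size
        with Client(cluster) as client:
            print("Dashboard:", client.dashboard_link)
            return lazy_result.compute()


# Synthetic stand-in for a dataset: 120 monthly steps on a coarse grid.
field = da.ones((120, 18, 36), chunks=(12, 18, 36))

# Debug on one year first, using the default threaded scheduler (no cluster).
subset_result = time_mean(field[:12]).compute()
```

Only if the full calculation proves too slow or too large for memory would you call something like `compute_on_adaptive_cluster(time_mean(field))`.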

Pointers to Useful Libraries

This topic will be addressed in a follow-up post.

Final Deliverables and Next Steps

Your final output will be:

A [hopefully] working GitHub repository of analysis notebooks
A brief presentation about your outcomes to the plenary session

We expect that the code will be licensed using an open-source license which allows others to build on your results. We also encourage you to obtain a DOI for your repository using Zenodo. This will enable others to properly cite and attribute your work! In many cases, we expect that the preliminary results from the hackathon will go on to become peer-reviewed publications. In this case, authorship should be offered to everyone on the project team. Such publications should acknowledge support from Pangeo (NSF award 1740648) and other hackathon sponsors (TBA).

andersy005 · October 16, 2019, 12:23pm

@rabernat,

For the intake-esm links,

https://intake-esm.readthedocs.io/en/latest/notebooks/tutorial.html, and
https://andersonbanihirwe.dev/talks/intake-esm-cmip6-2019.html might be useful.

lettie-roach · October 16, 2019, 4:37pm

Link 11 seems to be broken - the example notebook for reading the CSV catalog

rabernat · October 16, 2019, 4:57pm

I can see that! Unfortunately it looks like a problem with nbviewer and out of our hands…

A link to the original repo and an interactive binder can be found in this post:
https://discourse.pangeo.io/t/cloud-example-3hr-precip-frequency-distribution/155/2
