This post will serve as the official operating guide for hackathon participants.
Table of contents:
- Team structure
- Git / GitHub Setup
- Cloud vs Cheyenne, how to choose?
- How to browse the data catalogs and load data
- Guidelines for usage of dask
- Pointers to other useful libraries
- Final deliverables
A team is a group of people collaborating on one of the projects listed in #cmip6hack:cmip6hack-projects. We imagine that project teams will have between 2 and 10 people. Teams may be split across multiple sites or all at one site.
Each team will have a de-facto team leader, probably whoever originally proposed the project. This person will have a few special roles, such as owning the project GitHub repository. The leader will probably want to create a Slack channel in the cmip6hackers.slack.com workspace to facilitate communication. Teams are strongly encouraged to be inclusive and supportive of participants with different levels of skill and experience. Some participants will have stronger programming skills, while others may have deeper scientific domain knowledge.
Your team should think carefully about how to divide up the project work efficiently. How can you work in parallel, rather than in serial, so that each person / subgroup can contribute? Many projects will look something like this:
1. Search for and load a set of datasets from the data catalog
2. Subset the data by time / region
3. Perform some fancy data analysis methods which transform / reduce the data
4. Make some figures to visualize the results
In this scenario, one sub-group could start working on step 3 (developing and fine-tuning the analysis methods) immediately, using just a small subset of data, while another group works on steps 1 and 2.
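The division of labor above can be sketched end-to-end in miniature. The snippet below fakes steps 1 and 2 with a random array (in a real project the data would come from the CMIP6 catalog via xarray), then runs a toy step-3 reduction; step 4 would hand `annual_mean` to a plotting library such as matplotlib. All sizes and index ranges here are arbitrary.

```python
import numpy as np

# Toy stand-in for steps 1-2: a "loaded" monthly temperature field
# (120 months x 18 lat x 36 lon), then a regional subset.
rng = np.random.default_rng(0)
temp = 280 + 10 * rng.random((120, 18, 36))   # Kelvin, monthly means

region = temp[:, 4:9, 10:20]                  # subset by lat/lon indices

# Step 3: a simple "analysis" that reduces the data -- an annual-mean
# regional time series (10 years x 12 months, averaged over space).
annual_mean = region.reshape(10, 12, *region.shape[1:]).mean(axis=(1, 2, 3))

print(annual_mean.shape)  # one value per year
```

Because step 3 only needs *some* data with the right shape, the analysis sub-group can iterate on it before the full catalog-loading pipeline is finished.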
All project code should be stored in a git repository and shared via GitHub. All team members should make sure they have a GitHub account prior to arriving at the hackathon. (Students should make sure to take advantage of the GitHub for Education packs.) Participants who are unfamiliar with git should review the Software Carpentry Git novice tutorial.
We recommend that all projects that plan to analyze CMIP6 data follow our project template.
This template contains the data catalogs for both Cheyenne and Google Cloud and some examples to get you started. (If you are working primarily on tool development rather than CMIP6 data analysis, it might not make sense for you to follow the template; use your best judgement.)
The project leader should use the template to create a new GitHub repo under their own GitHub account.
There are two ways for the team to collaborate via git:
1. The leader may grant access to collaborators so other team members can push directly to the repository.
2. Each team member may fork the leader’s repo and make updates via pull requests.
Which one you choose depends on your team’s experience level and comfort with git. If no one on your team has ever used pull requests before, probably go with option 1.
Regardless of which option you choose, you are encouraged to commit to your repo often. You should make a commit every time you make an increment of progress. This will help your team stay in sync.
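As a minimal illustration of the commit-often habit, here is a self-contained sketch (throwaway temporary repo and made-up file names, just for demonstration; in a real project you would also push after each commit so teammates see your progress):

```shell
set -e
workdir=$(mktemp -d)            # throwaway repo, purely for illustration
cd "$workdir"
git init -q
git config user.name "CMIP6 Hacker"
git config user.email "hacker@example.com"

echo "print('regional means')" > analysis.py
git add analysis.py
git commit -q -m "Add first draft of regional-mean analysis"

echo "print('now with plots')" >> analysis.py
git commit -q -am "Plot regional means"

git log --oneline               # two small commits, one per increment
```

Small, frequent commits like these make it much easier to merge work from several teammates and to roll back a change that breaks a notebook.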
Beyond git, you are encouraged to use shared online docs (e.g. google docs, hackmd, etc.) for writing notes and sharing slides.
A central idea for this hackathon is to use a large, shared computing resource where the CMIP6 data are already downloaded, rather than your personal computer. Your team will need to choose which environment you want to use: NCAR’s Cheyenne Supercomputer or Pangeo’s Google-Cloud-based cluster ocean.pangeo.io.
Your choice will likely depend on which data you want to access. At the highest level, the cloud environment is more flexible, but the Cheyenne environment has more data. It’s recommended you use the online CMIP6 data browser (follow links below) to check and see if your desired data are present in the environment of your choice.
| | Cloud (ocean.pangeo.io) | Cheyenne |
| --- | --- | --- |
| CMIP6 data | ~100 TB subset of CMIP6 | Larger subset of CMIP6 |
| Other data | Pangeo Cloud Datastore + OPeNDAP | NCAR Research Data Archive |
| Queue | No queue; immediate launching of compute nodes | Have to wait in the Cheyenne queue |
| Connectivity | Environment fully connected to the internet | No internet connectivity |
Before beginning to hack, you should make sure everyone on your team has logged in and has a fully functional environment on your platform of choice.
Using ocean.pangeo.io (Google Cloud Environment)
Please refer to this post regarding accessing the Google Cloud environment:
Using Cheyenne
Please refer to this post regarding how to access Cheyenne for the hackathon:
The data files on each platform are cataloged in a simple CSV file, following the ESM Collection Specification. The template repo contains links to these files in the `catalogs/` directory. For completeness, we list them here as well:
- Cloud: https://storage.googleapis.com/cmip6/cmip6-zarr-consolidated-stores.csv
You can explore these files directly if you wish. For example, this example notebook uses pandas to read the CSV and filter it to find the desired datasets.
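A catalog query with pandas might look like the following. The two rows here are fabricated stand-ins for the real CSV (the column set roughly follows the ESM Collection layout of the cloud catalog, but may differ in detail); in practice you would `pd.read_csv` the URL listed above instead.

```python
import io
import pandas as pd

# Fabricated two-row stand-in for the real catalog CSV. In practice:
# df = pd.read_csv("https://storage.googleapis.com/cmip6/"
#                  "cmip6-zarr-consolidated-stores.csv")
csv = io.StringIO(
    "activity_id,source_id,experiment_id,member_id,table_id,variable_id,zstore\n"
    "CMIP,EX-MODEL,historical,r1i1p1f1,Amon,tas,gs://cmip6/fake/path1\n"
    "ScenarioMIP,EX-MODEL,ssp585,r1i1p1f1,Omon,tos,gs://cmip6/fake/path2\n"
)
df = pd.read_csv(csv)

# Filter to monthly near-surface air temperature from the historical runs.
hits = df.query("variable_id == 'tas' and experiment_id == 'historical'")
print(hits["zstore"].tolist())
```

Each surviving `zstore` (or path) entry points at one dataset, which you can then open with xarray.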
An alternative approach which offers more convenience, but also more abstraction, is to use intake-esm to help with searching the data catalogs and loading the data. Intake-esm can automatically merge compatible data files (such as different variables from the same model) into a single xarray dataset. There is an example of this in the project template repo.
Link to post or slides from Anderson.
Dask is available on both the cloud and Cheyenne to help parallelize large calculations. All environments support the standard multi-threaded Dask scheduler, and by default the datasets will open as dask-backed xarray datasets. Users can also create Dask distributed schedulers which can take advantage of many compute nodes: cloud users should use Dask Kubernetes, while Cheyenne users should use Dask Jobqueue.
Some general guidelines for using Dask.
- Familiarize yourself with Dask best practices.
- Don’t use Dask! Or more specifically, only use a distributed cluster if you really need it, i.e. if your calculations are running out of memory or are taking an unacceptably long time to complete. For example, analysis using only monthly-mean surface values shouldn’t need a distributed cluster.
- Start small; work on a small subset of your problem to debug before scaling up to a very large dataset.
- If you use a distributed cluster, use adaptive mode rather than a fixed-size cluster; this will help share resources more effectively.
- Use the Dask dashboard heavily to monitor the activity of your cluster.
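To make the "start small" and "only use distributed if you need it" points concrete, here is a sketch using plain `dask.array` with the default multi-threaded scheduler (array sizes and chunking are arbitrary):

```python
import dask.array as da

# A synthetic monthly field (120 months x 180 lat x 360 lon), chunked
# along time so the reduction parallelizes across threads -- no
# distributed cluster required at this scale.
field = da.random.random((120, 180, 360), chunks=(12, 180, 360))

climatology = field.mean(axis=0)   # lazy: this only builds a task graph

result = climatology.compute()     # runs on the default threaded scheduler
print(result.shape)
```

Only once a computation like this runs out of memory or takes unreasonably long is it worth spinning up a Dask Kubernetes or Dask Jobqueue cluster.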
This topic will be addressed in a follow-up post.
Your final output will be:
- A [hopefully] working GitHub repository of analysis notebooks
- A brief presentation about your outcomes to the plenary session
We expect that the code will be licensed under an open-source license which allows others to build on your results. We also encourage you to obtain a DOI for your repository using Zenodo. This will enable others to properly cite and attribute your work! In many cases, we expect that the preliminary results from the hackathon will go on to become peer-reviewed publications. In that case, authorship should be offered to everyone on the project team. Such publications should acknowledge support from Pangeo (NSF award 1740648) and other hackathon sponsors (TBA).