Get Involved in Pangeo: Entry Points for New Contributors

rabernat · June 10, 2020, 1:43pm

The success of Pangeo derives from the diversity of our contributors. We have successfully assembled a community that crosses traditional disciplinary boundaries, and this has enabled us to do some innovative stuff. However, as the project has grown, our activities have sprawled across dozens of GitHub repos, making it hard to identify what needs to be done and where new contributors can have an impact.

In addition to disciplinary diversity, we also must continue to tackle other dimensions of diversity, particularly gender and race. I’m proud of the steps our community has taken in this direction. The first paragraph of our code of conduct reads:

We strive to be a community that welcomes and supports people of all backgrounds and identities. This includes, but is not limited to, members of any race, ethnicity, culture, national origin, color, immigration status, social and economic class, educational level, sex, sexual orientation, gender identity and expression, age, physical appearance, family status, technological or professional choices, academic discipline, religion, mental ability, and physical ability.

Being welcoming is a first step. But we must do more to actively recruit and support diverse contributors to Pangeo. This will benefit our project of course, but it is also a concrete action we can take to combat systematic racism. (See the ShutdownSTEM post for more context.) Let’s use Pangeo as a vehicle to help members of underrepresented groups build their skills and gain recognition in the field of geoscience / big data / software engineering.

Let’s use this thread as a place to collect potential projects for new contributors.
Let’s also collect information about internships, fellowships, etc. that can provide paid support to such contributors. While some people may be able to volunteer, we should not assume that everyone has this privilege.

Template

Please try to use this template for all posts.

# Project Title

link to GitHub issue (recommended)

## Description

One or two paragraph description of the project.

## Required Skills

What technical skills are needed in order to contribute? For example
- Basic python programming
- Some familiarity with kubernetes

## Mentors

(All projects need at least one mentor who is willing to help out the new contributors.)

- Name | Email Address

For Potential Contributors

Please email the mentor to express interest in a project and learn more about how to get started.

rabernat · June 10, 2020, 2:13pm

Matrix of Kubespawner `profile_list` Options

Description

Our cloud-based Jupyter hubs allow users to choose among different options for the environment in which the notebook server will run. For example, on ocean.pangeo.io, we see

These options, called `profile_list`, are passed to kubespawner (see docs). They include both hardware (CPU, memory, etc.) and software (specifically, the Docker image to use). We would like to be able to separate the hardware part from the software part. This would require making some changes to the kubespawner package to enable more flexible configuration of profiles.
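To make the "matrix" idea concrete, here is a minimal sketch that builds a `profile_list` as the cross product of separate hardware and software lists. It is a config-side workaround, not the kubespawner change this project proposes. The option names, resource values, and image tags are illustrative assumptions; `display_name` and `kubespawner_override` are the keys kubespawner reads from each profile.

```python
from itertools import product

# Hypothetical hardware options (names and sizes are illustrative only).
HARDWARE = [
    {"name": "small", "cpu_limit": 2, "mem_limit": "8G"},
    {"name": "large", "cpu_limit": 8, "mem_limit": "32G"},
]

# Hypothetical software options (image tags are illustrative only).
SOFTWARE = [
    {"name": "pangeo-notebook", "image": "pangeo/pangeo-notebook:latest"},
    {"name": "ml-notebook", "image": "pangeo/ml-notebook:latest"},
]


def build_profile_list(hardware, software):
    """Return one kubespawner profile per (hardware, software) pair."""
    profiles = []
    for hw, sw in product(hardware, software):
        profiles.append(
            {
                "display_name": f"{sw['name']} ({hw['name']})",
                "kubespawner_override": {
                    "image": sw["image"],
                    "cpu_limit": hw["cpu_limit"],
                    "mem_limit": hw["mem_limit"],
                },
            }
        )
    return profiles


profile_list = build_profile_list(HARDWARE, SOFTWARE)

# In a jupyterhub_config.py this would be assigned as:
# c.KubeSpawner.profile_list = profile_list
```

The drawback of this approach is that the user still sees one flat list whose length grows as hardware times software, which is exactly why a change inside kubespawner itself would be preferable.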

Required Skills

Intermediate python programming
Basic HTML

Mentors

Ryan Abernathey | rpa@ldeo.columbia.edu

rabernat · June 10, 2020, 2:25pm

Contribute Example Notebooks to Pangeo Gallery

http://gallery.pangeo.io/contributing.html

Pangeo Gallery is our new approach to sharing reproducible scientific content in the cloud. We are always looking for more examples of how to apply Pangeo tools (e.g. Xarray, Dask, etc.) to real-world scientific problems. If you already use these tools, contributing an example notebook to the gallery is a great way to get started as a new contributor.

Required Skills

Some domain scientific knowledge (e.g. oceanography, atmospheric science)
Basic scientific programming
Familiarity with Jupyter notebooks
Comfortable working with git / github

Mentors

Ryan Abernathey | rpa@ldeo.columbia.edu

rsignell · June 15, 2020, 1:10pm

Create an app for viewing terrain data using Xarray-spatial

Xarray-spatial is a new high performance package for raster-based spatial analysis for Python. The USGS has a large collection of terrain data in raster format, and is pushing this data to AWS in COG format (example here). This project would build a dashboard for exploring terrain data in Python using xarray-spatial and Panel, a high-level app and dashboarding solution for Python.

Required Skills

Basic knowledge of Python
Willingness to work on a cool project!

Mentors

Rich Signell | rsignell@usgs.gov

bradyrx · June 23, 2020, 4:31pm

Contribute to the `climpred` package for analyzing climate predictions

climpred is a package that uses pangeo-supported software like xarray and dask to make evaluating climate predictions easier. Many institutions are running climate models similar to a weather model to predict the Earth system anywhere from 2 weeks to decades in advance. These projects produce massive datasets and require users to tediously write code to assess how well the forecasts did. climpred automates a lot of the analysis (like aligning forecast times with real-world times and computing statistical metrics) so that users can get right to answering the scientific questions they care about.
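For readers new to forecast verification, the two steps mentioned above (aligning forecast times with real-world times, then computing a metric) can be sketched in plain Python. This is not climpred's API; the function names and the numbers are made up for illustration, and climpred itself works on xarray objects.

```python
from datetime import date, timedelta
from math import sqrt


def align(forecasts, observations):
    """Pair each forecast with the observation valid on init + lead days.

    Forecasts whose valid date has no observation are dropped.
    """
    pairs = []
    for init, lead_days, value in forecasts:
        valid = init + timedelta(days=lead_days)
        if valid in observations:
            pairs.append((lead_days, value, observations[valid]))
    return pairs


def rmse_by_lead(pairs):
    """Return the root-mean-square error for each lead time."""
    squared_errors = {}
    for lead, forecast, observed in pairs:
        squared_errors.setdefault(lead, []).append((forecast - observed) ** 2)
    return {lead: sqrt(sum(sq) / len(sq)) for lead, sq in squared_errors.items()}


# Made-up observations keyed by the date they are valid for.
observations = {
    date(2020, 1, 2): 10.0,
    date(2020, 1, 3): 12.0,
}

# Made-up forecasts as (initialization date, lead in days, predicted value).
forecasts = [
    (date(2020, 1, 1), 1, 11.0),
    (date(2020, 1, 1), 2, 12.0),
    (date(2020, 1, 2), 1, 9.0),
    (date(2020, 1, 2), 2, 13.0),  # valid 2020-01-04: no observation, dropped
]

skill = rmse_by_lead(align(forecasts, observations))
```

Even in this toy form, most of the code is bookkeeping about dates rather than science, which is the tedium the package aims to remove.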

What to contribute

Look for tags “Help Wanted”, “ASP Projects”, or “Good First Issue”

Required Skills

What technical skills are needed in order to contribute?

Intermediate python programming
Basic knowledge of git (Although see https://climpred.readthedocs.io/en/stable/contributing.html for instructions on how to contribute)
Some domain-specific knowledge of forecasting/forecast evaluation

Note that Aaron and I are eager to mentor anyone who is a first time contributor. The code review is friendly and you’ll learn a lot from the process. Feel free to email us if you have ideas or just want to help out and want some guidance on where to get started.

Mentors

Riley Brady (PhD candidate at CU Boulder)
riley.brady@colorado.edu
https://www.github.com/bradyrx
Aaron Spring (PhD candidate at MPI in Hamburg, Germany)
@aaronspring
aaron.spring@mpimet.mpg.de
https://www.github.com/aaronspring

nicholaskgeorge · July 15, 2020, 3:32pm

Making Terrain Data Viewer with Xarray

Hello everyone! My name is Nicholas George and I am currently working on a short internship with USGS to create an interactive viewer for terrain data using Xarray in tandem with Panel and hvPlot. Our goal is to use these tools to make a viewing window for COG data which is both interactive and informative. If you are interested in seeing what we have done, the GitHub repo can be found here

cgentemann · August 27, 2020, 9:11pm

Test cloud optimized data formats for swath (L2) satellite data

Description

As NASA and NOAA data are moving to the cloud, what is the best cloud-optimized format for swath data? NASA has explored different formats and written a report that presents some viable options for swath data. For this project, we will stage a sample of the MODIS L2P data (from PO.DAAC) on Pangeo, transform it to a couple of different formats (e.g. Zarr, cloud-optimized HDF) and test access and analysis times for a few different likely patterns of analysis, such as collocation with random points that are globally distributed, finding all data within a bounding box, etc.
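A timing harness for this kind of comparison could look roughly like the sketch below. It uses only the standard library and synthetic points as a stand-in for swath pixels; in the real test, the timed function would open and subset the data from each candidate format. The function names and the bounding box are assumptions chosen for illustration.

```python
import random
import statistics
import time


def in_bbox(points, lon_min, lon_max, lat_min, lat_max):
    """Return the (lon, lat, value) points that fall inside a bounding box."""
    return [
        p
        for p in points
        if lon_min <= p[0] <= lon_max and lat_min <= p[1] <= lat_max
    ]


def time_access(func, repeats=5):
    """Call func several times and return the median wall-clock time in seconds."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        func()
        timings.append(time.perf_counter() - start)
    return statistics.median(timings)


# Synthetic stand-in for swath pixels; a real test would read from each format.
rng = random.Random(0)
pixels = [
    (rng.uniform(-180, 180), rng.uniform(-90, 90), rng.random())
    for _ in range(10_000)
]

elapsed = time_access(lambda: in_bbox(pixels, -130, -110, 30, 50))
print(f"bounding-box query: {elapsed:.6f} s (median of 5 runs)")
```

Reporting the median rather than a single run reduces the influence of caching and network jitter, which matter a lot when the data sit in object storage.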

Required Skills

What technical skills are needed in order to contribute? For example

Basic-Intermediate python programming (Xarray, matplotlib)

Mentors

Chelle Gentemann, cgentemann@faralloninstitute.org

rabernat · September 2, 2020, 2:33pm

NGINX Proxy for Cloud Storage

Description

In Pangeo, we love using Zarr as a cloud optimized data format, and are trying to push data providers to start serving data in Zarr directly from cloud object storage. So far, our Zarr cloud datasets have been either

Totally public (no authentication required at all)
Requester pays (requiring credentials from the cloud provider, but any credentials will do)

However, many data providers (e.g. NASA) want to have more fine-grained control over who can access which datasets. They also want logging of who is downloading their data. While this is theoretically possible using IAM roles, that is probably not scalable. These services can have thousands of users, and creating a unique identity for each one using the cloud-provider’s identity system is probably not feasible. It also might present security issues.

So we need some way to manage and restrict access to the underlying object store using an external (e.g. oauth2) identity provider.

My idea is to use NGINX for this. NGINX can pretty easily be configured to proxy cloud storage. Here’s an example:

If we put the NGINX proxies inside an auto-scaling kubernetes cluster, we should be able to scale up and down in response to load to avoid excessive compute charges.

What we would need to add to this would be JWT authentication support. Based on the NGINX docs, that seems relatively straightforward.
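To show what the JWT check amounts to, here is a minimal standard-library sketch that issues and verifies an HS256-signed token. It is not NGINX configuration and not a replacement for a maintained JWT library; it only illustrates the test a proxy would apply before forwarding a request to object storage. The secret, the claims, and the function names are assumptions.

```python
import base64
import hashlib
import hmac
import json
import time


def _b64url_encode(data: bytes) -> str:
    """Base64url-encode without padding, as JWT requires."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def _b64url_decode(text: str) -> bytes:
    """Decode base64url text, restoring the stripped padding."""
    return base64.urlsafe_b64decode(text + "=" * (-len(text) % 4))


def make_token(claims: dict, secret: bytes) -> str:
    """Create an HS256-signed token (stands in for the identity provider)."""
    header = {"alg": "HS256", "typ": "JWT"}
    signing_input = ".".join(
        _b64url_encode(json.dumps(part, separators=(",", ":")).encode())
        for part in (header, claims)
    )
    signature = hmac.new(secret, signing_input.encode("ascii"), hashlib.sha256).digest()
    return f"{signing_input}.{_b64url_encode(signature)}"


def verify_token(token: str, secret: bytes, now=None):
    """Return the claims if the token is validly signed and unexpired, else None."""
    try:
        header_b64, payload_b64, signature_b64 = token.split(".")
        header = json.loads(_b64url_decode(header_b64))
        claims = json.loads(_b64url_decode(payload_b64))
        signature = _b64url_decode(signature_b64)
    except ValueError:
        return None
    if not isinstance(header, dict) or not isinstance(claims, dict):
        return None
    if header.get("alg") != "HS256":
        return None
    expected = hmac.new(
        secret, f"{header_b64}.{payload_b64}".encode("ascii"), hashlib.sha256
    ).digest()
    if not hmac.compare_digest(signature, expected):
        return None
    now = time.time() if now is None else now
    exp = claims.get("exp")
    if not isinstance(exp, (int, float)) or exp < now:
        return None
    return claims


SECRET = b"demo-secret"  # hypothetical shared secret
token = make_token({"sub": "user@example.org", "exp": 2_000_000_000}, SECRET)
claims = verify_token(token, SECRET, now=1_600_000_000)
```

An OAuth2 provider would more likely sign with an asymmetric algorithm such as RS256, so that the proxy only needs the public key; HS256 is used here because it needs nothing beyond the standard library.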

Logging could also be configured to track usage and downloads. Here’s a diagram of how it might work.

Required Skills

Perhaps I’m underestimating the difficulty, but I feel like this would be a < 1 day project for the right person. We’re looking for someone who understands

NGINX configuration
Oauth2 and JWT
Kubernetes

Mentors

Ryan Abernathey | rpa@ldeo.columbia.edu

rabernat · September 2, 2020, 9:02pm

On this last topic, at our latest meeting, people suggested signed URLs, which could solve the problem more elegantly:
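For anyone unfamiliar with the general idea, a signed URL carries an expiry time and a signature in its query string, so the storage side can check access without a per-user cloud identity. The sketch below is a generic HMAC-based illustration, not the scheme of any particular cloud provider and not the specific proposal from the meeting. The host name, secret, and parameter names are hypothetical.

```python
import hashlib
import hmac
from urllib.parse import parse_qs, urlencode, urlsplit


def sign_url(base_url, path, expires, secret):
    """Return a URL for path that is valid until the Unix time `expires`."""
    message = f"{path}\n{expires}".encode()
    signature = hmac.new(secret, message, hashlib.sha256).hexdigest()
    query = urlencode({"expires": expires, "signature": signature})
    return f"{base_url}{path}?{query}"


def check_url(url, secret, now):
    """Return True if the URL's signature matches and it has not expired."""
    parts = urlsplit(url)
    query = parse_qs(parts.query)
    try:
        expires = int(query["expires"][0])
        signature = query["signature"][0]
    except (KeyError, ValueError):
        return False
    message = f"{parts.path}\n{expires}".encode()
    expected = hmac.new(secret, message, hashlib.sha256).hexdigest()
    return hmac.compare_digest(signature.encode(), expected.encode()) and now <= expires


SECRET = b"demo-secret"  # hypothetical key held by the data provider
url = sign_url(
    "https://data.example.org",  # hypothetical host
    "/bucket/dataset.zarr/.zmetadata",
    1_700_000_000,
    SECRET,
)
ok = check_url(url, SECRET, now=1_600_000_000)
```

Because the signature covers both the path and the expiry, changing either one invalidates the URL, and the provider can log each issued URL against the authenticated user who requested it.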

jukent · September 24, 2020, 4:22pm

@bradyrx Would you be interested in turning this into a SCIParCS project (Internship Projects)? We can discuss more if you are!

bradyrx · September 24, 2020, 4:36pm

@jukent, yes that would be great. We will be releasing our next version in the next couple weeks as well as a paper to JOSS so it will be in a great place for contributing next summer for SCIParCS. We also just migrated over to pangeo-data.

@aaronspring would have to serve as a mentor most likely. I’m still trying to navigate what IP law and my free time will look like at my new job. But definitely can get this spun up now and then will know in the spring about IP/time.

jukent · September 24, 2020, 4:50pm

Someone at NCAR needs to be the official mentor (likely me), but we would reach out for questions and help much like any other Pangeo project. I would ask you to read the project proposal I write to make sure it is in line with your goals, then I would try to make at least some contribution to get myself spun up on the package (and I will probably need some help at this stage), and then the intern and I could be as independent or collaborate as much as need be with the uncertainty of next summer for you and @aaronspring.

The internship proposals are due October 7th, before you’ll know about your time availability, so we will write it without any promises of engagement from you. Congratulations!

bradyrx · September 24, 2020, 5:01pm

That sounds perfect to me Julia! Let’s go for it.

jukent · September 24, 2020, 5:10pm

Is https://climpred.readthedocs.io/en/stable/contributing.html up to date or is there a new document over at pangeo-data?

bradyrx · September 24, 2020, 5:26pm

Yes the docs are up-to-date with the new repo location. We have a lot of updates rolling out for the next release that are on https://climpred.readthedocs.io/en/latest (latest, not stable). Plenty of un-addressed issues in the tracker as well: https://github.com/pangeo-data/climpred/issues.

The CI is really good and the code base is generally good. I’m trying to work on converting the whole code base over to an inheritance-based system. I am sketching it out at https://github.com/bradyrx/climpred_skeleton.

Anyways, if you end up finding the code base to be hard to understand, please let us know. Always looking for ways to make contributing easier. I feel like switching to inheritance will be hard in some ways for contributors, but should clean up a lot of redundant code that we currently have.

jukent · September 30, 2020, 10:49pm

@cgentemann Would you be interested in turning this into a SCIParCS project (Internship Projects)? I am very familiar with MODIS satellite but don’t have the most experience with the cloud so I could use the expertise of a co-mentor. We can discuss more if you are interested.

cgentemann · September 30, 2020, 11:56pm

@jukent, sure! Let me know what you need from me to do this.

Michael_Sumner · September 8, 2024, 9:00pm

Did this work or anything related go ahead? I’m trying to find similar followups
