Proposal: Open source guidebook for scientific cloud computing infrastructure

@TomNicholas and @martindurant’s questions at today’s Pangeo Showcase reminded me of this old, small issue I wrote proposing work on an open source guidebook for scientific cloud computing infrastructure. I don’t have any plans to pursue the idea, so I thought I’d share most of the content here in case it inspires others. Maybe even just an awesome list for Pangeo would help people stay up to date with options :man_shrugging:

Project Goal

Develop an open source guidebook for scientific cloud computing infrastructure, analogous to the Cloud-Optimized Geospatial Formats Guide. The primary audience for the guidebook would be research scientists, research software engineers, and data scientists. The secondary audience would be program managers. The guidebook would describe the infrastructure options available and provide a framework for choosing between them.

Project Deliverables

  • Open source guidebook containing information about different cloud computing infrastructure options and considerations for choosing an approach

Open questions:

  • Should this be limited to just platforms? Or also orchestration tools? Distributed computing frameworks? Integration with CI/CD?

Context

There has been a proliferation of open and proprietary tools for running computational workflows on the cloud. This presents a lot of opportunity, but can also be overwhelming for people trying to understand and adopt the best approach for their given situation. Much of the information comes from people who understandably want to sell their own product. An unbiased evaluation of the pros and cons of different cloud computing approaches would be tremendously valuable as a resource for promoting high-quality science on the cloud.

Resources:

(this is already out of date, which speaks to the challenges of maintaining such a guide)

Pangeo discussions:

Platforms

JupyterHub based platforms

Serverless

Other

Orchestration options

Distributed computing frameworks

Stakeholder communities

  • Pangeo
  • Openscapes
  • Bioinformatics
  • VEDA
  • Cryocloud
  • ESIP
  • Cloud-Native Geospatial Forum
  • ROpenSci
  • PyOpenSci

We had a similar goal when working on the Cubes & Clouds online course, which is available for free here: https://eo-college.org/courses/cubes-and-clouds/

We cover topics such as open science following the FAIR principles, data processing in the cloud using open APIs (openEO) and more.

All the material is on GitHub, and we are extending it to include Pangeo concepts and exercises. @tinaok and @annefou are also on the project, and we expect to release the updated version in about a month.

You are free to reuse any material :slight_smile:


This sounds like a great topic for a Cookbook that could live on the Project Pythia gallery. If there’s momentum around building such a resource, I can imagine it would make a great breakout group project at this coming summer’s cookbook-building hackathon.

The previous discussions on this topic touched upon how people look to the Pangeo community for recommendations on infrastructure and services, as Pangeo should be in a position to give impartial summaries. We also discussed how this is important enough that it deserves a page on the main Pangeo website, because cloud infrastructure and services really are an (optional) part of the cloud-native Pangeo stack that we are advocating for.

I think we also concluded that a reasonable way to go about this would be to have someone who doesn’t work at one of the platform services draft some summaries of various options (paid and open-source) for the Pangeo website, then invite representatives from the platforms to (publicly) comment before merging.
