@TomNicholas and @martindurant’s questions at today’s Pangeo Showcase reminded me of a small, old issue I wrote proposing an open source guidebook for scientific cloud computing infrastructure. I don’t have any plans to pursue the idea, so I thought I’d share most of the content here in case it inspires others. Even just an awesome list for Pangeo might help people stay up to date with the options.
Project Goal
Develop an open source guidebook for scientific cloud computing infrastructure, analogous to the Cloud-Optimized Geospatial Formats Guide. The primary audience for the guidebook would be research scientists, research software engineers, and data scientists. The secondary audience would be program managers. The guidebook would describe the infrastructure options available and provide a decision framework for choosing between them.
Project Deliverables
- Open source guidebook containing information about different cloud computing infrastructure options and considerations for choosing an approach
Open questions:
- Should this be limited to just platforms? Or also orchestration tools? Distributed computing frameworks? Integration with CI/CD?
Context
There has been a proliferation of open and proprietary tools for running computational workflows on the cloud. This presents a lot of opportunity, but it can also be overwhelming for people trying to understand and adopt the best approach for their situation. Much of the available information comes from people who understandably want to sell their own product. An unbiased evaluation of the pros and cons of different cloud computing approaches would be a tremendously valuable resource for promoting high-quality science on the cloud.
Resources:
(this is already out of date, which speaks to the challenges of maintaining such a guide)
Pangeo discussions:
- What should a Pangeo 2.0 cloud tech stack look like?
- What's Next - Cloud - Partner-managed Infrastructure
- What's Next - Cloud - Pangeo-managed Infrastructure
Platforms
JupyterHub based platforms
Serverless
Other
- GitHub Codespaces
- Posit Cloud
- Eratos
- Google Colab
- SageMaker Studio Lab
- Saturn Cloud
- JupyterLite
- Apache Airflow
- Google Earth Engine
Orchestration options
Distributed computing frameworks
Stakeholder communities
- Pangeo
- Openscapes
- Bioinformatics
- VEDA
- CryoCloud
- ESIP
- Cloud-Native Geospatial Forum
- rOpenSci
- pyOpenSci