This is part of a follow-on conversation from the “What’s next for Pangeo” discussion that took place 2023-12-06. It is part of the overarching Cloud Infrastructure topic.
Since fairly early on in the first grant, people directly funded by Pangeo grants have managed cloud infrastructure for public use. This was both wonderful (broad public access) and terrible (maintaining systems and supporting users was hard).
There are many things that we could support and own ourselves. Let’s list a few:
Cloud storage, like s3://pangeo or s3://pangeo-scratch buckets
Hubs of some form
Something else? Ideas welcome.
What can Pangeo provide directly to the community that would provide value?
What are mechanisms to make that happen?
I suspect that even if Pangeo decides to leverage partners, there will still be some value in providing some core cloud infrastructure. An AWS account that beginners could use, or small buckets for scratch space that delete data aggressively, could be helpful and cheap to provide. Even if we leverage partners, having a baseline account with some cost controls on it would probably be a nice thing to have. I hope that this can be made cheap enough to operate out of community funds.
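As a rough illustration of what "delete data aggressively" could look like on S3, here is a minimal sketch that builds a lifecycle configuration expiring every object after a fixed number of days. The function name, the 7-day and 1-day defaults, and the use of the s3://pangeo-scratch bucket in the comment are assumptions for illustration only, not an existing Pangeo policy.

```python
def scratch_lifecycle_rules(expire_days=7, abort_upload_days=1):
    """Build an S3 lifecycle configuration for a scratch bucket.

    Hypothetical helper: the defaults are illustrative, not a Pangeo policy.
    """
    return {
        "Rules": [
            {
                "ID": f"expire-scratch-after-{expire_days}-days",
                "Status": "Enabled",
                # An empty prefix applies the rule to every object in the bucket.
                "Filter": {"Prefix": ""},
                # Delete objects this many days after they were created.
                "Expiration": {"Days": expire_days},
                # Also clean up unfinished multipart uploads, which cost money silently.
                "AbortIncompleteMultipartUpload": {
                    "DaysAfterInitiation": abort_upload_days
                },
            }
        ]
    }


config = scratch_lifecycle_rules(expire_days=7)

# With boto3 installed and credentials configured, this could be applied with:
#   boto3.client("s3").put_bucket_lifecycle_configuration(
#       Bucket="pangeo-scratch", LifecycleConfiguration=config)
```

Because the expiry is enforced by the storage service itself, nobody has to run or maintain a cleanup job, which is part of what would keep such a bucket cheap to operate.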
How to deploy different elements of the Pangeo stack (storage, compute) in a way that’s sustainable over the long-term;
How to make sure Pangeo remains a laboratory with as little institutional friction as possible;
I think there is a tension between those two objectives that has to be addressed, but my hope is that the “storage” component of the Pangeo stack can be shown to be mature enough to be adopted by large institutions, who’ll do the hard work of funding the hardware, bandwidth, help desk support, maintenance, etc. Hopefully, this will free the Pangeo community to focus on innovation.
I would also be curious to see if we can bring back the pangeo binder in a more sustainable manner. mybinder.org continues to chug along as a community-run project. I’d love to find a debrief on what the main systematic challenges were here (cryptomining attacks were a big one, which I’ve been putting renewed efforts into fighting on mybinder.org) and how we can possibly work through them.
I think @scottyhq was the primary person doing a lot of this work at that time (I may be wrong!). Would love to hear from him!
I like that framing a lot @huard. I think there is some analogy to be made with the approaches to democratize science.
Ideally we want to have it both ways: agile, short, highly experimental projects, which ultimately feed into more sustainable, long-term infrastructure with broader governance and more secure funding that could serve more people with the tools to do meaningful science.
Indeed @yuvipanda - others and I put a lot of time (funded by ~3-year NSF and NASA grants) into the pangeo-binder, and it was running entirely on credits (at the $10k-100k level). My impression was that those credits came easily from GCP and AWS, who seem to understand (whether altruistically or not) that demonstrations of large-scale compute against large open data archives are valuable! I also personally think the binderhubs were very valuable for workshops and for getting people quickly running distributed dask clusters with minimal fuss.
But eventually the credits run out and someone has to keep asking for more - curious how mybinder.org has managed this? Also, even with a layer of GitHub authentication, we had burner accounts running bitcoin mining. I think that to be sustainable a Pangeo binder reboot would need:
1. Authentication.
2. User-level limits on cluster size.
3. Some sort of automated bitcoin-mining detection.
4. The ability to select a specific cloud datacenter to deploy to.
5. A mechanism by which groups can easily contribute credits or operational costs (hypothetically, if 2i2c were to run the binderhub, could a researcher at X university easily apply credit codes to the account or contribute some $ to keep it rolling?).
A thought, but maybe we could be more efficient with credits? My experience running these things before is that most of the credit usage was waste, with people leaving things on for a long time without using them.
I’ll bet that if Pangeo got a lot better about using ephemeral computation, only turning things on when they’re needed, costs could drop by 1-2 orders of magnitude (this is our experience internally, at least).
Obviously that’s hard and would require effort in upstream projects (JupyterHub, Dask Gateway) but I mostly mention it to state that the sustainability problem can have technical as well as policy solutions. It’s not necessarily “someone else’s problem”.
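To make the ephemeral-computation idea concrete, here is a minimal sketch of the decision at the heart of any idle culler: given when each resource was last active, pick the ones that have been idle too long. The resource names, timestamps, and one-hour threshold are made up for illustration. A real deployment would more likely rely on existing tools (for example the jupyterhub-idle-culler service) than on hand-rolled code like this.

```python
from datetime import datetime, timedelta, timezone


def select_idle(last_activity, now, max_idle=timedelta(hours=1)):
    """Return the names of resources idle for longer than max_idle, sorted.

    Hypothetical helper: last_activity maps a resource name to the
    timezone-aware datetime at which it was last seen doing work.
    """
    return sorted(
        name for name, seen in last_activity.items() if now - seen > max_idle
    )


# Made-up example data: one active notebook, two forgotten resources.
now = datetime(2023, 12, 6, 12, 0, tzinfo=timezone.utc)
last_activity = {
    "alice-notebook": now - timedelta(minutes=5),
    "bob-dask-cluster": now - timedelta(hours=3),
    "carol-notebook": now - timedelta(hours=26),
}

to_stop = select_idle(last_activity, now)
print(to_stop)  # ['bob-dask-cluster', 'carol-notebook']
```

The hard part is not this comparison but getting a trustworthy "last active" signal out of the upstream projects, which is where the effort mentioned above would go.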