What's Next - Cloud - Partner-managed Infrastructure

This is part of a follow-on conversation from the “What’s next for Pangeo” discussion that took place 2023-12-06. It is part of the overarching Cloud Infrastructure topic.

Pangeo could engage with partners who provide cloud infrastructure as a service. To make this concrete, there are providers like the following:

  • 2i2c
  • Coiled
  • Planetary Computer
  • Nebari (from Quansight)
  • Earthmover

There are probably two topics here: “Should we?” and “How best can we?”. These relationships can be transformative to Pangeo in both positive and negative ways. We should probably be intentional and make sure that we achieve what we want as a community.

3 Likes

I’m obviously biased here, but some thoughts from me.

<takes off community hat, puts on opinionated hat>

Should we

I’m going to start with a comment from @rabernat during the meeting:

A key observation was that the same infrastructure that serves the research community is also desirable to the private sector. In contrast to on-prem hardware based infrastructure, cloud infrastructure is infinitely replicable and scalable. My conclusion is that it makes sense to commoditize the Pangeo Infrastructure and procure it “as a service” from commercial providers, rather than running a bespoke service. Fortunately there are many such service providers emerging (2i2c, Coiled, and of course my own startup Earthmover). The question of “who is going to pay for all these cloud computing environments” is now more-or-less purely a political and budgetary question, not a technical one.

And then my own thoughts:

As the person who both set up the first pangeo cloud infrastructure and then subsequently made a team to make a product to do this properly, I’ll state that there is a night-and-day difference both in terms of …

  1. The quality of service a team/product/company can provide
  2. The efficiency with which we can provide it

There’s no way I would ever go back to managing a single hub at a time funded by a sequence of grants. It’s also just far cheaper to amortize the cost of maintaining this stuff across many teams. (~$1000 a month vs the cost of an FTE)

How should we?

I think that Pangeo could provide value to both users and service providers by matching them effectively. This could be as simple as links in documentation that give unbiased information and reviews.

From a Coiled perspective we switched from thinking mostly about technology to thinking mostly about marketing a few months ago. This is because, for this community at least, the cloud deployment technology is pretty much done, but no one knows about us yet. There’s a lot of good to be done, I think, in communication.

Concretely, I’d love to find good ways to include Coiled (and other service providers) in Pangeo resources like the documentation in a way that felt natural and comfortable.

</takes off for-profit hat, puts back on community hat>

2 Likes

Will just note that while this may be true, it’s far from a solved problem. Can we think about partners not just in terms of technical partners but in terms of funding partners?

Is there a US agency that really ought to be funding some small level of public access, perhaps working with an existing organization to do the technical part efficiently?

Can we think bigger than a series of grants to advocating for a sustained program?

1 Like

Can we think worldwide and not focus on US only?

4 Likes

Can we think worldwide and not focus on US only?

Absolutely. I’m just thinking that US agencies might be good potential funders, but access for users worldwide is a high priority for me.

2 Likes

Things are cheap

If you’re doing things efficiently in the cloud (a huge if) large computations often cost pennies. (See Ten Cents per Terabyte. How much to run your workflow in the cloud? | Coiled)

For the sum of all people just getting started I’ll bet that costs would be below $1000 per month. Honestly, I suspect that we could manage this from corporate donations short term. (heck, I’ll chip in)

Institutions have budget

As people become successful it’s usually not impossible for them to get access to a cloud account from their employer / university / institute.

Institutions seem to be used to this today.

This doesn’t need to be a big problem

As a result, this doesn’t need to be a problem that this community even thinks about. We don’t need to go to NSF or ESA or NASA or the EU or whomever and ask for money. Each group maybe already has the money they need to do their work. There might not be a need to think about finding some central pool to cover this.

This problem may not be big enough to bother an agency with. Rather than “think bigger than a series of grants” maybe we think smaller? Cloud is cheap when done well.

This is increasingly true in eg the US, but less likely for early career researchers, even there. And it’s just not true at all in vast other parts of the world. Scientists in the global south can be severely resource-constrained, and they do not already have this kind of access. Or maybe you know of opportunities that are under-utilized?

In terms of how cheap it is, that must be presuming some kind of shared service? If everyone is paying to spin-up a cloud server instance for themselves it can add up very quickly. So again, maybe people need to be made aware of how to gain access?

I’d be thrilled to be wrong about the scale of resources required. I’m curious to hear some estimates of just how expensive it would be to provide some basic data storage and compute for, let’s say, mere hundreds of developing world climate researchers. Say they all have high resolution regional simulations that aren’t already on one of the cloud services.

It’s also okay if this is out of scope, and I’ll retreat back to the democratization thread.

1 Like

Scope

Yeah, the intent of this topic was to discuss the relationship between Pangeo and cloud infrastructure partner institutions (like 2i2c, Coiled, …), and it sounds like you’re primarily interested in working on the relationship between Pangeo and funding partners. I support this endeavor, but think that it would probably be better suited to a different topic so that people can focus properly on each separate topic.

But let’s use the example

But if you’ll permit me, I’ll use your example here to demonstrate the value of cloud infrastructure partners:

I’m curious to hear some estimates of just how expensive it would be to provide some basic data storage and compute for, let’s say, mere hundreds of developing world climate researchers

In terms of how cheap it is, that must be presuming some kind of shared service? If everyone is paying to spin-up a cloud server instance for themselves it can add up very quickly

Yeah, I would 100% not do this. I would recommend that they use ephemeral compute on an as-needed basis. They might spin up 100 machines, but only for as long as it takes to process their data (often only a few minutes). This ends up being very cheap if you do everything correctly.

If you do everything correctly then processing 1 TB of cloud-based data costs around $0.10. So, purely thinking about compute (I’ll leave storage to others) let’s say that we had hundreds of researchers who all wanted to process 10 TB monthly (this is an overestimate in my experience) then this would cost hundreds of dollars monthly. I’m happy to pay for that personally.

It’s worth remembering though that the cloud doesn’t cost per-user, it costs based on usage. It’s far more likely that 90% of those researchers need only a little compute, but that some of those researchers need a lot more. Where people get stuck on costs is if they did what seemed natural to you, spin up static machines on a per-user basis. This is how people get to $10,000+ cloud bills.

The primary driver of cost is misuse, not overuse.
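
The back-of-the-envelope arithmetic above can be written out as a short sketch. The $0.10 per TB figure and the 10 TB monthly workload come from this post; the headcount of 300 (standing in for “hundreds”), the $0.10 hourly price for an always-on machine, and the 730-hour month are illustrative assumptions, not quotes from any provider.

```python
# Back-of-the-envelope comparison: usage-based compute vs. one static machine per user.
COST_PER_TB_USD = 0.10          # from the post: ~$0.10 to process 1 TB when done correctly
TB_PER_RESEARCHER_MONTH = 10    # from the post (described there as an overestimate)
N_RESEARCHERS = 300             # assumption: stands in for "hundreds"
STATIC_VM_USD_PER_HOUR = 0.10   # hypothetical price of a small always-on machine
HOURS_PER_MONTH = 730           # assumption: average hours in a month


def ephemeral_monthly_cost(n_researchers, tb_each, cost_per_tb):
    """Cost when machines exist only while data is being processed."""
    return n_researchers * tb_each * cost_per_tb


def static_monthly_cost(n_researchers, usd_per_hour, hours):
    """Cost when every researcher keeps one machine running all month."""
    return n_researchers * usd_per_hour * hours


if __name__ == "__main__":
    ephemeral = ephemeral_monthly_cost(
        N_RESEARCHERS, TB_PER_RESEARCHER_MONTH, COST_PER_TB_USD
    )
    static = static_monthly_cost(
        N_RESEARCHERS, STATIC_VM_USD_PER_HOUR, HOURS_PER_MONTH
    )
    print(f"ephemeral: ${ephemeral:,.0f} per month")
    print(f"static:    ${static:,.0f} per month")
```

Under these assumptions the usage-based total is roughly $300 per month, while the per-user static setup is roughly $21,900 per month. The gap comes from idle hours, not from the amount of data processed.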

Partners help with efficiency

The back-of-the-envelope calculation above is explored more in this blogpost. It assumes that people do everything right, which is almost never the case without a lot of help. Partners help, and improve cost efficiency. Of course though, you still have to pay us :slight_smile:

1 Like

I’m glad to see this conversation happening! I have a lot of thoughts, but perhaps far too many.

I think pangeo really succeeded by being very practitioner driven, rather than being purely driven by the people building the tools or running the infrastructure (although often there was a lot of overlap between those groups). I’d love to hear more from current practitioners (across many privilege levels), and understand what their unsolved technical & social problems with the cloud are. It’s definitely different now than where it was 5 years ago, and we need to re-adjust so we can continue to be helpful. I’m not sure what a systematic way to find this out would be, but if anyone has ideas I’d love to hear them.

I also think it’s really important to be intentional about when (and why) to make trade-offs about openness and convenience. With a lot of input from many of the people involved in this discussion (particularly @mrocklin!), I wrote up “Freedoms for Open Source Software users in the Cloud”, which I think has served me and many others well. It’s been 5 years since, and many lessons have been learnt I am sure. I would love to do some kinda follow-up. It’s really important to me that we don’t end up in another matlab-like situation, while also making sure we don’t “purity-test” our way into irrelevance. I believe a good balance can be struck here.

AGU happens in one of the most expensive cities in the world and is an expensive conference by itself, so it’s not the most representative place for this. But I’m going to be hanging around in the evenings (not at AGU itself) and hope to see many of you there.

1 Like

Include eratos.com and posit.co, both provide crosslang compute platforms (at least R and Python) :pray:

Nice thoughts Yuvi. Thanks for sharing. Are there specific actions you think you or others should take to improve things here?

How do we envision providing unbiased information? The trade off for efficiency is self-sufficiency and my concern is users becoming overly dependent on proprietary solutions. To be frank, reading this quote is especially concerning:

How to start finding a balance needs to involve bringing in more inclusive voices and some self-reflection on the perspectives of the dominant voices in these threads. Solutions rooted in scale, growth, and profit will not solve the crisis that altruistically is bringing us together in pangeo.

1 Like

my concern is users becoming overly dependent on proprietary solutions

I, and I’m sure many others, share this concern. However in order to discuss this productively I think we need to make a few distinctions about what we mean by “users” and “solutions” in this context:

Software vs Infrastructure Solutions:

The “core” of the Pangeo software stack is open-source in the sense of “Wide Open”. In particular Xarray and Zarr are multi-stakeholder, community-governed, and not managed by any one private company. Same for Jupyter/IPython/Numpy/SciPy/Pandas AFAIK.

But these are all exclusively software projects - using them and maintaining them takes people-hours, but those costs aren’t automatically incurred, they don’t grow linearly with the number of users, and there is rarely true urgency requiring a maintainer to be “on-call” if something breaks. Infrastructure isn’t like that - somebody eventually has to pay for the hardware compute/storage costs, those costs scale with number of users, and if something breaks it’s an urgent problem, not one that can wait until the next release. This distinction is important to keep in mind because it’s unlikely that for example the organising model that xarray uses would work for providing reliable infrastructure solutions.

Pangeo has multiple groups of users, who have different needs

Pangeo software tools (but not necessarily the infrastructure) are used today by:

(1) university academics,
(2) state-funded researchers (e.g. at National Labs),
(3) individual users globally (e.g. through outreach initiatives or just in the wild) and
(4) private companies / non-profit corporations.

All of these groups have different needs, different priorities, different politics, different funding situations, and different preferred relationships with the people managing their infrastructure. It is unlikely that one solution will fit all, and I agree we should push back against the idea of one solution for all users. Luckily that’s okay, because so long as the core pangeo stack is FOSS, it can be used by multiple infrastructure providers.

How do we envision providing unbiased information?

This would require care, but I don’t think it is impossible. Clearly if one person from one company who is selling one solution wrote the recommendations, it would be a problem. But Pangeo already has many people from different constituencies who would be qualified to weigh in, a steering committee and (now) an independent legal structure.

There is space for many infrastructure solutions

There are multiple possible models for infrastructure solutions. Matt (and Ryan, who is the source of the original comment you quoted, Brianna) are proposing one model, which IIUC essentially uses capital from private companies to fund the development and maintenance of infrastructure that could perhaps also serve some people in the other 3 groups. There is clearly a market and a place for this model. However, “we” in the rest of the community are still free to reject this model for any reason.

What other models are possible? I can think of: federally-organised / philanthropically-funded / volunteer-driven as broad categories. I’m personally interested in thinking about which of these models can best serve the wider global scientific community, with diversity and inclusion as an explicit aim. We have some nascent discussion of that question in the What's Next - Democratization - Science/Community thread that Naomi linked.

Specifically what are we afraid of?

It also helps to distinguish what scenario we are concerned about in order to avoid it, rather than just Private == Bad. Are we concerned about vendor lock-in? Losing access to our data? Biased driving of software development priorities? Non-democratic governance? Anti-competitive incentives stifling further innovation? If so I would want to openly discuss these concerns specifically, so choices can be made between offerings depending on priorities, or to ameliorate these risks. (EDIT: This last point is really another way of saying the same thing that Yuvi’s blog post said.)

1 Like

Great question, @mrocklin. I think there’s actually something super valuable you can do with your Coiled CEO hat on, because:

  1. You have a lot of prior experience doing open source stuff, and doing it well
  2. Coiled (AFAIK) continues to do a lot of open source development (for which I, and many others, am very grateful!)
  3. But coiled also has many proprietary components, particularly the (new to me) coiled package (I can’t find the source for it at least, but please do correct me if I’m wrong). A lot of the infrastructure itself you run is proprietary, and not, say, an instance of dask-gateway. (I want to explicitly state that this isn’t a negative thing per-se! We would all be sitting in basements fighting over datacenter space if it were)
  4. Coiled has also taken VC funding, and it’s also been a few years so I’m sure you’ve had a good amount of experience navigating this space.

These are all experiences that you have that I don’t, and given the large amount of trust the general OSS community seems to have in you, that puts you in a very unique position to answer the following question:

“What projects get to be open source, and what projects get to be proprietary?”

I think it’s a very important line to draw, and I’d love to hear the process you use to draw it and make that distinction! As I said earlier, trade-offs with openness and convenience need to be very intentional - but I also understand the necessity for this trade-off. So I’d love to hear:

  1. What your experience has been making this tradeoff?
  2. How that has evolved over the course of Coiled’s existence?
  3. Where that line is now?
  4. What process you use for evolving that line over time?

The order of importance for me really is 3, 4, 2 and 1.

I also think of Coiled as sort of ‘early’ in the current generation of VC funded operators in the data science space (which itself is perhaps a cycle behind the ‘infrastructure’ space broadly here in terms of VC funding), so your experience here, openly shared, would be very valuable to two groups of people:

  1. Folks trying to make choices about how deeply to depend on proprietary VC funded tech. Again, I want to clarify this isn’t always a bad thing (see also: docker). But I want this to be an intentional and informed choice folks can make. I wish the absolute best for everyone at coiled, but many VC funded startups do fail, and it’s important to make an informed trade-off of the risks here. Even in unbounded success, there’s always the unintentional Mathworks trap that we can fall into, and having external documents that can be referenced will aid folks (including myself) in making informed choices, rather than choices based on fears projected from elsewhere (let’s just say interactions between VCs and open source have not been great for the last couple years).
  2. Other companies in this space that may raise funding, as I’m sure there are many more coming here. The cryptocurrency hype cycle kinda left us alone but the AI hype cycle will definitely not :slight_smile:

I appreciate you engaging here, and I hope this is a specific enough ask. I also recognize the reality that your time is constrained and limited, so I also hope it is worthwhile enough for Coiled to answer these questions. It would definitely help me!

Separate from this, and not specifically directed at Matt: I recommend reading George Orwell’s “Homage to Catalonia” (published 1938) for a good description of what happens if you ‘purity test’ for ideology, and whatever is going on now with Hashicorp / Elastic / Mongo / Element for what happens if you instead focus purely on convenience / investment ROI.

1 Like

Hey Yuvi!

I appreciate the specific questions. I find that specific questions are helpful at getting folks to converge to a specific outcome. I’ll restate my suggestion from above in order to reset the stage a bit:

I think that Pangeo could provide value to both users and service providers by matching them effectively.

With that in mind, I like @TomNicholas 's recent comment:

Pangeo has multiple groups of users, who have different needs

I think that people who want to build and maintain their own system have some good tools at their disposal. They could always be better, but I’m proud of what we’ve delivered over the years for this class of users.

Additionally, I want people for whom managed solutions are a good fit to find those managed solutions. This isn’t everybody, but my experience is that it’s a heck of a lot of people. There’s generally a tradeoff of having full control over your tech stack and having a lot of safety but also a large technical / maintenance / inaccessibility burden, or having little control over your tech stack, but living in a magical world of sparkles where everything purports to be easy :sparkles:.

On to specific questions:

What projects get to be open source, and what projects get to be proprietary?

I’m not sure I understand this question. A developer of a project gets to choose if they develop that project in an open source way, or if they choose to retain ownership of the code.

Instead I think that the question is of the user

what do you care about in the software that you use?

Different user groups care about very different things. For example, users who care deeply about multi-decade-long archival file formats will probably choose HDF over Zarr, even though Zarr has plenty of lovely features. The choice of correct technology is, as you know, highly dependent on the user group. There are many factors at play.

For open vs proprietary, some indicators I see for open source over proprietary are …

  1. Cost sensitivity
  2. Avoid vendor lock in because we plan to depend on this code for a while
  3. We’re building other systems and want transitive openness
  4. Attract collaborative developers
  5. Knowing what you’re depending on for security or accuracy reasons, etc…

Dask was built with this in mind, and I think it was the right choice in its role as OSS infrastructure intended to be used by lots of other OSS projects.

However, there are also lots of users who really don’t care about these things. These are users for whom it is far cheaper to buy a product than to build something themselves (human time is very expensive), who would rather be locked-in to a professionally managed system rather than locked-in to maintaining their own system, who aren’t building infrastructure for other people to depend on, and are mostly working alone. The risk of a company failing is far lower than the risk that they don’t get something done at all because a project took too long. They’re way more focused on doing some specific analysis than on building a clean stack.

My experience is that the farther you get away from core technology, the less people care about the things that we care about, and the more they just want something that gets the job done, even if it is proprietary. There are a lot of these users and I think we can help them.

Happy anniversary of joining the pangeo discourse, matt! (Discourse put a cake next to your name!)

Indeed, and we need to help people understand the trade-offs between choosing open source tech and proprietary tech. As a sidebar, I also want to draw a distinction between open source tech and self-hosted tech - open source tooling doesn’t necessarily mean you have to self-host it (only that it is possible to), as evidenced by the large number of companies that basically make money off of hosting open source stuff. You can also self-host things that aren’t open source (hello Oracle). This was also one of the major touch points of the blog post I had linked to earlier - don’t run your own S3, but do use an abstraction (like fsspec) so you don’t concede the power to choose to the provider.
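
To make the abstraction point concrete, here is a minimal standard-library sketch of the pattern. The `Store` protocol, `LocalStore`, `InMemoryStore`, `save_result` and `load_result` are hypothetical names invented for this illustration; they are not fsspec's API. In practice fsspec plays this role, putting local disk, S3 and other backends behind one interface.

```python
from pathlib import Path
from typing import Dict, Protocol


class Store(Protocol):
    """Hypothetical minimal storage interface that science code depends on."""

    def write_bytes(self, key: str, data: bytes) -> None: ...

    def read_bytes(self, key: str) -> bytes: ...


class LocalStore:
    """Backend that keeps objects as files under a root directory."""

    def __init__(self, root: str) -> None:
        self.root = Path(root)

    def write_bytes(self, key: str, data: bytes) -> None:
        path = self.root / key
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_bytes(data)

    def read_bytes(self, key: str) -> bytes:
        return (self.root / key).read_bytes()


class InMemoryStore:
    """Backend that keeps objects in a dict; stands in for any other provider."""

    def __init__(self) -> None:
        self._blobs: Dict[str, bytes] = {}

    def write_bytes(self, key: str, data: bytes) -> None:
        self._blobs[key] = data

    def read_bytes(self, key: str) -> bytes:
        return self._blobs[key]


def save_result(store: Store, key: str, text: str) -> None:
    """Science code sees only the interface, never a specific provider."""
    store.write_bytes(key, text.encode("utf-8"))


def load_result(store: Store, key: str) -> str:
    return store.read_bytes(key).decode("utf-8")
```

Because `save_result` and `load_result` only touch the interface, moving to a different provider means swapping the object passed in, not rewriting the analysis.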

I understand there’s a lot of people who don’t care about the proprietary nature of what they use, but I do think it’s fundamentally important to open science that it’s an explicit trade-off made intentionally. It’s hard to do open science in matlab, for an extreme example. This is also a place where the trade-off is different for companies vs folks trying to do ‘open science’, and also different in various stages of the scientific process (prototyping and ‘playing’ around has different needs than putting source out on a paper). So I’d love to know the answers to the questions I posted.

I’d also love to hear from others, particularly open science practitioners, about whether the questions I posed would be useful to them or not.

Sure, and in the specific case of Dask, science code is well protected. Code often looks like this:

import coiled

cluster = coiled.Cluster(
    n_workers=100,
    worker_memory="128 GiB",
    spot_policy="spot_with_fallback",
    region="us-west-2",
)
client = cluster.get_client()

import xarray
# ... do science stuff

and then later on if you want to switch to, say, an HPC system, you can do that.

# import coiled
from dask_jobqueue import PBSCluster
# cluster = coiled.Cluster(...)
cluster = PBSCluster(...)
client = cluster.get_client()

import xarray
# ... do science stuff

My guess is that this meets your personal standard of interfaces. If someone was both able and motivated to build a Coiled-like thing then they could do so and move their science code over to that system.
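
One way to keep that switch to a single line is to hide the cluster choice behind a small factory. This is a sketch, not part of Coiled or Dask: `make_cluster` and its `backend` strings are hypothetical, and keyword arguments are passed straight through because each backend needs its own settings. It assumes `dask.distributed` is installed for the local case, and `coiled` or `dask_jobqueue` for the other two.

```python
def make_cluster(backend: str, **kwargs):
    """Return a Dask cluster for the named backend (hypothetical helper).

    Imports are done lazily so that only the chosen backend's package
    needs to be installed.
    """
    if backend == "local":
        from dask.distributed import LocalCluster

        return LocalCluster(**kwargs)
    if backend == "coiled":
        import coiled

        return coiled.Cluster(**kwargs)
    if backend == "pbs":
        from dask_jobqueue import PBSCluster

        return PBSCluster(**kwargs)
    raise ValueError(f"unknown backend: {backend!r}")


# Example usage (requires the relevant package and credentials):
# cluster = make_cluster("coiled", n_workers=100, region="us-west-2")
# client = cluster.get_client()
```

The science code after `client = cluster.get_client()` stays the same regardless of which branch was taken.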

Coiled provides infrastructure. It also provides a ton of stuff that the OSS systems just don’t deliver, but which users report as being really really important. Some examples:

  1. Cost tracking and cost limiting
  2. Software and credential synchronization
  3. Historical tracking of computations and birds-eye view for experts to help support novices

(happy to demo stuff next week if you’re around at the Coiled/Pangeo booth)

OSS infrastructure solutions today are not always all that accessible to unsophisticated users, at least when we start getting to the arena of large scale computation where I generally play.

So I’d love to know the answers to the questions I posted.

Sure. I’m not sure I have really well thought out responses here. The general answer is that this isn’t top of mind for me.

  1. What your experience has been making this tradeoff?

That’s too broad. I’m going to pass on this for now. But more specific questions would be welcome.

  2. How that has evolved over the course of Coiled’s existence?

I’m not sure we’ve changed much over time

  3. Where that line is now?

I’m not sure that it’s a line. In general stuff that goes into Dask is OSS. This tends to be science-y type algorithms. Stuff that has to do with the 24x7 cloud service we run is proprietary because this is what people pay for and because it wouldn’t really do anyone any good if it was OSS.

  4. What process you use for evolving that line over time?

I make decisions based on the information available to me at the time.

Hey everyone, thank you all for the very interesting discussion. I truly appreciate all the perspectives here and, like @yuvipanda, I fear I have way too many thoughts on this, so let me try to focus on one main point.

Are we concerned about vendor lock-in? Losing access to our data? Biased driving of software development priorities? Non-democratic governance? Anti-competitive incentives stifling further innovation?

I think that the points I am most concerned about are the biased driving of software development priorities and the consequences of donation-based open science.

I suspect that decisions made in this broader discussion have far reaching influence on the pangeo community as a whole and I am struggling to cleanly separate this from the democratization of science. I have posted this over in the other thread to keep this shorter.

My main concern here is how the arguments for efficiency seem to strongly push a donation based approach to foster open science. @mrocklin said above…

I’m happy to pay for that personally.

While I think this is a very generous offer, I am skeptical that primarily relying on individual/corporate donations can lead to lasting change. While donating resources is fairly straightforward (I have often said the same thing as Matt above in the past), I really believe we cannot stop there as a means to open up science to more people around the globe. I think this runs the risk of creating a strong dependency and power imbalance, which has historically often created problems for developmental aid in general.

That being said I am in no way generally against strengthening/formalizing how the pangeo community engages with partners, but I think we need to be conscious about resulting power dynamics, and how to navigate them. Additionally I think it is paramount that such a decision is made in a forum/body where the people who do not have access to science tools at the moment have a seat at the table (as @briannapagan and @yuvipanda have already emphasized)!

2 Likes

enjoying this discussion, thanks!

The “correctly” phrase up there really resonates with me, how do we provide that guidance to an individual - I don’t think we can without a year or two of spinup paying for a commercial platform that handles the devops for us until we have actual staff or abstracted expertise for it.

I feel like I agree with everyone’s comments here – I guess that’s why I appreciate this community so much! :blush:

Here’s my take, based on helping groups and organizations who work with large simulation data on-board to open-source scalable Cloud solutions (using the Pangeo stack). After running initial training using Nebari, I present them with three options to sustain their own infrastructure:

  1. Coiled
  2. 2i2c
  3. Nebari

To figure out what makes the most sense for them, I ask questions like:

  • Do you require users to use only specific cloud providers (e.g. USGS only AWS, NATO only Azure)?
  • Do you intend to use only cloud credits or do you intend to spend actual $$?
  • Is having a JupyterHub important?
  • Is having horizontal compute in multiple regions (or multiple clouds) important?
  • Is 24/7 infrastructure important or are you okay with occasional downtime?
  • Do you have local computer admin folks who are interested in managing users and configuring infrastructure?

Depending on their answers, I try to advise them toward specific infrastructure solutions. Some examples:

  • If folks don’t value a JupyterHub that much, have $$ to spend and value multi-region, then I suggest Coiled as the obvious choice.

  • If they value a JupyterHub, don’t want to manage anything and have $$ to spend, then I suggest 2i2c as a great solution.

  • If they only have research credits, have someone with some modest tech skills willing to do some admin and don’t mind a few days of downtime per year, I suggest Nebari as a great choice.

And for both 2i2c and Nebari, if they want to add compute in regions other than where their Dask Gateway cluster is running, I suggest adding Coiled.
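
The examples above can be written down as a small rule-of-thumb function. This is only a sketch of those three examples, not a complete decision procedure: the `Needs` fields and `suggest` are hypothetical names, “don’t want to manage anything” is approximated as having no local admin, and cases the examples do not cover return an empty list rather than a guess.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Needs:
    wants_jupyterhub: bool     # is having a JupyterHub important?
    has_budget: bool           # actual money to spend, not only cloud credits
    needs_multi_region: bool   # compute in multiple regions or clouds
    has_local_admin: bool      # someone willing to manage users and infrastructure
    tolerates_downtime: bool   # okay with a few days of downtime per year


def suggest(needs: Needs) -> list[str]:
    """Map answers to the providers named in the examples above."""
    # No strong need for a hub, money to spend, multi-region matters.
    if not needs.wants_jupyterhub and needs.has_budget and needs.needs_multi_region:
        return ["Coiled"]

    if needs.wants_jupyterhub and needs.has_budget and not needs.has_local_admin:
        picks = ["2i2c"]
    elif not needs.has_budget and needs.has_local_admin and needs.tolerates_downtime:
        picks = ["Nebari"]
    else:
        return []  # not covered by the examples; needs a conversation

    # For 2i2c and Nebari, extra regions are handled by adding Coiled.
    if needs.needs_multi_region:
        picks.append("Coiled")
    return picks
```

For instance, a group with only research credits, a willing admin, tolerance for some downtime and a need for compute in a second region would get Nebari plus Coiled from this sketch.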

2 Likes