I regularly monitor resource usage on our Google Cloud clusters, particularly ocean.pangeo.io, which my group uses for our daily science work. This hub lives in the dev-pangeo-io-cluster Kubernetes cluster. Its cost normally oscillates between $25 and $75 per day. However, in the past week I noticed a spike, with costs shooting up to $175 per day!
The cost is from preemptible nodes, indicating this is coming from dask usage. This is confirmed by the fact that there are no anomalies in jupyter pod creation:
However, if I look at dask worker pod creation, I see a huge recent spike.
If I divide the number of dask worker pods per day by the number of dask clusters per day, I get the average number of workers per cluster.
Before September, this was normal, with < 50 workers per cluster. But now we are seeing clusters with huge numbers of workers.
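The division described above can be sketched in a few lines of Python. This is only an illustration: the function name, the day labels, and the counts are hypothetical and are not the real ocean.pangeo.io numbers.

```python
def workers_per_cluster(worker_pods_per_day, clusters_per_day):
    """Average number of dask workers per dask cluster, keyed by day."""
    averages = {}
    for day, n_workers in worker_pods_per_day.items():
        n_clusters = clusters_per_day.get(day, 0)
        if n_clusters > 0:  # skip days with no clusters to avoid dividing by zero
            averages[day] = n_workers / n_clusters
    return averages


# Hypothetical daily counts, for illustration only.
worker_pods = {"typical-day": 400, "spike-day": 3000}
clusters = {"typical-day": 10, "spike-day": 12}

averages = workers_per_cluster(worker_pods, clusters)
for day, avg in averages.items():
    print(f"{day}: {avg:.0f} workers per cluster")
```

Note that an average hides the distribution: a single very large cluster on a day with many small ones would still pull the number up, so the per-cluster maximum is worth checking as well.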
This situation is not sustainable: we will run out of credits much too soon at this burn rate. Furthermore, it reveals several problems with our current configuration.
- Users are able to create essentially an unlimited number of dask workers, up to the limit of the nodepool size.
- We have provided no guidance to our users about best practices.
- We have no way to contact our users (e.g. an email list) to share such information.
- We have no way to track dask statistics to a specific user (see https://github.com/dask/dask-kubernetes/issues/173), so we don’t even know who the culprit is.
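The first item in this list is the one a hard cap would address. As a minimal sketch of the idea, independent of dask-kubernetes or dask-gateway, a guard like the following would clamp any scale request. The function name and the default of 50 are assumptions chosen for illustration (50 simply matches the "normal" level mentioned above), not an existing setting on our hubs.

```python
def capped_worker_count(requested, max_workers=50):
    """Return how many workers to actually launch, never exceeding the cap.

    `max_workers` is a hypothetical per-cluster limit, not a real hub setting.
    """
    if requested < 0:
        raise ValueError("requested must be non-negative")
    return min(requested, max_workers)


# A very large request is clamped; a modest one passes through unchanged.
print(capped_worker_count(300))
print(capped_worker_count(20))
```

A client-side check like this is easy to bypass, so in practice the limit would need to be enforced on the server side.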
It’s clear that this is not the users’ fault. We have no systems in place to effectively monitor and limit usage, and we lack sufficient documentation and training materials to effectively inform users.
I welcome ideas on what to do now.
Another possibility we should consider is whether we can reduce the baseline cost. The core pool does not appear to be very tightly packed:
@rabernat, thanks for bringing this up and for updating our analysis plots.
On the dask side, I think we should continue to push forward on dask-gateway so we can implement user controls. This puts us on a path to solve all the problems you listed above. If we move quickly, I think we can start transitioning people to gateway sometime next week.
On pod packing, we have a few options. First and probably most realistically, we should be more specific about node pool configurations and pod resource allocations. The current configuration was put together in a fairly ad hoc way. The other thing we can look into, and this mostly applies to GKE clusters, is node pool auto-provisioning. This would essentially let us have many node pools that are configured based on requests at the k8s level. We wouldn’t have to manage packing pods into nodes ourselves; instead, we’d let k8s do it.
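One way to put a number on how tightly a pool is packed is the ratio of requested to allocatable resources. The sketch below does this for CPU only. The function name and all values are hypothetical, and a real check would read pod requests and node allocatable resources from the Kubernetes API and cover memory as well.

```python
def packing_ratio(pod_cpu_requests, node_allocatable_cpu, n_nodes):
    """Fraction of the pool's allocatable CPU that pods have requested."""
    total_allocatable = node_allocatable_cpu * n_nodes
    if total_allocatable <= 0:
        raise ValueError("pool must have some allocatable CPU")
    return sum(pod_cpu_requests) / total_allocatable


# Hypothetical pool: four pods spread over two nodes with 4 allocatable CPUs each.
ratio = packing_ratio([0.5, 0.25, 1.0, 0.25], node_allocatable_cpu=4, n_nodes=2)
print(f"CPU packing: {ratio:.0%}")
```

A low ratio on a pool that never scales down suggests that smaller machine types, or fewer nodes, would carry the same pods.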
Another persistent cost is the jupyter pool. While it is theoretically autoscaling, it seems to just be sitting at 1 node the whole time:
This is an n1-highmem-16 node, which costs about $20 per day. We should use smaller machine types and take more advantage of autoscaling.
FWIW, I’ve reworked the binder.pangeo.io’s notebook pool to be slightly better packed and to use preemptible nodes. This should help cut down binder’s share of the costs.
So it happened again today that someone launched 300 dask workers. ocean.pangeo.io crashed. We need to do something more radical here.
https://github.com/pangeo-data/pangeo-cloud-federation/pull/439 has a relatively simple fix, until we can get dask-gateway in place on all these hubs.
To this end, we need people to help out with the following PRs:
Get chartpress working on Dask-Gateway to automate chart publishing:
Adding Dask-Gateway to Pangeo Helm Chart:
Thanks for working on this, Joe!
Has anyone done another analysis of ocean.pangeo.io, or of other hubs/binders hosted on AWS, since dask-gateway was integrated into Pangeo’s clusters? CCing @scottyhq in case he has some numbers on hubs hosted on AWS.