I regularly monitor the resource usage on our Google Cloud clusters, particularly ocean.pangeo.io, which my group uses for our daily science work. This hub lives in the dev-pangeo-io-cluster Kubernetes cluster, and its daily cost normally oscillates between $25 and $75. However, in the past week I noticed a spike, with costs shooting up to $175 per day!
The extra cost comes from preemptible nodes, which indicates it is driven by dask usage. This is confirmed by the fact that there are no anomalies in jupyter pod creation:
However, if I look at dask worker pod creation, I see a huge recent spike.
If I divide the number of dask worker pods per day by the number of dask clusters per day, I get the average number of workers per cluster.
Before September, this was normal, with < 50 workers per cluster. But now we are seeing clusters with huge numbers of workers.
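For reference, here is a minimal sketch of that calculation, assuming a hypothetical export of dask worker pod records with a creation timestamp and a cluster identifier (the file and column names below are made up, not our actual data):

```python
import pandas as pd

# Hypothetical export: one row per dask worker pod, with its creation time
# and the dask cluster it belonged to (column names are illustrative).
pods = pd.read_csv("dask_worker_pods.csv", parse_dates=["created"])

# Worker pods created per day, and distinct dask clusters per day.
daily = pods.set_index("created").resample("D")["cluster_id"].agg(["count", "nunique"])
daily.columns = ["worker_pods", "clusters"]

# Average number of workers per cluster, per day.
daily["workers_per_cluster"] = daily["worker_pods"] / daily["clusters"]
print(daily.tail(10))
```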
This situation is not sustainable: at this burn rate we will run out of credits much too soon. Furthermore, it reveals several problems with our current configuration:
- Users are able to create an essentially unlimited number of dask workers, up to the limit of the nodepool size (see the sketch after this list).
- We have provided no guidance to our users about best practices.
- We have no way to contact our users (e.g. an email list) to share such information.
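To make the first point concrete, this is roughly what it looks like today (a sketch assuming the hub's current dask-kubernetes setup; the worker count is arbitrary):

```python
from dask_kubernetes import KubeCluster

# Uses the worker pod template configured on the hub.
cluster = KubeCluster()

# Nothing stops a request like this short of the nodepool size limit;
# it spins up hundreds of worker pods and the preemptible nodes to back them.
cluster.scale(400)
```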
It’s clear that this is not the users’ fault. We have no systems in place to monitor and limit usage, and we lack the documentation and training materials needed to inform users effectively.
@rabernat, thanks for bringing this up and for updating our analysis plots.
On the dask side, I think we should continue to push forward on dask-gateway so we can implement user controls. This puts us on a path to solve all the problems you listed above. If we move quickly, I think we can start transitioning people to gateway by next week sometime.
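For users, day-to-day work under gateway would look roughly like this (a sketch of the dask-gateway client API; the important difference is that the gateway server can enforce per-user limits on clusters, cores, and memory behind the scenes):

```python
from dask_gateway import Gateway

gateway = Gateway()                    # address and auth come from the hub's dask config
cluster = gateway.new_cluster()        # provisioned and tracked by the gateway server
cluster.adapt(minimum=2, maximum=20)   # scale within whatever limits the server enforces
client = cluster.get_client()
```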
On pod packing, we have a few options. First, and probably most realistically, we should be more specific about node pool configurations and pod resource allocations; the current setup was put together in a fairly ad hoc way. The other thing we can look into, which mostly applies to GKE clusters, is node pool auto-provisioning. This would essentially let us have many node pools that are configured based on requests at the k8s level: we wouldn’t have to manage packing pods onto nodes ourselves; k8s would do it for us.
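To illustrate the packing arithmetic (the numbers below are illustrative, not our actual node pool or pod settings), worker requests need to be sized against a node's allocatable resources, otherwise we strand CPU or memory on every node:

```python
# Illustrative packing check; numbers are made up, not our actual configuration.
# On GKE, part of each node's CPU/memory is reserved for the system, so requests
# should be sized against the node's *allocatable* resources, not the machine type.

node_allocatable_cpu = 7.91       # e.g. an 8-vCPU node after system reservations
node_allocatable_mem_gb = 45.0    # e.g. a 52 GB highmem node after reservations

worker_request_cpu = 0.9
worker_request_mem_gb = 6.0

fit_by_cpu = int(node_allocatable_cpu // worker_request_cpu)        # 8
fit_by_mem = int(node_allocatable_mem_gb // worker_request_mem_gb)  # 7
workers_per_node = min(fit_by_cpu, fit_by_mem)                      # memory is the binding constraint

print(workers_per_node, "workers per node,",
      round(node_allocatable_cpu - workers_per_node * worker_request_cpu, 2),
      "vCPU left idle per node")
```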
FWIW, I’ve reworked binder.pangeo.io’s notebook pool to be slightly better packed and to use preemptible nodes. This should help cut down binder’s share of the costs.
Has anyone done another analysis of ocean.pangeo.io, or of other hubs/binders hosted on AWS, since dask-gateway was integrated into pangeo’s clusters? Cc’ing @scottyhq in case he has some numbers on the AWS-hosted hubs.