I regularly monitor the resource usage on our Google Cloud clusters, particularly ocean.pangeo.io, which my group uses for our daily science work. This cluster lives in the
dev-pangeo-io-cluster Kubernetes cluster. Its daily cost normally oscillates between $25 and $75. In the past week, however, I noticed a spike, with daily costs shooting up to $175!
The extra cost comes entirely from preemptible nodes, indicating that it is driven by dask usage rather than notebook servers. This is confirmed by the fact that there are no anomalies in jupyter pod creation:
However, if I look at dask worker pod creation, I see a huge recent spike.
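These per-day creation counts are straightforward to derive from pod metadata. Here is a minimal sketch, assuming we have each pod's creation timestamp and a label distinguishing jupyter pods from dask workers; the record layout and the `component` label values are made up for illustration, and in practice the data would come from the Kubernetes API or audit logs:

```python
from collections import Counter
from datetime import datetime

# Hypothetical pod records; real ones would come from the Kubernetes API
# (e.g. the pod list serialized as JSON) or from cluster audit logs.
pods = [
    {"created": "2019-09-02T14:01:00Z", "component": "dask-worker"},
    {"created": "2019-09-02T14:01:05Z", "component": "dask-worker"},
    {"created": "2019-09-02T09:30:00Z", "component": "jupyter"},
]

def creations_per_day(pods, component):
    """Count pod creations per calendar day for one pod type."""
    days = (
        datetime.strptime(p["created"], "%Y-%m-%dT%H:%M:%SZ").date().isoformat()
        for p in pods
        if p["component"] == component
    )
    return Counter(days)

print(creations_per_day(pods, "dask-worker"))  # Counter({'2019-09-02': 2})
```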
If I divide the number of dask worker pods per day by the number of dask clusters per day, I get the average number of workers per cluster.
Before September, this was normal, with < 50 workers per cluster. But now we are seeing clusters with huge numbers of workers.
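The arithmetic behind that ratio is simple; here is a sketch with made-up daily counts (the real numbers come from the pod-creation data above):

```python
# Hypothetical daily counts for illustration only.
worker_pods_per_day = {"2019-09-01": 400, "2019-09-02": 3500}
clusters_per_day = {"2019-09-01": 10, "2019-09-02": 14}

def avg_workers_per_cluster(workers, clusters):
    """Average number of dask workers per cluster, per day."""
    return {day: workers[day] / clusters[day] for day in workers}

averages = avg_workers_per_cluster(worker_pods_per_day, clusters_per_day)
for day, avg in sorted(averages.items()):
    flag = "  <-- anomalous" if avg > 50 else ""
    print(f"{day}: {avg:.0f} workers/cluster{flag}")
```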
This situation is not sustainable: we will run out of credits much too soon at this burn rate. Furthermore, it reveals several problems with our current configuration.
- Users are able to create essentially an unlimited number of dask workers, up to the limit of the nodepool size.
- We have provided no guidance to our users about best practices.
- We have no way to contact our users (e.g. an email list) to share such information.
- We have no way to track dask statistics to a specific user (see https://github.com/dask/dask-kubernetes/issues/173), so we don’t even know who the culprit is.
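On the first point, one stopgap would be to cap worker requests on our side before they reach the cluster. Below is a minimal sketch of such a guard; the `MAX_WORKERS` policy value and the `capped_scale` helper are hypothetical, not features of dask-kubernetes itself. It only assumes the cluster object exposes a `.scale(n)` method, as dask-kubernetes' KubeCluster does:

```python
MAX_WORKERS = 50  # hypothetical policy limit, chosen to match pre-September usage

def capped_scale(cluster, n, max_workers=MAX_WORKERS):
    """Scale `cluster` to at most `max_workers` workers.

    `cluster` is assumed to expose a `.scale(n)` method, as
    dask-kubernetes KubeCluster does.
    """
    if n > max_workers:
        print(f"Requested {n} workers; capping at {max_workers}.")
        n = max_workers
    cluster.scale(n)
    return n
```

A real fix would enforce the limit server-side (e.g. via namespace resource quotas) rather than trusting a client-side wrapper, but this illustrates the kind of guardrail we currently lack.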
It’s clear that this is not the users’ fault. We have no systems in place to effectively monitor and limit usage, and we lack sufficient documentation and training materials to effectively inform users.
I welcome ideas on what to do now.