With @yuvipanda’s help, I set up a hubploy-driven deployment on GCP for some of my neuroscience work. This deployment uses Dask Gateway to deploy clusters and send work to them. The hope was that this deployment would be more robust, in the sense that stale schedulers/workers (defined as schedulers/workers deployed from Python kernels that are no longer running, for example because the kernel has been restarted) wouldn’t linger on the Kubernetes cluster (this was a big issue in previous iterations).

Nevertheless, in the course of our work, we noticed that this sometimes still happens. That is, a gateway scheduler and its workers keep hanging around even after the creating kernel has gone away. What’s worse is that in this state we can’t effectively start new gateway clusters: a scheduler pod comes online, but workers seem not to, so we can’t get any work done.

The workaround so far has been to kill the stale pods through kubectl, but obviously that’s not a sustainable solution. Any ideas on how to debug/address this? Thanks!
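For reference, the manual cleanup looks roughly like this (the namespace and pod names are placeholders, not the actual ones from our deployment):

```bash
# List pods in the namespace where Dask Gateway launches clusters,
# and pick out the scheduler/worker pods whose kernel is long gone.
kubectl get pods -n <gateway-namespace>

# Delete a stale scheduler or worker pod by name.
kubectl delete pod <stale-pod-name> -n <gateway-namespace>
```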
I think https://github.com/dask/dask-gateway/issues/190 should help with part of this. Currently if the gateway itself restarts, it totally forgets every prior cluster it started.
That would be fine, so long as those clusters (i.e., schedulers and workers) all died when no kernel was pointing to them. But that seems to happen only sometimes, and I think it would be useful to understand why they sometimes just keep hanging around. I’ve tried running kubectl logs on a couple of these pods, but the logs came back empty. Any other commands I should try to help diagnose this?
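For concreteness, this is roughly what I’ve been running, plus a couple of other commands I could try next (pod and namespace names are placeholders):

```bash
# Logs from the current container, and from the previous one if it restarted.
kubectl logs <stale-pod-name> -n <gateway-namespace>
kubectl logs <stale-pod-name> -n <gateway-namespace> --previous

# Events and status for a stale pod (scheduling failures, restarts, etc.).
kubectl describe pod <stale-pod-name> -n <gateway-namespace>

# Recent events in the namespace, which sometimes show why workers never start.
kubectl get events -n <gateway-namespace> --sort-by=.metadata.creationTimestamp
```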
K8s has garbage collection of resources: it can delete stranded pods etc. if the resource they point to via an ownerReference has been deleted.
Is this mechanism of relevance? I’m on mobile and cannot dig deeper at the moment, but figured I’ll mention it briefly anyhow!
Yes, that does sound relevant, but the documentation I found is a bit over my head. It seems there might be a way to tune how aggressive garbage collection is, and maybe my garbage collection is not aggressive enough? But I’m not sure beyond that.
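If I follow, one thing worth checking is whether the stranded pods carry any ownerReferences at all; I think something like this should show it (pod and namespace names are placeholders):

```bash
# Print any ownerReferences on a stranded scheduler/worker pod;
# if this comes back empty, k8s garbage collection has no owner to key off of.
kubectl get pod <stale-pod-name> -n <gateway-namespace> \
  -o jsonpath='{.metadata.ownerReferences}'
```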
I’ve learned more about Dask Gateway and gained some experience with it. With version 0.7.1 of the Dask Gateway Helm chart, I’ve experienced worker pods going stale while still clogging up k8s resources.
That issue is tracked here: