Stale scheduler/workers on a deployment using Dask Gateway

arokem · January 15, 2020, 7:37pm

With @yuvipanda’s help, I set up a hubploy-driven deployment on GCP for some of my neuroscience work. This deployment uses Dask Gateway to deploy clusters and send work to them. The hope was that this deployment would be more robust, in the sense that stale schedulers/workers (that is, schedulers/workers deployed from Python kernels that are no longer running, for example because the kernel has been restarted) wouldn’t linger on the Kubernetes cluster (this was a big issue in previous iterations). Nevertheless, in the course of our work, we noticed that this still sometimes happens: a gateway scheduler and its workers keep hanging around, even after the kernel that created them has gone away. What’s worse, in this state we can’t effectively start new gateway clusters: a scheduler pod comes online, but workers don’t seem to, so we can’t get any work done. The workaround so far has been to kill the stale pods through kubectl, but obviously that’s not sustainable. Any ideas on how to debug/address this? Thanks!
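As an alternative to killing pods by hand, the clusters the gateway still tracks can be listed and stopped from a fresh Python session. The following is a minimal sketch, not a confirmed fix for this deployment. The helper name `stop_all_clusters` and its `dry_run` flag are made up for illustration; it relies only on `list_clusters()` and `stop_cluster()` from the `dask_gateway.Gateway` client. It assumes the gateway still remembers the clusters, so it would not help for clusters the gateway has lost track of.

```python
def stop_all_clusters(gateway, dry_run=True):
    """List the clusters the gateway reports and optionally stop them.

    `gateway` is expected to behave like dask_gateway.Gateway: it must
    provide list_clusters() (returning objects with a .name attribute)
    and stop_cluster(name). Returns the list of cluster names found.
    """
    names = [report.name for report in gateway.list_clusters()]
    if not dry_run:
        for name in names:
            # Ask the gateway to shut down the scheduler and its workers.
            gateway.stop_cluster(name)
    return names


# Hypothetical usage (requires dask-gateway and a configured gateway address):
#
#     from dask_gateway import Gateway
#     gateway = Gateway()
#     print(stop_all_clusters(gateway))            # only list
#     stop_all_clusters(gateway, dry_run=False)    # actually stop
```

Running with the default `dry_run=True` first makes it possible to check which clusters would be stopped before anything is shut down.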

yuvipanda · January 15, 2020, 7:49pm

I think https://github.com/dask/dask-gateway/issues/190 should help with part of this. Currently if the gateway itself restarts, it totally forgets every prior cluster it started.

arokem · January 17, 2020, 5:33am

That would be fine, so long as those clusters (i.e. schedulers and workers) all died when no kernel was pointing to them. But that seems to be the case only sometimes. I think that it would be useful to understand why they sometimes just keep hanging around. I’ve tried running kubectl logs in a couple of these cases, but the logs came back empty. Any other commands I should try to help diagnose this?

consideRatio · January 18, 2020, 1:53pm

K8s has garbage collection of resources: it can delete stranded pods etc. if the resource they point to in their ownerReferences has been deleted.

Is this mechanism of relevance? I’m on mobile and cannot dig deeper at the moment, but figured I’ll mention it briefly anyhow!
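One way to check whether this mechanism could apply is to see which pods carry an owner reference at all. The sketch below assumes the pod list has been exported with `kubectl get pods -o json` and reports the pods whose `metadata.ownerReferences` is missing or empty. The function name `pods_without_owner` is made up for illustration, and whether Dask Gateway sets owner references on its scheduler and worker pods is not established in this thread.

```python
import json


def pods_without_owner(pod_list_json):
    """Return the names of pods that have no ownerReferences.

    `pod_list_json` is the text produced by `kubectl get pods -o json`,
    i.e. a JSON object with an "items" list of pod manifests.
    """
    pod_list = json.loads(pod_list_json)
    orphans = []
    for pod in pod_list.get("items", []):
        metadata = pod.get("metadata", {})
        # A missing or empty ownerReferences list means no owner is
        # recorded, so deleting another resource will not cascade to it.
        if not metadata.get("ownerReferences"):
            orphans.append(metadata.get("name"))
    return orphans


# Hypothetical usage, with the namespace name as a placeholder:
#
#     kubectl get pods -n <namespace> -o json > pods.json
#
#     with open("pods.json") as f:
#         print(pods_without_owner(f.read()))
```

Pods listed by this check would not be cleaned up by owner-based garbage collection, so they would have to be removed some other way.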

arokem · January 24, 2020, 11:09pm

Yes: that does sound relevant, but the documentation I found is a bit over my head. It seems that there might be a way to tune how aggressive garbage collection is, and maybe my garbage collection is not aggressive enough? But I am not sure beyond that.

consideRatio · April 15, 2020, 9:07pm

I’ve learned more about Dask Gateway and gained some experience. In version 0.7.1 of the Helm chart of Dask Gateway, I’ve experienced worker pods going stale but still clogging k8s resources.

That issue is tracked here:
