I have been experiencing strange behavior from dask gateway this week. The dask clusters just seem to freeze in an idle state. This is at the beginning of my computation
Work stealing is not happening. It takes several minutes to resolve.
After that the computation gets stuck again in a state like this. There is no error, no feedback, nothing obviously wrong. It just never finishes.
Could the starting issue be that you are waiting for cloud instances to spin up to support the dask gateway workers? This usually takes several minutes.
For the other issue, perhaps check the worker logs?
While I do notice cloud instances taking a while to spin up, the issue is when I already have active workers but some are stalling and sometimes only a few workers are handling the tasks while the rest don’t show any activity.
Checking the worker logs revealed an error message in one of the workers.
I have found a workaround to this issue. Thank you for commenting!