Issues getting started with Xarray and Dask on Pangeo

Hi!

I am a heavy user of Pangeo, though not a hard-core developer. My first, zeroth-order suggestion for this issue is: do not use adapt().

For some reason, adapt() does not work as it should. Based on a lot of frustration accumulated over the past months, my explanation for these issues is that adaptive scaling closes/kills workers too soon, so parts of your computation are lost before they can be picked up by other workers.

Another observation: even with a plain scale() (so no adapt) there is sometimes some “miscommunication”, for lack of a better word, in the workflow. Still, I believe you will see a huge improvement if you stop using adapt().
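
As a rough illustration of what I mean, this is the pattern I use instead of adapt() (a minimal sketch; the worker count is just an example, and the cluster options depend on your Pangeo deployment):

from dask_gateway import Gateway

gateway = Gateway()
cluster = gateway.new_cluster()  # options here depend on your deployment

# Request a fixed number of workers instead of cluster.adapt(minimum=..., maximum=...)
cluster.scale(20)

client = cluster.get_client()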

Another important thing I want to point you to: make sure the clusters you are losing connection with are actually terminated and don’t just hang in the background.

I have noticed that whenever I had this issue (i.e. I lost the connection to my cluster while using adapt()), even if I ran cluster.close(), restarted the kernel, or even restarted the server (logged out and back in), the clusters were still there as zombie clusters, often still with a number of workers assigned (holding memory that cannot be used by others). In theory, unused clusters should terminate themselves after a certain amount of idle time, but sometimes they don’t.

I invite you to read up on this issue (linked below) and check whether you have clusters hanging around; if so, 1) scale them to zero after connecting to them, and 2) close them. If you are unable to close them (there is a delay, but eventually they should die), please open an issue and report them.

issue talking about zombie clusters

So, summarizing:
Open another notebook and run:

from dask_gateway import Gateway
g = Gateway()
g.list_clusters()  # lists all clusters still registered under your account

This will list all the clusters still hanging around. Hopefully there are none, unless you are currently running something.
You can then connect to any of them with:

cluster = g.connect(g.list_clusters()[0].name)
cluster

or [1], [2], and so on, depending on how many clusters you have.
Then, if you are certain they shouldn’t be there, scale them to zero:
cluster.scale(0)
so that at least they don’t hold memory, and then try to kill them:
cluster.close()
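
If you find several zombie clusters, here is a minimal sketch (same Gateway calls as above) that cleans them all up in one go; only run it if you are sure none of the listed clusters belong to a computation you actually care about:

from dask_gateway import Gateway

g = Gateway()
for report in g.list_clusters():     # one report per cluster still registered
    cluster = g.connect(report.name) # attach to the zombie cluster
    cluster.scale(0)                 # release its workers (and their memory) first
    cluster.close()                  # then ask the gateway to shut it down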

Please let me know if any of what I wrote is unclear. I have been there! Happy to help.

Chiara