Worker restart when using dask-jobqueue

Hi everyone!

I have recently started using dask-jobqueue to orchestrate a multi-node cluster with Slurm. My base setup (~/.config/dask/jobqueue.yaml) is:

jobqueue:
  slurm:
    name: dask-worker

    # Dask worker options - some reference here : https://stackoverflow.com/a/65702566/4743714
    cores: 8                 # Total number of cores per job
    memory: "32 GB"                # Total amount of memory per job
    processes: 2                # Number of Python processes per job

    python: null                # Python executable
    interface: ib0             # Network interface to use like eth0 or ib0
    death-timeout: 60           # Number of seconds to wait if a worker can not find a scheduler
    local-directory: ~/scratch/dask_journaling_cache	   # Location of fast local storage like /scratch or $TMPDIR
    shared-temp-directory: null       # Shared directory currently used to dump temporary security objects for workers
    extra: null                 # deprecated: use worker-extra-args
    worker-command: "distributed.cli.dask_worker" # Command to launch a worker
    worker-extra-args: ['--lifetime','55m', '--lifetime-stagger', '4m', '--lifetime-restart']       # Additional arguments to pass to `dask-worker`

    # SLURM resource manager options
    # shebang: "#!/usr/bin/env bash" - no custom shell
    queue: null
    account: "def-ggalex"
    walltime: '01:01:00'
    env-extra: null
    job-script-prologue: []
    job-cpu: null
    job-mem: null
    job-extra: null
    job-extra-directives: []
    job-directives-skip: []
    log-directory: null

    # Scheduler options
    scheduler-options: {}

distributed:
  worker: 
    memory:
      target: false # don't spill to disk
      spill: false # don't spill to disk
      pause: 0.80 # pause execution at this percentage of memory usage
      terminate: 0.95 # restart the worker at this percentage of memory usage
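
For reference, I start the cluster from Python roughly like this (a minimal sketch, not my actual launch script; the number of jobs is just an example):

from dask.distributed import Client
from dask_jobqueue import SLURMCluster

# SLURMCluster picks up the jobqueue.slurm section of
# ~/.config/dask/jobqueue.yaml, so no keyword arguments are
# strictly required here.
cluster = SLURMCluster()

# Ask Slurm for 4 jobs; with processes: 2 this gives 8 workers in total.
cluster.scale(jobs=4)

client = Client(cluster)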

After waiting an hour to see whether the --lifetime-restart parameter kicks in: it does (before, only --lifetime and --lifetime-stagger were working, in a “manual” mode, which is why I added the restart flag). The restart seems to start but never finishes, and the list of workers on the Dask dashboard is empty. Looking into one of the Slurm stdout files, I see this:

cat slurm-26985684.out
2024-03-19 23:16:38,005 - distributed.nanny - INFO -         Start Nanny at: 'tcp://10.82.1.47:34129'
2024-03-19 23:16:38,012 - distributed.nanny - INFO -         Start Nanny at: 'tcp://10.82.1.47:45251'
/home/lourenco/miniforge3/lib/python3.10/contextlib.py:142: UserWarning: Creating scratch directories is taking a surprisingly long time. (2.69s) This is often due to running workers on a network file system. Consider specifying a local-directory to point workers to write scratch data to a local disk.
  next(self.gen)
/home/lourenco/miniforge3/lib/python3.10/contextlib.py:142: UserWarning: Creating scratch directories is taking a surprisingly long time. (2.71s) This is often due to running workers on a network file system. Consider specifying a local-directory to point workers to write scratch data to a local disk.
  next(self.gen)
2024-03-19 23:16:42,227 - distributed.worker - INFO -       Start worker at:     tcp://10.82.1.47:36569
2024-03-19 23:16:42,227 - distributed.worker - INFO -       Start worker at:     tcp://10.82.1.47:37101
2024-03-19 23:16:42,227 - distributed.worker - INFO -          Listening to:     tcp://10.82.1.47:36569
2024-03-19 23:16:42,227 - distributed.worker - INFO -           Worker name:           SLURMCluster-7-0
2024-03-19 23:16:42,227 - distributed.worker - INFO -          Listening to:     tcp://10.82.1.47:37101
2024-03-19 23:16:42,227 - distributed.worker - INFO -          dashboard at:           10.82.1.47:35399
2024-03-19 23:16:42,227 - distributed.worker - INFO -           Worker name:           SLURMCluster-7-1
2024-03-19 23:16:42,227 - distributed.worker - INFO - Waiting to connect to:     tcp://10.82.49.4:42757
2024-03-19 23:16:42,227 - distributed.worker - INFO -          dashboard at:           10.82.1.47:41755
2024-03-19 23:16:42,227 - distributed.worker - INFO - -------------------------------------------------
2024-03-19 23:16:42,228 - distributed.worker - INFO - Waiting to connect to:     tcp://10.82.49.4:42757
2024-03-19 23:16:42,228 - distributed.worker - INFO -               Threads:                          4
2024-03-19 23:16:42,228 - distributed.worker - INFO - -------------------------------------------------
2024-03-19 23:16:42,228 - distributed.worker - INFO -                Memory:                  14.90 GiB
2024-03-19 23:16:42,228 - distributed.worker - INFO -               Threads:                          4
2024-03-19 23:16:42,228 - distributed.worker - INFO -       Local Directory: /home/lourenco/scratch/dask_journaling_cache/dask-scratch-space/worker-1cvvyh3h
2024-03-19 23:16:42,228 - distributed.worker - INFO -                Memory:                  14.90 GiB
2024-03-19 23:16:42,228 - distributed.worker - INFO -       Local Directory: /home/lourenco/scratch/dask_journaling_cache/dask-scratch-space/worker-j4up4ex9
2024-03-19 23:16:42,228 - distributed.worker - INFO - -------------------------------------------------
2024-03-19 23:16:42,228 - distributed.worker - INFO - -------------------------------------------------
2024-03-19 23:16:43,513 - distributed.worker - INFO - Starting Worker plugin shuffle
2024-03-19 23:16:43,514 - distributed.worker - INFO -         Registered to:     tcp://10.82.49.4:42757
2024-03-19 23:16:43,514 - distributed.worker - INFO - -------------------------------------------------
2024-03-19 23:16:43,514 - distributed.core - INFO - Starting established connection to tcp://10.82.49.4:42757
2024-03-19 23:16:43,514 - distributed.worker - INFO - Starting Worker plugin shuffle
2024-03-19 23:16:43,515 - distributed.worker - INFO -         Registered to:     tcp://10.82.49.4:42757
2024-03-19 23:16:43,515 - distributed.worker - INFO - -------------------------------------------------
2024-03-19 23:16:43,515 - distributed.core - INFO - Starting established connection to tcp://10.82.49.4:42757
2024-03-20 00:10:01,184 - distributed.worker - INFO - Closing worker gracefully: tcp://10.82.1.47:37101. Reason: worker-lifetime-reached
2024-03-20 00:10:01,282 - distributed.worker - INFO - Stopping worker at tcp://10.82.1.47:37101. Reason: worker-lifetime-reached
2024-03-20 00:10:01,284 - distributed.core - INFO - Connection to tcp://10.82.49.4:42757 has been closed.
2024-03-20 00:10:01,289 - distributed.nanny - INFO - Worker closed
2024-03-20 00:10:03,729 - distributed.nanny - WARNING - Restarting worker
/home/lourenco/miniforge3/lib/python3.10/contextlib.py:142: UserWarning: Creating scratch directories is taking a surprisingly long time. (5.19s) This is often due to running workers on a network file system. Consider specifying a local-directory to point workers to write scratch data to a local disk.
  next(self.gen)
2024-03-20 00:10:10,135 - distributed.worker - INFO -       Start worker at:     tcp://10.82.1.47:46869
2024-03-20 00:10:10,135 - distributed.worker - INFO -          Listening to:     tcp://10.82.1.47:46869
2024-03-20 00:10:10,135 - distributed.worker - INFO -           Worker name:           SLURMCluster-7-1
2024-03-20 00:10:10,135 - distributed.worker - INFO -          dashboard at:           10.82.1.47:34411
2024-03-20 00:10:10,135 - distributed.worker - INFO - Waiting to connect to:     tcp://10.82.49.4:42757
2024-03-20 00:10:10,135 - distributed.worker - INFO - -------------------------------------------------
2024-03-20 00:10:10,135 - distributed.worker - INFO -               Threads:                          4
2024-03-20 00:10:10,135 - distributed.worker - INFO -                Memory:                  14.90 GiB
2024-03-20 00:10:10,135 - distributed.worker - INFO -       Local Directory: /home/lourenco/scratch/dask_journaling_cache/dask-scratch-space/worker-ct0av0xp
2024-03-20 00:10:10,135 - distributed.worker - INFO - -------------------------------------------------
2024-03-20 00:10:10,691 - distributed.worker - INFO - Starting Worker plugin shuffle
2024-03-20 00:10:10,691 - distributed.worker - INFO -         Registered to:     tcp://10.82.49.4:42757
2024-03-20 00:10:10,691 - distributed.worker - INFO - -------------------------------------------------
2024-03-20 00:10:10,692 - distributed.core - INFO - Starting established connection to tcp://10.82.49.4:42757
2024-03-20 00:14:13,810 - distributed.worker - INFO - Closing worker gracefully: tcp://10.82.1.47:36569. Reason: worker-lifetime-reached
2024-03-20 00:14:14,027 - distributed.worker - INFO - Stopping worker at tcp://10.82.1.47:36569. Reason: worker-lifetime-reached
2024-03-20 00:14:14,028 - distributed.worker - WARNING - Scheduler was unaware of this worker; shutting down.
2024-03-20 00:14:14,028 - distributed.core - INFO - Connection to tcp://10.82.49.4:42757 has been closed.
2024-03-20 00:14:14,034 - distributed.nanny - INFO - Worker closed
2024-03-20 00:14:16,035 - distributed.nanny - ERROR - Worker process died unexpectedly
2024-03-20 00:14:16,332 - distributed.nanny - WARNING - Restarting worker
/home/lourenco/miniforge3/lib/python3.10/contextlib.py:142: UserWarning: Creating scratch directories is taking a surprisingly long time. (1.70s) This is often due to running workers on a network file system. Consider specifying a local-directory to point workers to write scratch data to a local disk.
  next(self.gen)
2024-03-20 00:14:19,226 - distributed.worker - INFO -       Start worker at:     tcp://10.82.1.47:35297
2024-03-20 00:14:19,226 - distributed.worker - INFO -          Listening to:     tcp://10.82.1.47:35297
2024-03-20 00:14:19,226 - distributed.worker - INFO -           Worker name:           SLURMCluster-7-0
2024-03-20 00:14:19,226 - distributed.worker - INFO -          dashboard at:           10.82.1.47:37733
2024-03-20 00:14:19,226 - distributed.worker - INFO - Waiting to connect to:     tcp://10.82.49.4:42757
2024-03-20 00:14:19,226 - distributed.worker - INFO - -------------------------------------------------
2024-03-20 00:14:19,226 - distributed.worker - INFO -               Threads:                          4
2024-03-20 00:14:19,226 - distributed.worker - INFO -                Memory:                  14.90 GiB
2024-03-20 00:14:19,226 - distributed.worker - INFO -       Local Directory: /home/lourenco/scratch/dask_journaling_cache/dask-scratch-space/worker-qcp8a_kj
2024-03-20 00:14:19,226 - distributed.worker - INFO - -------------------------------------------------
2024-03-20 00:14:19,778 - distributed.worker - INFO - Starting Worker plugin shuffle
2024-03-20 00:14:19,779 - distributed.worker - INFO -         Registered to:     tcp://10.82.49.4:42757
2024-03-20 00:14:19,779 - distributed.worker - INFO - -------------------------------------------------
2024-03-20 00:14:19,779 - distributed.core - INFO - Starting established connection to tcp://10.82.49.4:42757
slurmstepd: error: *** JOB 26985684 ON nc10147 CANCELLED AT 2024-03-20T04:17:52 DUE TO TIME LIMIT ***

My interpretation of the log is that the worker does start coming back up, but the filesystem is slow. Because of that, the system cannot bring up a new replica of the worker before the job is killed at its Slurm time limit.

Is this interpretation correct? I am a bit lost on what to tune to fix this error.

(@geynard I wonder if you have any comment on this)

Hi @rlourenco,

I’m not sure what you are trying to achieve here: why do you want to use the --lifetime-restart parameter in a Slurm job? It looks like your workers are restarting (despite the warning about the scratch directory), but the Slurm walltime is probably reached just afterwards and everything is stopped.

Hi @geynard !

Answering your question: my idea is to run a cluster with small-walltime jobs (more than 1 h but less than 3 h) so that these small jobs get prioritized. If my run is still ongoing, the lifetime parametrization would keep the main run alive through a phased, planned shutdown of workers, so that the run can complete (if I am not wrong, @mrocklin mentioned this strategy of requesting small jobs in his old jobqueue example video).

The workers seem to start phasing out as planned, but the --lifetime-restart parameter does not appear to kick in, so workers are not automatically restored once the walltime is reached. For online analysis I can manage this, because I can manually increase and decrease the job requests (roughly as sketched below); the jobs then end at different times, so the main job won’t be killed.
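
In practice that just means calling scale() by hand at different moments, something like this (a sketch; the job counts are arbitrary):

# Ask Slurm for more worker jobs while the analysis is running.
cluster.scale(jobs=8)

# Later, scale back down; jobs submitted at different times also
# hit their walltime at different times, so some workers are
# always attached to the scheduler.
cluster.scale(jobs=2)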

However, if I want to do a longer run overnight, that becomes tricky. A solution would be to request longer jobs (in the base worker setup), but I was curious whether someone has managed to make this strategy of small jobs, “resurrecting” them as they expire, actually work, since it exists as an option in jobqueue.

@andersy005 I have just watched your talk at the Workshop for Dask in HPC from about three years ago (the Zoom recording). I wonder if you have been using this flag with jobqueue.

Actually, this parameter has kicked in, but it only has an effect on the worker process, not on the Slurm job. For what you’re trying to achieve, you need to use an adaptive cluster. See this documentation for an example.

Be aware, though, that there are still some issues when using adaptive scaling with dask-jobqueue. Be sure, for example, to use only one process per job, as in the sketch below.
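
Something along these lines should do it (a minimal sketch; the job bounds are just placeholders):

from dask.distributed import Client
from dask_jobqueue import SLURMCluster

# Keep a single worker process per job, which avoids the known
# issues with adaptive scaling in dask-jobqueue.
cluster = SLURMCluster(processes=1)

# Let the scheduler submit and cancel Slurm jobs as the workload changes.
cluster.adapt(minimum_jobs=0, maximum_jobs=10)

client = Client(cluster)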


Oh, thanks for pointing out the need for adaptive scaling. I was getting jobs killed with processes > 1, likely related to the way I was scheduling them.

PS [Update]: Just finished testing, and it works like a charm. Thanks @geynard for the solution :slight_smile: