Hi everyone!
I have recently started using dask-jobqueue to orchestrate a multi-node cluster with Slurm. My base setup (~/.config/dask/jobqueue.yaml) is:
jobqueue:
  slurm:
    name: dask-worker

    # Dask worker options - some reference here: https://stackoverflow.com/a/65702566/4743714
    cores: 8                     # Total number of cores per job
    memory: "32 GB"              # Total amount of memory per job
    processes: 2                 # Number of Python processes per job
    python: null                 # Python executable
    interface: ib0               # Network interface to use like eth0 or ib0
    death-timeout: 60            # Number of seconds to wait if a worker can not find a scheduler
    local-directory: ~/scratch/dask_journaling_cache  # Location of fast local storage like /scratch or $TMPDIR
    shared-temp-directory: null  # Shared directory currently used to dump temporary security objects for workers
    extra: null                  # deprecated: use worker-extra-args
    worker-command: "distributed.cli.dask_worker"  # Command to launch a worker
    worker-extra-args: ['--lifetime', '55m', '--lifetime-stagger', '4m', '--lifetime-restart']  # Additional arguments to pass to `dask-worker`

    # SLURM resource manager options
    # shebang: "#!/usr/bin/env bash" - no custom shell
    queue: null
    account: "def-ggalex"
    walltime: '01:01:00'
    env-extra: null
    job-script-prologue: []
    job-cpu: null
    job-mem: null
    job-extra: null
    job-extra-directives: []
    job-directives-skip: []
    log-directory: null

    # Scheduler options
    scheduler-options: {}

distributed:
  worker:
    memory:
      target: false     # don't spill to disk
      spill: false      # don't spill to disk
      pause: 0.80       # pause execution at this percentage of memory usage
      terminate: 0.95   # restart the worker at this percentage of memory usage
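For context, here is a minimal sketch of how such a cluster can be brought up against this config (not my exact script; the scale() count is just a placeholder), plus a way to confirm that the lifetime flags end up in the generated sbatch script:

```python
from dask_jobqueue import SLURMCluster
from distributed import Client

# SLURMCluster reads the jobqueue.slurm section of ~/.config/dask/jobqueue.yaml
cluster = SLURMCluster()

# Print the generated sbatch script to check that --lifetime, --lifetime-stagger
# and --lifetime-restart are rendered into the dask worker command line.
print(cluster.job_script())

cluster.scale(jobs=8)      # placeholder number of Slurm jobs
client = Client(cluster)
```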
I waited an hour to see if the --lifetime-restart flag kicks in, and it does: previously only --lifetime and --lifetime-stagger were working, in a "manual" mode, so I decided to add the restart flag. The restart does start, but it never seems to finish, and the list of workers on the Dask dashboard is empty. Looking into one of the Slurm stdout files, I see this:
cat slurm-26985684.out
2024-03-19 23:16:38,005 - distributed.nanny - INFO - Start Nanny at: 'tcp://10.82.1.47:34129'
2024-03-19 23:16:38,012 - distributed.nanny - INFO - Start Nanny at: 'tcp://10.82.1.47:45251'
/home/lourenco/miniforge3/lib/python3.10/contextlib.py:142: UserWarning: Creating scratch directories is taking a surprisingly long time. (2.69s) This is often due to running workers on a network file system. Consider specifying a local-directory to point workers to write scratch data to a local disk.
next(self.gen)
/home/lourenco/miniforge3/lib/python3.10/contextlib.py:142: UserWarning: Creating scratch directories is taking a surprisingly long time. (2.71s) This is often due to running workers on a network file system. Consider specifying a local-directory to point workers to write scratch data to a local disk.
next(self.gen)
2024-03-19 23:16:42,227 - distributed.worker - INFO - Start worker at: tcp://10.82.1.47:36569
2024-03-19 23:16:42,227 - distributed.worker - INFO - Start worker at: tcp://10.82.1.47:37101
2024-03-19 23:16:42,227 - distributed.worker - INFO - Listening to: tcp://10.82.1.47:36569
2024-03-19 23:16:42,227 - distributed.worker - INFO - Worker name: SLURMCluster-7-0
2024-03-19 23:16:42,227 - distributed.worker - INFO - Listening to: tcp://10.82.1.47:37101
2024-03-19 23:16:42,227 - distributed.worker - INFO - dashboard at: 10.82.1.47:35399
2024-03-19 23:16:42,227 - distributed.worker - INFO - Worker name: SLURMCluster-7-1
2024-03-19 23:16:42,227 - distributed.worker - INFO - Waiting to connect to: tcp://10.82.49.4:42757
2024-03-19 23:16:42,227 - distributed.worker - INFO - dashboard at: 10.82.1.47:41755
2024-03-19 23:16:42,227 - distributed.worker - INFO - -------------------------------------------------
2024-03-19 23:16:42,228 - distributed.worker - INFO - Waiting to connect to: tcp://10.82.49.4:42757
2024-03-19 23:16:42,228 - distributed.worker - INFO - Threads: 4
2024-03-19 23:16:42,228 - distributed.worker - INFO - -------------------------------------------------
2024-03-19 23:16:42,228 - distributed.worker - INFO - Memory: 14.90 GiB
2024-03-19 23:16:42,228 - distributed.worker - INFO - Threads: 4
2024-03-19 23:16:42,228 - distributed.worker - INFO - Local Directory: /home/lourenco/scratch/dask_journaling_cache/dask-scratch-space/worker-1cvvyh3h
2024-03-19 23:16:42,228 - distributed.worker - INFO - Memory: 14.90 GiB
2024-03-19 23:16:42,228 - distributed.worker - INFO - Local Directory: /home/lourenco/scratch/dask_journaling_cache/dask-scratch-space/worker-j4up4ex9
2024-03-19 23:16:42,228 - distributed.worker - INFO - -------------------------------------------------
2024-03-19 23:16:42,228 - distributed.worker - INFO - -------------------------------------------------
2024-03-19 23:16:43,513 - distributed.worker - INFO - Starting Worker plugin shuffle
2024-03-19 23:16:43,514 - distributed.worker - INFO - Registered to: tcp://10.82.49.4:42757
2024-03-19 23:16:43,514 - distributed.worker - INFO - -------------------------------------------------
2024-03-19 23:16:43,514 - distributed.core - INFO - Starting established connection to tcp://10.82.49.4:42757
2024-03-19 23:16:43,514 - distributed.worker - INFO - Starting Worker plugin shuffle
2024-03-19 23:16:43,515 - distributed.worker - INFO - Registered to: tcp://10.82.49.4:42757
2024-03-19 23:16:43,515 - distributed.worker - INFO - -------------------------------------------------
2024-03-19 23:16:43,515 - distributed.core - INFO - Starting established connection to tcp://10.82.49.4:42757
2024-03-20 00:10:01,184 - distributed.worker - INFO - Closing worker gracefully: tcp://10.82.1.47:37101. Reason: worker-lifetime-reached
2024-03-20 00:10:01,282 - distributed.worker - INFO - Stopping worker at tcp://10.82.1.47:37101. Reason: worker-lifetime-reached
2024-03-20 00:10:01,284 - distributed.core - INFO - Connection to tcp://10.82.49.4:42757 has been closed.
2024-03-20 00:10:01,289 - distributed.nanny - INFO - Worker closed
2024-03-20 00:10:03,729 - distributed.nanny - WARNING - Restarting worker
/home/lourenco/miniforge3/lib/python3.10/contextlib.py:142: UserWarning: Creating scratch directories is taking a surprisingly long time. (5.19s) This is often due to running workers on a network file system. Consider specifying a local-directory to point workers to write scratch data to a local disk.
next(self.gen)
2024-03-20 00:10:10,135 - distributed.worker - INFO - Start worker at: tcp://10.82.1.47:46869
2024-03-20 00:10:10,135 - distributed.worker - INFO - Listening to: tcp://10.82.1.47:46869
2024-03-20 00:10:10,135 - distributed.worker - INFO - Worker name: SLURMCluster-7-1
2024-03-20 00:10:10,135 - distributed.worker - INFO - dashboard at: 10.82.1.47:34411
2024-03-20 00:10:10,135 - distributed.worker - INFO - Waiting to connect to: tcp://10.82.49.4:42757
2024-03-20 00:10:10,135 - distributed.worker - INFO - -------------------------------------------------
2024-03-20 00:10:10,135 - distributed.worker - INFO - Threads: 4
2024-03-20 00:10:10,135 - distributed.worker - INFO - Memory: 14.90 GiB
2024-03-20 00:10:10,135 - distributed.worker - INFO - Local Directory: /home/lourenco/scratch/dask_journaling_cache/dask-scratch-space/worker-ct0av0xp
2024-03-20 00:10:10,135 - distributed.worker - INFO - -------------------------------------------------
2024-03-20 00:10:10,691 - distributed.worker - INFO - Starting Worker plugin shuffle
2024-03-20 00:10:10,691 - distributed.worker - INFO - Registered to: tcp://10.82.49.4:42757
2024-03-20 00:10:10,691 - distributed.worker - INFO - -------------------------------------------------
2024-03-20 00:10:10,692 - distributed.core - INFO - Starting established connection to tcp://10.82.49.4:42757
2024-03-20 00:14:13,810 - distributed.worker - INFO - Closing worker gracefully: tcp://10.82.1.47:36569. Reason: worker-lifetime-reached
2024-03-20 00:14:14,027 - distributed.worker - INFO - Stopping worker at tcp://10.82.1.47:36569. Reason: worker-lifetime-reached
2024-03-20 00:14:14,028 - distributed.worker - WARNING - Scheduler was unaware of this worker; shutting down.
2024-03-20 00:14:14,028 - distributed.core - INFO - Connection to tcp://10.82.49.4:42757 has been closed.
2024-03-20 00:14:14,034 - distributed.nanny - INFO - Worker closed
2024-03-20 00:14:16,035 - distributed.nanny - ERROR - Worker process died unexpectedly
2024-03-20 00:14:16,332 - distributed.nanny - WARNING - Restarting worker
/home/lourenco/miniforge3/lib/python3.10/contextlib.py:142: UserWarning: Creating scratch directories is taking a surprisingly long time. (1.70s) This is often due to running workers on a network file system. Consider specifying a local-directory to point workers to write scratch data to a local disk.
next(self.gen)
2024-03-20 00:14:19,226 - distributed.worker - INFO - Start worker at: tcp://10.82.1.47:35297
2024-03-20 00:14:19,226 - distributed.worker - INFO - Listening to: tcp://10.82.1.47:35297
2024-03-20 00:14:19,226 - distributed.worker - INFO - Worker name: SLURMCluster-7-0
2024-03-20 00:14:19,226 - distributed.worker - INFO - dashboard at: 10.82.1.47:37733
2024-03-20 00:14:19,226 - distributed.worker - INFO - Waiting to connect to: tcp://10.82.49.4:42757
2024-03-20 00:14:19,226 - distributed.worker - INFO - -------------------------------------------------
2024-03-20 00:14:19,226 - distributed.worker - INFO - Threads: 4
2024-03-20 00:14:19,226 - distributed.worker - INFO - Memory: 14.90 GiB
2024-03-20 00:14:19,226 - distributed.worker - INFO - Local Directory: /home/lourenco/scratch/dask_journaling_cache/dask-scratch-space/worker-qcp8a_kj
2024-03-20 00:14:19,226 - distributed.worker - INFO - -------------------------------------------------
2024-03-20 00:14:19,778 - distributed.worker - INFO - Starting Worker plugin shuffle
2024-03-20 00:14:19,779 - distributed.worker - INFO - Registered to: tcp://10.82.49.4:42757
2024-03-20 00:14:19,779 - distributed.worker - INFO - -------------------------------------------------
2024-03-20 00:14:19,779 - distributed.core - INFO - Starting established connection to tcp://10.82.49.4:42757
slurmstepd: error: *** JOB 26985684 ON nc10147 CANCELLED AT 2024-03-20T04:17:52 DUE TO TIME LIMIT ***
My interpretation of the log is that the worker does start coming back up, but the filesystem is slow; because of that, the new replica of the worker cannot get going before the job is killed at its time limit.
Is this interpretation correct? I am a bit lost on what to tune to fix this error.
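One thing I am considering, given the "Creating scratch directories is taking a surprisingly long time" warnings, is pointing local-directory at node-local storage instead of the shared scratch. This is just a guess on my side, assuming $SLURM_TMPDIR points to a node-local path on this cluster:

```python
from dask_jobqueue import SLURMCluster

# Guess: use node-local storage for worker scratch space so that restarted
# workers do not stall while creating scratch directories on the network
# filesystem. "$SLURM_TMPDIR" is left unexpanded here; it is expanded by the
# shell inside the generated job script on the compute node.
cluster = SLURMCluster(local_directory="$SLURM_TMPDIR")
```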
(@geynard I wonder if you have any comment on this)