When setting up a SLURMCluster on the Niagara HPC system, for example:
from dask_jobqueue import SLURMCluster

cluster = SLURMCluster(project='def-ggalex',
                       cores=80,
                       memory='100GB',
                       job_extra=['--nodes=1'],
                       walltime='0:30:00')
I get a SLURM error because I define the memory parameter on SLURMCluster:
Task exception was never retrieved
future: <Task finished name='Task-48' coro=<_wrap_awaitable() done, defined at /home/g/ggalex/lourenco/scratch/miniconda3/envs/icesat/lib/python3.8/asyncio/tasks.py:688> exception=RuntimeError('Command exited with non-zero exit code.\nExit code: 64\nCommand:\nsbatch /tmp/tmpjg_q4yzl.sh\nstdout:\n\nstderr:\nSBATCH ERROR: \n The --mem=... request is not allowed nor necessary on Niagara; all nodes have\n the same amount of available memory (175 GiB) and each job get all the\n available memory of the node\nSBATCH: 1 error was found.\nSBATCH: Job not submitted because of this error.\n\n')>
Traceback (most recent call last):
File "/home/g/ggalex/lourenco/scratch/miniconda3/envs/icesat/lib/python3.8/asyncio/tasks.py", line 695, in _wrap_awaitable
return (yield from awaitable.__await__())
File "/home/g/ggalex/lourenco/scratch/miniconda3/envs/icesat/lib/python3.8/site-packages/distributed/deploy/spec.py", line 59, in _
await self.start()
File "/home/g/ggalex/lourenco/scratch/miniconda3/envs/icesat/lib/python3.8/site-packages/dask_jobqueue/core.py", line 325, in start
out = await self._submit_job(fn)
File "/home/g/ggalex/lourenco/scratch/miniconda3/envs/icesat/lib/python3.8/site-packages/dask_jobqueue/core.py", line 308, in _submit_job
return self._call(shlex.split(self.submit_command) + [script_filename])
File "/home/g/ggalex/lourenco/scratch/miniconda3/envs/icesat/lib/python3.8/site-packages/dask_jobqueue/core.py", line 403, in _call
raise RuntimeError(
RuntimeError: Command exited with non-zero exit code.
Exit code: 64
Command:
sbatch /tmp/tmpjg_q4yzl.sh
stdout:
stderr:
SBATCH ERROR:
The --mem=... request is not allowed nor necessary on Niagara; all nodes have
the same amount of available memory (175 GiB) and each job get all the
available memory of the node
SBATCH: 1 error was found.
SBATCH: Job not submitted because of this error.
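For reference, the rejected directive can be seen without submitting anything: job_script() on the cluster object returns the batch script that dask-jobqueue will hand to sbatch, and creating the cluster without scaling it does not submit any job yet. A minimal sketch with the same parameters as above:

from dask_jobqueue import SLURMCluster

cluster = SLURMCluster(project='def-ggalex',
                       cores=80, memory='100GB',
                       job_extra=['--nodes=1'],
                       walltime='0:30:00')
# The header includes a '#SBATCH --mem=...' line, which is exactly
# what Niagara's sbatch wrapper refuses.
print(cluster.job_script())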
Any suggestions on how to deal with this issue? The cluster policy indeed requires users to fully use the resources of each requested node (more details here) and, instead of a warning, it gives an error, which makes my life difficult when trying to scale out processing… I have tried to suppress the memory= argument when calling SLURMCluster, but dask does not allow this option.
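To be concrete about that last point, this is what I attempted (a minimal sketch; same parameters as above, with memory= dropped in the hope that no --mem directive would be emitted):

from dask_jobqueue import SLURMCluster

# Attempted workaround: omit `memory` entirely.
cluster = SLURMCluster(project='def-ggalex',
                       cores=80,
                       job_extra=['--nodes=1'],
                       walltime='0:30:00')

This errors out because dask-jobqueue treats memory as a required parameter: as far as I can tell, it needs the figure to set the memory limit of the workers it launches, independently of the #SBATCH directive it writes.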