When using Dask SLURMCluster, can I avoid passing the memory argument?

When setting up a cluster on the Niagara HPC cluster, such as:

from dask_jobqueue import SLURMCluster

cluster = SLURMCluster(project='def-ggalex',
                       cores=80, memory='100GB',
                       job_extra=['--nodes=1'],
                       walltime='0:30:00')

I get a SLURM error because I pass the memory parameter to SLURMCluster:

Task exception was never retrieved
future: <Task finished name='Task-48' coro=<_wrap_awaitable() done, defined at /home/g/ggalex/lourenco/scratch/miniconda3/envs/icesat/lib/python3.8/asyncio/tasks.py:688> exception=RuntimeError('Command exited with non-zero exit code.\nExit code: 64\nCommand:\nsbatch /tmp/tmpjg_q4yzl.sh\nstdout:\n\nstderr:\nSBATCH ERROR: \n The --mem=... request is not allowed nor necessary on Niagara; all nodes have\n the same amount of available memory (175 GiB) and each job get all the\n available memory of the node\nSBATCH: 1 error was found.\nSBATCH: Job not submitted because of this error.\n\n')>
Traceback (most recent call last):
  File "/home/g/ggalex/lourenco/scratch/miniconda3/envs/icesat/lib/python3.8/asyncio/tasks.py", line 695, in _wrap_awaitable
    return (yield from awaitable.__await__())
  File "/home/g/ggalex/lourenco/scratch/miniconda3/envs/icesat/lib/python3.8/site-packages/distributed/deploy/spec.py", line 59, in _
    await self.start()
  File "/home/g/ggalex/lourenco/scratch/miniconda3/envs/icesat/lib/python3.8/site-packages/dask_jobqueue/core.py", line 325, in start
    out = await self._submit_job(fn)
  File "/home/g/ggalex/lourenco/scratch/miniconda3/envs/icesat/lib/python3.8/site-packages/dask_jobqueue/core.py", line 308, in _submit_job
    return self._call(shlex.split(self.submit_command) + [script_filename])
  File "/home/g/ggalex/lourenco/scratch/miniconda3/envs/icesat/lib/python3.8/site-packages/dask_jobqueue/core.py", line 403, in _call
    raise RuntimeError(
RuntimeError: Command exited with non-zero exit code.
Exit code: 64
Command:
sbatch /tmp/tmpjg_q4yzl.sh
stdout:

stderr:
SBATCH ERROR: 
 The --mem=... request is not allowed nor necessary on Niagara; all nodes have
 the same amount of available memory (175 GiB) and each job get all the
 available memory of the node
SBATCH: 1 error was found.
SBATCH: Job not submitted because of this error.

Any suggestions on how to deal with this issue? The cluster policy indeed requires users to fully use the requested node resources (more details here) and, instead of a warning, raises an error, which makes my life difficult when trying to scale out processing… I have tried to suppress the memory= argument when calling SLURMCluster, but Dask does not allow this option.

Hi @rlourenco,

Fortunately, we already have some tools to work around different HPC cluster configurations. Here, you’ll want to use the header_skip argument to remove the --mem request from the automatically generated batch script.
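For instance, keeping your original arguments and adding header_skip (a quick sketch, not tested on Niagara):

from dask_jobqueue import SLURMCluster

cluster = SLURMCluster(project='def-ggalex',
                       cores=80, memory='100GB',
                       header_skip=['--mem'],
                       job_extra=['--nodes=1'],
                       walltime='0:30:00')

# Inspect the generated submission script to check that the --mem line is gone.
print(cluster.job_script())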

Hopefully, at some point we’ll also get Template job scripts with Jinja by wtbarnes · Pull Request #370 · dask/dask-jobqueue · GitHub merged, which will allow fully customizing the job submission script!

Next time, don’t hesitate to open such a question as an issue on the dask-jobqueue GitHub project directly!


Hi @geynard! Thanks for the header_skip argument; it will indeed be very helpful in my case, since Compute Canada is a federation with different site policies in place.

And I will look into the PR you mentioned, as well as the dask-jobqueue documentation. Hopefully I will be able to contribute to the community in some way :slight_smile:


That would be great! :+1:


@geynard the error still persists when I test it. The call I passed is:

cluster = SLURMCluster(project='def-ggalex',
                       cores=80, 
                       header_skip=['--mem'],
                       job_extra=['--nodes=1'],                       
                       walltime='0:30:00')
client = Client(cluster)
cluster

And then I get the error:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Input In [9], in <cell line: 1>()
----> 1 cluster = SLURMCluster(project='def-ggalex',
      2                        cores=80, 
      3                        header_skip=['--mem'],
      4                        job_extra=['--nodes=1'],                       
      5                        walltime='0:30:00')
      6 client = Client(cluster)
      7 cluster

File ~/miniconda3/envs/icesat/lib/python3.8/site-packages/dask_jobqueue/core.py:529, in JobQueueCluster.__init__(self, n_workers, job_cls, loop, security, silence_logs, name, asynchronous, dashboard_address, host, scheduler_options, interface, protocol, config_name, **job_kwargs)
    524 if "processes" in self._job_kwargs and self._job_kwargs["processes"] > 1:
    525     worker["group"] = [
    526         "-" + str(i) for i in range(self._job_kwargs["processes"])
    527     ]
--> 529 self._dummy_job  # trigger property to ensure that the job is valid
    531 super().__init__(
    532     scheduler=scheduler,
    533     worker=worker,
   (...)
    538     name=name,
    539 )
    541 if n_workers:

File ~/miniconda3/envs/icesat/lib/python3.8/site-packages/dask_jobqueue/core.py:558, in JobQueueCluster._dummy_job(self)
    556     address = "tcp://<insert-scheduler-address-here>:8786"
    557 try:
--> 558     return self.job_cls(
    559         address or "tcp://<insert-scheduler-address-here>:8786",
    560         # The 'name' parameter is replaced inside Job class by the
    561         # actual Dask worker name. Using 'dummy-name here' to make it
    562         # more clear that cluster.job_script() is similar to but not
    563         # exactly the same script as the script submitted for each Dask
    564         # worker
    565         name="dummy-name",
    566         **self._job_kwargs
    567     )
    568 except TypeError as exc:
    569     # Very likely this error happened in the self.job_cls constructor
    570     # because an unexpected parameter was used in the JobQueueCluster
    571     # constructor. The next few lines builds a more user-friendly error message.
    572     match = re.search("(unexpected keyword argument.+)", str(exc))

File ~/miniconda3/envs/icesat/lib/python3.8/site-packages/dask_jobqueue/slurm.py:30, in SLURMJob.__init__(self, scheduler, name, queue, project, walltime, job_cpu, job_mem, job_extra, config_name, **base_class_kwargs)
     17 def __init__(
     18     self,
     19     scheduler=None,
   (...)
     28     **base_class_kwargs
     29 ):
---> 30     super().__init__(
     31         scheduler=scheduler, name=name, config_name=config_name, **base_class_kwargs
     32     )
     34     if queue is None:
     35         queue = dask.config.get("jobqueue.%s.queue" % self.config_name)

File ~/miniconda3/envs/icesat/lib/python3.8/site-packages/dask_jobqueue/core.py:173, in Job.__init__(self, scheduler, name, cores, memory, processes, nanny, protocol, security, interface, death_timeout, local_directory, extra, env_extra, header_skip, log_directory, shebang, python, job_name, config_name)
    171     job_class_name = self.__class__.__name__
    172     cluster_class_name = job_class_name.replace("Job", "Cluster")
--> 173     raise ValueError(
    174         "You must specify how much cores and memory per job you want to use, for example:\n"
    175         "cluster = {}(cores={}, memory={!r})".format(
    176             cluster_class_name, cores or 8, memory or "24GB"
    177         )
    178     )
    180 if job_name is None:
    181     job_name = dask.config.get("jobqueue.%s.name" % self.config_name)

ValueError: You must specify how much cores and memory per job you want to use, for example:
cluster = SLURMCluster(cores=80, memory='24GB')

I have also tested other values for header_skip=, such as '--memory' and '-m', but the error remains the same. Any hints?

@mrocklin any suggestion?

The error is explained in the message: you still need to pass the memory kwarg. Even if it is not used for Slurm, it will be used by the Dask workers to know how much memory they can use!
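In other words, keep both memory= and header_skip= (an untested sketch based on your earlier call):

from dask.distributed import Client
from dask_jobqueue import SLURMCluster

cluster = SLURMCluster(project='def-ggalex',
                       cores=80,
                       memory='100GB',          # only informs the workers' memory limit
                       header_skip=['--mem'],   # keeps --mem out of the generated sbatch script
                       job_extra=['--nodes=1'],
                       walltime='0:30:00')
client = Client(cluster)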


Now it worked! Let’s see if scaling works too :slight_smile: