Very large memory load when using a fast parallel file system

I’ve been experiencing curious behaviour when using xarray and dask-jobqueue to parallelise the computation of a monthly climatology from zarr-format data stored on an ultra-fast parallel file system (BeeGFS: https://www.beegfs.io/content/).
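Roughly, the computation has this shape (the paths, chunking and cluster settings below are just illustrative placeholders, not my exact setup):

```python
# Illustrative sketch only: paths, chunk sizes, and cluster settings are
# placeholders, not the exact configuration that triggers the problem.
import xarray as xr
from dask.distributed import Client
from dask_jobqueue import SLURMCluster

# Start Dask workers via the HPC batch scheduler
cluster = SLURMCluster(cores=4, processes=2, memory="32GB")
cluster.scale(jobs=8)
client = Client(cluster)

# Lazily open the zarr store sitting on the parallel file system
ds = xr.open_zarr("/path/to/input.zarr")

# Monthly climatology: group by calendar month and average over time
clim = ds.groupby("time.month").mean("time")
clim.to_zarr("/path/to/climatology.zarr")
```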

The memory load on each worker increases very rapidly throughout the operation, to values substantially larger than what I believe any single worker should hold at any one time, given the chunking. This does not occur (or at least occurs to a much lesser degree) when the data is on a different file system.

It’s almost as though the workers cannot reduce and process data fast enough to keep up with reading the input. This probably makes no sense.

Given that the issue seems to be related to the hardware I’m using, it’s difficult for me to provide a reproducible example. So, I’m seeking input regarding how to debug and understand this issue. My apologies in advance if this is not the right forum for this type of question.


Hi @dougiesquire - welcome to the forum and thanks for your message!

What you have encountered is a long-standing issue with Dask. There is a GitHub issue for it, and a lot of active development is happening in this area.

I strongly encourage you to engage on that GitHub issue and describe your experience. It would be particularly useful to see the actual code you are using that triggers this memory overload. The Dask gurus might have some useful workarounds.

Hi @dougiesquire, just wanted to follow up. If you could weigh in on the linked GitHub issue, it would add a lot to the discussion and motivate the Dask developers to keep focusing on this problem. The more people they see chiming in, the higher priority it is likely to receive.

Let me know if there is any way I can help you with this.

Hey, just chiming in here to say that the problem might not come from Dask. We’ve experienced similar behaviour with the Ifremer cluster:

This was due to a problem in their system configuration, though I’m not sure exactly where. That case was on a Lustre file system.

Maybe you can test the snippet given in the issue, and tell us if you can reproduce your problem.


Forgot one thing: you may want to use the Dask Dashboard to check whether your guess is correct. It will tell you a lot about how much data is being held in memory on the Dask side!
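If it helps, the dashboard address can be grabbed straight from the client object (a minimal sketch, assuming you already have a dask-jobqueue cluster running):

```python
from dask.distributed import Client

client = Client(cluster)  # `cluster` is whatever dask-jobqueue cluster you already created

# Open this URL in a browser (often through an SSH tunnel from the HPC login node)
# to watch per-worker memory, the task stream, and spill-to-disk activity.
print(client.dashboard_link)
```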


Thanks for the feedback, @rabernat and @geynard!

I’ve posted this issue on the recommended GitHub thread.

I should have been clearer in my original post: my references to “memory load” are as diagnosed by the Dask Dashboard. So I’m pretty confident that my case is associated with Dask.


I have recently faced the same problem. Thanks for the link @rabernat, I’ll post on the GitHub issue.

@dougiesquire I have had similar issues. The workaround I have used on the CSIRO HPC is to allow spilling to disk, since the nodes have local storage accessible at $TMPDIR. I believe this is better than using a location on the parallel file system, because of the many small reads and writes that happen with caching. The workers will go orange, but your calculation may still complete without killing the workers.

Here is the jobqueue.yml that I use:

```yaml
distributed:
  worker:
    memory:
      target: 0.6
      spill: 0.75
      pause: 0.80       # fraction at which we pause worker threads
      terminate: 0.95   # fraction at which we terminate the worker

jobqueue:
  slurm:
    cores: 4
    memory: 32GB
    processes: 2
    queue: h2
    walltime: 120
    local-directory: $TMPDIR
```
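If it’s useful, the jobqueue part of that can also be expressed directly in code when creating the cluster (a sketch only; adjust the queue, walltime and sizes to your own system, and I keep the worker memory thresholds in the config file as above):

```python
from dask.distributed import Client
from dask_jobqueue import SLURMCluster

# Mirrors the jobqueue section of the YAML above
cluster = SLURMCluster(
    cores=4,
    processes=2,
    memory="32GB",
    queue="h2",
    walltime="120",
    local_directory="$TMPDIR",  # spill to node-local storage, not the parallel file system
)
cluster.scale(jobs=4)
client = Client(cluster)
```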

When working on the Pawsey HPC, the nodes don’t have local storage, but if you are happy to use containers with Shifter you can create a so-called ‘per-node cache’ that is overlaid onto the container filesystem and reduces the metadata load on the parallel file system. There is an example here:


Increase the size=4G part to suit your needs; I have used up to 40G per worker without issue.

Hopefully some of the ideas that @TomAugspurger and @rabernat have discussed in those GitHub issues will make it into a future Dask release.


Thanks @pbranson! I’m hopeful that solutions will eventually be implemented within Dask. In the meantime, spilling to disk at least means the jobs will finish, and these are great tips for Australian HPC systems.

Not surprisingly, within a similar CSIRO Pangeo HPC environment that includes our new fast BeeGFS storage, I seem to be hitting similar issues. @pbranson, my attempt to set the local-directory to $TMPDIR doesn’t appear to help, although that’s anecdotal as I’ve not had time to do many comparisons. As someone not well versed in investigating Dask code under the hood, I’m wondering what I could be logging or documenting that might help others explore a fix in Dask?
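For example, would capturing a performance report around the offending computation be useful to attach to the GitHub issue? Something along these lines (a sketch, assuming a reasonably recent version of distributed; `ds` stands in for however the dataset is opened):

```python
from dask.distributed import performance_report

# Wrap the problematic computation; the resulting HTML bundles the task stream,
# per-worker memory, and transfer statistics into a single shareable file.
with performance_report(filename="beegfs-climatology-report.html"):
    clim = ds.groupby("time.month").mean("time")
    clim.compute()
```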

Would there be a way to get me access to one of these machines so I can take a look at things? Or is anyone able to do a screen share with me?

Thanks for the reply @TomAugspurger - I’ll DM you.