So I’ve had some success after some initial troubles with rechunking. I ran into an issue where workers exceed their maximum memory usage, but only on the copy_intermediate_to_write
task. A similar issue is reported on GitHub; however, the fix in that thread does not work for me.
What has worked for me is to allow rechunker only 2GB of the 8GB available to each worker. Watching the Dask
dashboard, I can still see memory usage climb to 5-6GB per worker, even though max_mem is set to 2GB for rechunker,
as shown below.
Is this likely a problem with Dask,
or just an oddity of how my source data is structured? It seems like a possible fix would be to find a way to increase the number of copy_intermediate_to_write tasks,
since that’s the memory-intensive operation, but I didn’t see any way to do that.
from dask_cloudprovider.aws import FargateCluster
from dask.distributed import Client
from rechunker import rechunk

cluster = FargateCluster(
    n_workers=24,
    worker_cpu=1024,  # 1 vCPU
    worker_mem=8192,  # 8GB
    image="pangeo/pangeo-notebook",
    cloudwatch_logs_group=f"dask-climo-{time_string}",
    environment=get_aws_credentials(),
)
client = Client(cluster)
rechunked = rechunk(
    ds,
    target_chunks=dict(lat=10, lon=10, model=1, scenario=-1, time=39411),
    target_store=rechunker_target,
    temp_store=rechunker_int,
    max_mem=f"{8192 // 4}MB",
    executor="dask",
)