So I’ve had some success after some initial troubles with rechunking. I ran into an issue where workers exceed their maximum memory usage, but only on the copy_intermediate_to_write
task. A similar issue is reported on GitHub; however, the fix in that thread does not work for me.
What has worked for me is to allow rechunker only 2GB of the 8GB available to each worker. Watching the Dask
dashboard, I can still see memory usage climb to 5-6GB per worker, even though max_mem is set to 2GB for rechunker,
as shown below.
Is this likely a problem with Dask,
or just an oddity of how my source data is structured? It seems like a possible fix would be to find a way to increase the number of copy_intermediate_to_write tasks,
since that’s the memory-intensive operation, but I didn’t see any way to do that.
from dask_cloudprovider.aws import FargateCluster
from dask.distributed import Client
from rechunker import rechunk

cluster = FargateCluster(
    n_workers=24,
    worker_cpu=1024,  # 1 vCPU
    worker_mem=8192,  # 8GB
    image="pangeo/pangeo-notebook",
    cloudwatch_logs_group=f"dask-climo-{time_string}",
    environment=get_aws_credentials(),
)
client = Client(cluster)
rechunked = rechunk(
    ds,
    target_chunks=dict(lat=10, lon=10, model=1, scenario=-1, time=39411),
    target_store=rechunker_target,
    temp_store=rechunker_int,
    max_mem=f"{8192 // 4}MB",
    executor="dask",
)