Hi,
We are thinking about changing the value that is behind the “auto” chunk size configuration by default. This has historically been at 128MiB for a very long time now. (You can stop reading if you are not using “auto” for your chunks).
NumPy is pretty fast on 128MiB arrays, which oftentimes means that the scheduler is a bottleneck because each individual task is too short. Choosing a larger chunk size will alleviate a lot of the pressure that is on the scheduler.
Additionally, choosing a bigger value here will directly correlate to the size of the graph for folks who are using the auto setting. The graph will get significantly smaller and thus easier to handle for all parties involved.
TLDR: A bigger value for the default auto chunk size should translate to smaller graphs and less strain on the scheduler. It makes large scale computations easier to handle and gives better performance.
Generally, the memory usage per worker will be a bit higher with bigger chunks in this scenario. This is nothing that will make workloads fail if they ran fine before. It’s only something to be concerned about if you deploy Dask in a way that is significantly different than 4GB of memory per Core, i.e. 2GB per thread or less.
This post is intended to get feedback, so please speak up if you have thoughts on this either here or on the associated GitHub issue Better chunk size value for chunks=auto setting · Issue #11342 · dask/dask · GitHub
Greetings