When I start the local cluster, I can specify `n_workers` and `threads_per_worker`.
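For example, something like this (the numbers are arbitrary placeholders):

```python
from dask.distributed import Client, LocalCluster

# Hypothetical settings: this is exactly the choice I'm asking about
cluster = LocalCluster(n_workers=4, threads_per_worker=2)
client = Client(cluster)
```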
Is there a good rule of thumb or procedure for setting these optimally, and for testing whether I have set them optimally, based on my system information (e.g. number of CPUs, memory, etc.)?
I have read a lot of documentation and Stack Overflow, but I can't seem to find a clean explanation that I understand and can explain clearly to students in my Climate Data class.
I am hoping someone here can help. Thanks!
Thanks for sharing this interesting question!
In my experience, distributed's `LocalCluster` with no options automatically chooses these parameters in a more-or-less optimal way. Are you finding problems with these defaults?
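For instance, with no arguments at all, distributed picks workers, threads, and memory limits from your machine's CPU count and RAM (a minimal sketch):

```python
from dask.distributed import Client, LocalCluster

# With no arguments, dask inspects the machine and picks
# n_workers, threads_per_worker, and a per-worker memory limit.
cluster = LocalCluster()
client = Client(cluster)
print(cluster)  # the repr shows what was chosen
```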
The general constraints are:
- The total amount of memory in the cluster should not exceed the total amount of memory on your computer (and really should be less, since you presumably have other applications running)
- The total number of cores should not exceed the number of cores in your computer. Most laptops have 4 cores or fewer, so there is not much wiggle room here. Total cores is `n_workers * threads_per_worker` (see the sketch after this list).
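A rough sanity check of both constraints might look like this (a sketch; `psutil` is a dependency of distributed, and the 4-worker choice is just an example):

```python
import os
import psutil  # installed as a dependency of distributed

n_cores = os.cpu_count()
total_mem = psutil.virtual_memory().total

n_workers = 4  # example choice, not a recommendation
threads_per_worker = max(1, n_cores // n_workers)
memory_limit = int(total_mem * 0.8) // n_workers  # leave headroom for other apps

# Constraint check: total cores must fit on the machine
assert n_workers * threads_per_worker <= n_cores
```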
Within those constraints, you can play with different numbers of workers. You can also have just one worker with multiple threads. Threads can share memory, while processes (and workers) cannot. I have never known a good way to decide between processes and threads.
I personally tend not to use the distributed scheduler on a local machine, except when I need the dashboard to debug something. The default threaded scheduler seems to work just as well, if not better; there is overhead involved in running the distributed scheduler.
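For example, this runs on the default threaded scheduler with no cluster or client at all (a minimal sketch):

```python
import dask.array as da

# No Client created, so dask falls back to its default threaded scheduler
x = da.random.random((20_000, 20_000), chunks=(2_000, 2_000))
result = x.mean().compute()
```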
Thanks for the reply @rabernat. I'm working on my department's Linux cluster. For some reason I had the impression that I was supposed to set threads and workers in some optimal way, or my code would not be distributed across all the available cores by default. Seems I misunderstood. I will do some testing.
https://docs.dask.org/en/latest/setup/single-machine.html has a short discussion on choosing between a mix of threads and processes. Typically it comes down to whether or not your computation holds Python’s global interpreter lock.
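In code, that choice looks roughly like this (a sketch; pick one or the other):

```python
from dask.distributed import Client

# NumPy/pandas-heavy work mostly releases the GIL, so threads work well:
client = Client(processes=False)  # one process, multiple threads

# Pure-Python work holds the GIL, so spread it across processes instead:
# client = Client(n_workers=4, threads_per_worker=1)
```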
Another consideration is that the HDF5 library cannot parallelize reads across threads (it serializes access with a global lock). Since I work with large collections of netCDF files, and file reads are usually the slowest part of my workflow, I always go with multiple processes.
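Concretely, that means something like this (a sketch; the worker count and file pattern are made up, and I'm assuming xarray for the reads):

```python
import xarray as xr
from dask.distributed import Client

# Single-threaded workers: HDF5 reads can then run in parallel across processes
client = Client(n_workers=8, threads_per_worker=1)

ds = xr.open_mfdataset("data/*.nc", parallel=True)  # hypothetical file pattern
```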