Xarray operations (e.g., preprocess) running locally (post open_mfdataset) instead of on Dask distributed cluster

For reference, I’ve also asked the same question at

and on the Coiled Slack. I will update this discussion if I get a positive answer from either of those places.

I’m using dask.distributed and a Coiled cluster to open multiple NetCDF files with xarray. Nothing fancy here.

But there is also a preprocess_xarray function running while the files are being opened. This preprocess_xarray function is completely “empty” in my first showcase example. However, all of the NetCDF files are being downloaded to my local machine, the one running the client (I’m monitoring my network usage). If this function actually did something, it would use my local memory …
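Roughly, the setup looks like the sketch below. This is a minimal, assumption-filled version and not the actual gist: the bucket prefix, engine, anonymous S3 access and cluster size are all placeholders.

```python
# Minimal sketch of the setup described above. The bucket prefix, engine,
# anonymous S3 access and cluster size are placeholder assumptions, not the
# values used in the actual gist.
import coiled
import s3fs
import xarray as xr
from distributed import Client


def preprocess_xarray(ds):
    """Deliberately 'empty' preprocess: returns the dataset unchanged."""
    return ds


cluster = coiled.Cluster(n_workers=4)  # hypothetical Coiled cluster
client = Client(cluster)

fs = s3fs.S3FileSystem(anon=True)                     # placeholder auth
paths = fs.glob("s3://some-bucket/some-prefix/*.nc")  # placeholder prefix
files = [fs.open(p) for p in paths]

# With parallel=True the per-file opens (and the preprocess calls) are
# expected to be dispatched to the Dask workers, not run on the client.
ds = xr.open_mfdataset(
    files,
    engine="h5netcdf",
    parallel=True,
    preprocess=preprocess_xarray,
    combine="by_coords",
)
```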

I’ve tried many things, such as adding a @delayed decorator to force this function to run on the cluster (roughly sketched below), and so on, with no success.
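For instance, one of those attempts looked something like this (a rough sketch, not the exact gist code):

```python
# Rough sketch of the @delayed attempt: decorating the preprocess function
# in the hope of forcing it to execute on the cluster workers.
from dask import delayed


@delayed
def preprocess_xarray(ds):
    return ds

# Passed as xr.open_mfdataset(..., preprocess=preprocess_xarray); in
# practice this did not change where the work ran.
```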

I have a reproducible example here: code to showcase issues with dask distributed running preprocess on local machine instead of cluster · GitHub. This is a very simplified version of a bigger script available at GitHub - aodn/aodn_cloud_optimised: Cloud optimised data formats.

Also, the reason preprocess_xarray is not part of the class in my example is that, as a method, it can’t be serialized and used with partial (see the sketch below).
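As an illustration of that constraint (the class context, attribute and keyword argument below are made up, not taken from aodn_cloud_optimised): a module-level function bound with functools.partial stays picklable and can be shipped to the workers, whereas a method on an object holding unpicklable state cannot.

```python
# Illustration only: the keyword argument and its value are hypothetical.
from functools import partial


def preprocess_xarray(ds, dataset_config=None):
    # module-level function: picklable, so it can be shipped to workers
    return ds


# Binding arguments with partial keeps the result picklable as well.
preprocess = partial(preprocess_xarray, dataset_config={"name": "demo"})

# By contrast, a bound method of a handler class that holds unpicklable
# state (open S3 clients, loggers, ...) could not be serialised this way.
```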

Here is another, slightly different version of that gist. This time the preprocess function creates a dummy variable, and all of my local machine’s memory is being used.
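For illustration, the second variant’s preprocess is along these lines (the variable name and how it is built are placeholder guesses, not the gist’s actual code):

```python
# Sketch of the second variant: the preprocess function adds a throwaway
# variable. The variable name and its contents are placeholder guesses.
import xarray as xr


def preprocess_xarray(ds):
    # create a dummy variable shaped like an existing data variable
    first_var = next(iter(ds.data_vars))
    ds["dummy"] = xr.zeros_like(ds[first_var])
    return ds
```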

Cheers,
Loz

Answered there… it looks like your client is not running anything?

For reference, answered here: Xarray operations (e.g., preprocess) running locally (post open_mfdataset) instead of on Dask distributed cluster · dask/distributed · Discussion #8913 · GitHub

The cause is an s3fs bug.