Xarray loading data locally when Dask is distributed

Apologies if this isn’t the right place, but I see other xarray discussion here.

I’m looking at running Dask workers in different datacenters, where workers are only able to access the data hosted locally. I’ve set up a simulation with Docker containers.

I’m seeing an odd issue when loading a netCDF file with xarray.open_dataset in a tiny Dask cluster. In some circumstances the data appears to be be available despite not being accessible to (mounted on the filesystem of) the workers.

Is it possible that the client machine (which can see the data) is fetching it and sending it to the workers for processing? If so, is that a thing that I can control in any way?

Thanks in anticipation

1 Like

Welcome @dmcg! This is a great place to ask such questions. But I have to confess that I don’t quite follow what you’re asking. You are describing a rather complex and highly customized scenario with lots of different moving parts. It’s very hard for an outsider to provide useful advice here.

If you can simplify this down to a more specific example, ideally with some code that can be used to reproduce the problem, we could probably be more helpful.

Thanks, and yes, it is pretty obscure I’m afraid!

I have a Jupyter notebook that shows the issue, but it’s pretty tortuous too. Let me see how small I can make the example and come back here then.

OK, I think I’ve worked out what is happening. A quantity that I thought was array data was actually metadata, and therefore in memory in the client after the open_dataset call. So I wasn’t actually performing calculations on an array, and the workers didn’t have to read the NetCDF file.

It’s only taken 4 days of my life to come to this conclusion :wink:

1 Like