Dask not completing large operations on SOSE data

Hi everyone,

I’m working through the Pangeo SOSE example using the Jupyter notebook on GitHub, which I’ve copied into the Pangeo cloud environment.
Everything runs smoothly until I try to load the data in the ‘Validate Budget’ section, on the following line:

th_vert = check_vertical(budget_th.isel(**time_slice), 'TH').load()

Each time I try this, the Dask dashboard shows work gradually slowing and then stopping. I’ve tried leaving it for over 10 minutes, but it doesn’t start again, and I’ve also tried different server sizes.

Things are still changing in the Worker tab, and when I check the worker logs there are errors, including:

ERROR - Decompression failed: corrupt input or insufficient space in destination buffer. Error code: 12

ERROR - Invalid size: 0x3318063313

ERROR - failed during get data with tls://10.8.21.3:40055 → tls://10.8.23.4:40111
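
In case it helps, this is roughly how I pulled those logs from the notebook (just a sketch, assuming the client object created earlier in the notebook):

# fetch the most recent log lines from each worker via the scheduler
logs = client.get_worker_logs(n=50)
for worker, lines in logs.items():
    print(worker)
    for level, message in lines:
        print(' ', level, message)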

I’m no data scientist, so I’d be very grateful if anyone has insight into what’s happening or how I could debug it! I appreciate that this code hasn’t been updated for a while, so I’m not necessarily expecting to fix the problem here; however, I’m running into the same issue when working on my own project with the SOSE data, so I think something more general is going on.

Any help would be much appreciated, and apologies if I’ve missed anything obvious or posted this in the wrong place; I’m still new to all of this!

Thanks!


We have experienced some performance regressions with a recent Dask version. (See Dask cluster stays idle for a long time before computing - #3 by stb2145 for a related issue.)

Could you re-try the same thing on https://staging.us-central1-b.gcp.pangeo.io/ and see if the problem persists?

Thanks, Ryan!
It was looking good for a minute or two but stopped suddenly with this error:


It seems like you’re probably running out of worker memory.

Can you try just giving the workers a bit more memory? The Pangeo Cloud docs explain how to do this.
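
Roughly, the Dask Gateway version looks like this (a sketch; the exact option names and allowed values depend on the deployment, so check the docs above for the current ones):

from dask_gateway import Gateway

gateway = Gateway()
options = gateway.cluster_options()
options.worker_memory = 8             # GiB per worker; raise this if workers keep dying
cluster = gateway.new_cluster(options)
cluster.scale(10)
client = cluster.get_client()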

Also, did you ever have this example working in the past?

Thanks Ryan, that’s really helpful, it’s working a lot better now. Is it ok if I continue to access Pangeo using the link you’ve shared?


Hi @eejco,
I just tested this case and hit the same error as you.
I also tried giving the workers more memory:

options = gateway.cluster_options()
options.worker_memory = 2              # GiB per worker
cluster = gateway.new_cluster(options)
cluster.scale(40)

I also moved to the new link @rabernat provided above. As you said, it is much faster, but it still failed to finish the whole computation.

@eejco, did you manage to run this case successfully now?

If I close the cluster, I can run this load() call in about 2 minutes. That suggests the data being loaded is not very large and is easy to process in a simple way. So I have this question: under what conditions should we use Dask?
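
For example, this is roughly what I mean (a sketch using the objects from the notebook):

# shut down the distributed cluster; Dask falls back to the local
# threaded scheduler running inside the notebook pod
client.close()
cluster.close()

# the same line from the notebook now finishes in about 2 minutes
th_vert = check_vertical(budget_th.isel(**time_slice), 'TH').load()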

I also have another question: where does the Dask worker memory come from? For example, the maximum memory for a Pangeo user is about 60 GB, but I can give Dask more than 60 GB (such as 40 × 2 = 80 GB) and it still works.
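
For reference, one way to check what memory the workers actually got (a sketch, assuming a connected client):

# ask the scheduler what it knows about each worker;
# memory_limit is reported in bytes
info = client.scheduler_info()
for address, worker in info['workers'].items():
    print(address, worker['memory_limit'] / 2**30, 'GiB')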


Thanks @lei, that’s really helpful!

I was using a higher worker memory (8) and cluster.adapt(). I got a bit further through the notebook, but hit errors again when computing the histograms later on.
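
In case it’s useful, the adaptive scaling bit looked roughly like this (the bounds are just what I happened to pick):

options.worker_memory = 8              # GiB per worker, set on the cluster options before creating the cluster
cluster.adapt(minimum=2, maximum=20)   # add and remove workers automatically with the workload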

It’s interesting what you say about just loading the data without using the Dask cluster; I tried this too and am finding it faster and more reliable.

Yes, the histograms fail too when not using Dask, perhaps because of the large memory cost.
I just commented out the histogram section, and the code after it runs fine.
