Dask not completing large operations on SOSE data

Hi everyone,

I’m working through the Pangeo SOSE example using the Jupyter notebook on GitHub, which I’ve copied into the Pangeo cloud environment.
Everything runs smoothly until I try to load the data in the ‘Validate Budget’ section, on the following line:

th_vert = check_vertical(budget_th.isel(**time_slice), 'TH').load()

Each time I try this, the Dask dashboard shows work gradually slowing and then stopping. I’ve tried leaving it for over 10 minutes, but it doesn’t start again, and I’ve also tried different server sizes.

Things are still changing in the Worker tab, and when I check the worker logs there are errors, including:

ERROR - Decompression failed: corrupt input or insufficient space in destination buffer. Error code: 12

ERROR - Invalid size: 0x3318063313

ERROR - failed during get data with tls://10.8.21.3:40055 → tls://10.8.23.4:40111
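
In case it helps, this is roughly how I pulled those logs from the notebook (just a sketch, assuming the client object created earlier in the notebook):

# fetch the most recent log lines from each worker via the scheduler
logs = client.get_worker_logs(n=50)
for worker, lines in logs.items():
    print(worker)
    for level, message in lines:
        print(' ', level, message)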

I’m no data scientist, so I’d be very grateful if anyone has insight into what’s happening or how I could debug it! I appreciate that this code hasn’t been updated for a while, so I’m not necessarily expecting to fix the problem here; however, I’m running into the same issue when working on my own project with the SOSE data, so I think something more general is going on.

Any help would be much appreciated, and apologies if I’ve missed anything obvious or posted this in the wrong place; I’m still new to all of this!

Thanks!


We have experienced some performance regressions with a recent Dask version. (See Dask cluster stays idle for a long time before computing - #3 by stb2145 for a related issue.)

Could you re-try the same thing on https://staging.us-central1-b.gcp.pangeo.io/ and see if the problem persists?

Thanks, Ryan!
It was looking good for a minute or two but stopped suddenly with this error:


It seems like you’re probably running out of worker memory.

Can you try just giving the workers a bit more memory? The Pangeo Cloud docs explain how to do this.
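
Roughly, the Dask Gateway version looks like this (a sketch; the exact option names and allowed values depend on the deployment, so check the docs above for the current ones):

from dask_gateway import Gateway

gateway = Gateway()
options = gateway.cluster_options()
options.worker_memory = 8             # GiB per worker; raise this if workers keep dying
cluster = gateway.new_cluster(options)
cluster.scale(10)
client = cluster.get_client()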

Also, did you ever have this example working in the past?

Thanks Ryan, that’s really helpful, it’s working a lot better now. Is it ok if I continue to access Pangeo using the link you’ve shared?


Hi @eejco,
I just tested this case and hit the same error as you.
I also tried giving the workers more memory:

options = gateway.cluster_options()
options.worker_memory = 2              # GiB per worker
cluster = gateway.new_cluster(options)
cluster.scale(40)

I also moved to the new link @rabernat provided above. As you said, it is much faster, but it still failed to finish the whole computation.

@eejco, did you manage to run this case successfully now?

If I close the cluster, I can run this load() call in about 2 minutes. That suggests the data being loaded is not very large and is easy to process in a simple way. So I have this question: under what conditions should we use Dask?
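
For example, this is roughly what I mean (a sketch using the objects from the notebook):

# shut down the distributed cluster; Dask falls back to the local
# threaded scheduler running inside the notebook pod
client.close()
cluster.close()

# the same line from the notebook now finishes in about 2 minutes
th_vert = check_vertical(budget_th.isel(**time_slice), 'TH').load()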

I also have another question: where does the Dask worker memory come from? For example, the maximum memory for a Pangeo user is about 60 GB, but I can give Dask more than 60 GB (such as 40 × 2 = 80 GB) and it still works.
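
For reference, one way to check what memory the workers actually got (a sketch, assuming a connected client):

# ask the scheduler what it knows about each worker;
# memory_limit is reported in bytes
info = client.scheduler_info()
for address, worker in info['workers'].items():
    print(address, worker['memory_limit'] / 2**30, 'GiB')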


Thanks @lei, that’s really helpful!

I was using a higher worker memory (8) and cluster.adapt(). I got a bit further through the notebook, but hit errors again when computing the histograms later on.
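
In case it’s useful, the adaptive scaling bit looked roughly like this (the bounds are just what I happened to pick):

options.worker_memory = 8              # GiB per worker, set on the cluster options before creating the cluster
cluster.adapt(minimum=2, maximum=20)   # add and remove workers automatically with the workload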

It’s interesting what you say about just loading the data without using the Dask cluster; I tried this too and am finding it faster and more reliable.

Yes, the histograms fail too when not using Dask, perhaps because of the large memory cost.
I just commented out the histogram section, and the code after it runs fine.
