I am a grad student working on tropical climate dynamics. I was excited to learn about Pangeo, as it seems like an incredible way to do more in-depth analysis of CMIP6 data without having to download large 4d ocean variables over a slow connection.
So I went ahead to develop some code, which essentially extracts ocean velocities and temperature (along with wind stress and heat fluxes) in a tropical Pacific box with the aim of assessing feedbacks associated with ENSO across a selection of CMIP6 models.
While the code runs fine for 3d fields (like wind stress), it almost always crashes for 4d ocean variables. I get things like:
distributed.client - ERROR - Failed to reconnect to scheduler after 50.00 seconds, closing client
_GatheringFuture exception was never retrieved
future: <_GatheringFuture finished exception=CancelledError()>
asyncio.exceptions.CancelledError
It happens seemingly regardless of how big my cluster is. I've tried changing chunk sizes, and my experience is still that any time the cluster touches a 4d variable, it almost always becomes unresponsive.
So my questions are:
Is this a problem with how dask and the chunks are set up, which I could fix with more knowledge? OR
Is my aim of carrying out (heavy) calculations with 4d ocean variables even well suited for pangeo cloud?
I also have a proposition: if anyone is interested in developing a straightforward online tool for assessing ENSO dynamics in CMIP6 and how they change with global warming, and has experience with dask/Pangeo/etc., I would really love to collaborate. I think my idea is awesome and could be very useful for the climate science community, but I've also realized that I may need some help turning it into a reality, at least using Pangeo, because right now I feel a little bit like giving up on it.
What you want to do sounds reasonable and possible with Pangeo cloud. But it's true that working with 4D CMIP6 data can get complicated and computationally expensive.
Your computation is probably crashing because the workers are using too much memory. There are two basic solutions:
just configure your cluster with more memory
tweak your code to be more memory efficient.
For the latter, there are lots of options, but it's hard to make a specific suggestion without more detail. Could you share your code?
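As a rough illustration of why 4D fields put so much more pressure on worker memory than 3D ones, here is a minimal back-of-the-envelope sketch. The grid shape, the float32 assumption, and the 100 MiB per-chunk target are illustrative assumptions, not values taken from the code in this thread, and the helper names are hypothetical.

```python
from math import prod


def chunk_nbytes(chunk_shape, itemsize=4):
    """Bytes held by a single chunk (itemsize=4 assumes float32 data)."""
    return prod(chunk_shape) * itemsize


def max_time_steps_per_chunk(spatial_shape, target_bytes, itemsize=4):
    """Largest number of time steps per chunk that stays under target_bytes."""
    bytes_per_step = prod(spatial_shape) * itemsize
    return max(1, target_bytes // bytes_per_step)


# Hypothetical 4D ocean variable on a roughly 1-degree grid: (lev, y, x).
spatial_shape = (50, 180, 360)

# Compare ten years of monthly data in one chunk with one year per chunk.
for n_time in (120, 12):
    size_mib = chunk_nbytes((n_time, *spatial_shape)) / 2**20
    print(f"{n_time} time steps per chunk: {size_mib:.0f} MiB")

# Assumed target of about 100 MiB per chunk (a rule of thumb, not a hard limit).
target = 100 * 2**20
print("time steps per chunk for the target:",
      max_time_steps_per_chunk(spatial_shape, target))
```

A worker usually has to hold several chunks at once (inputs, intermediates, and outputs), so the per-chunk size should be well below the memory available to each worker.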
@jbusecke is our resident expert on CMIP6 cloud processing. I'm sure he would have some good suggestions.
Thank you very much for your helpful reply. I've made a repository on GitHub that I've shared with you and @jbusecke. Let me know if it works. I'd be more than happy to receive tips for how to make the code more memory efficient, and to get your input on whether what I'm trying to do could work well on Pangeo.
Sorry for the late reply. I took a (very brief) look at your code and have a few comments.
I think the main bottleneck will be "To get ready to perform the analysis, I like to regrid the ocean data, to make it easier to work with. Note, this section may have to be modified depending on which model you use." Depending on how you set this up, xesmf can consume a lot of memory. Is there a way you can break down the different steps and try to compute intermediate results to see "where" the calculation breaks?
Shameless self promotion alert!!! I am actively developing a package to make working with CMIP6 data easier (and my primary use case is currently 4D ocean variables). There are some examples in there that could be helpful. I am also actively working on some higher level tools that will simplify your "bookkeeping" in the first few cells.
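One way to break the calculation into steps and see where it stops could look like the sketch below. It is a generic helper with hypothetical names (`run_stages`, and the toy `subset`, `regrid`, `average` functions), not code from the notebook in question, and the failure in `regrid` is simulated. With lazy dask-backed data, each stage would also need to force evaluation (for example by ending in `.compute()` or `.persist()`), otherwise the work is only deferred to a later stage.

```python
import time


def run_stages(stages, data):
    """Run (name, function) pairs in order and report where the pipeline stops.

    Returns (result, report); report is a list of (name, status, seconds).
    """
    report = []
    for name, func in stages:
        start = time.perf_counter()
        try:
            data = func(data)
        except Exception as exc:
            # Record the failing stage and stop, so the culprit is obvious.
            report.append((name, f"failed: {exc!r}", time.perf_counter() - start))
            return None, report
        report.append((name, "ok", time.perf_counter() - start))
    return data, report


# Toy stand-ins for the real steps (assumptions for illustration only).
def subset(values):
    return values[:4]


def regrid(values):
    raise MemoryError("simulated out-of-memory during regridding")


def average(values):
    return sum(values) / len(values)


result, report = run_stages(
    [("subset", subset), ("regrid", regrid), ("average", average)],
    list(range(10)),
)
for name, status, seconds in report:
    print(f"{name}: {status} ({seconds:.3f} s)")
```

Note that on a real cluster an out-of-memory worker may be killed rather than raise a clean Python exception, so watching which stage is running when the dashboard stalls is a useful complement to this kind of report.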
Could you write out the full calculation needed for the ENSO stability index? I am not very familiar with the topic, and would like to learn more.
Overall the idea seems really cool. Let me know if the above helps you to narrow down the problem. Happy to help further if needed.
So you suspect it is the regridding that is swallowing memory? Interesting. I will follow your approach to find out where it breaks down.
Wow, awesome, I will definitely check it out. Not self-promotion when it's highly relevant.
Yes, I will upload the remainder of the code to GitHub. Actually, the end goal is to have everything online, but because the cluster kept crashing, I decided to take some things offline for now.