Challenges in Accessing 3D CMIP6 Ocean Variables (e.g., thetao, so)

I am trying to download (and regrid) CMIP6 climate data through Google Cloud.

I am roughly following the Pangeo CMIP6 tutorial.

One issue I encounter is that when I download and save 2D ocean variables (e.g. sea surface temperature, “tos”, or sea surface salinity, “sos”), the download finishes very quickly (within a minute). However, if I instead download and save only the top level of a 3D ocean field such as “thetao” or “so”, the code hangs and never completes. A 2D ocean field for one CMIP6 historical ensemble member is around 980 MB, so it shouldn’t take very long on a decent internet connection. I am not sure why downloading from the 3D data is so much slower.

I tried chunking the levels into chunks of size 1, but it didn’t help.

Any advice would be much appreciated! Here is the code I am using, which works for “tos” but not for the top level of “thetao”: download_CMIP6_minimal_working.py · GitHub

Hey @minminfu,

I suspect the issue here is that by writing to a single NetCDF file you might be losing parallelism. Can you try writing to Zarr to test this?