High unmanaged memory warning with Dask when measuring NetCDF read throughput from Google Cloud Storage

Ok, I think the answer is straightforward. The file ETOPO1_Ice_g_gmt4.nc is a NetCDF3 file. NetCDF3 files do not support internal chunking of the data: every data variable is stored as flat binary in C order. The only way to read such a file over the network with xarray is engine='scipy', and that engine does not support lazy loading of data from fsspec filesystems, as documented in this issue:
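For reference, here is a minimal sketch of the read pattern in question. The bucket path and credential settings are made up; the point is just that with engine='scipy' the data are not loaded lazily:

```python
import gcsfs
import xarray as xr

# Hypothetical bucket path; adjust credentials as appropriate
fs = gcsfs.GCSFileSystem()
f = fs.open("gs://my-bucket/ETOPO1_Ice_g_gmt4.nc", "rb")

# engine="scipy" is the only xarray engine that can read NetCDF3 from a
# file-like object, but it does not load the data lazily: touching a
# variable pulls the whole array over the network into memory.
ds = xr.open_dataset(f, engine="scipy")
print(ds)
```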

The chunks that get created when you call dsa.from_array(da) are therefore spurious and not helpful: every time you compute a single chunk, the entire array has to be read. Since you have 70 chunks, you can end up using roughly 70x the memory of the original array.
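To make that concrete, here is a sketch of (what I assume is) the pattern that produced the warning. The variable name "z" and the chunk sizes are assumptions about your file:

```python
import dask.array as dsa
import gcsfs
import xarray as xr

fs = gcsfs.GCSFileSystem()
f = fs.open("gs://my-bucket/ETOPO1_Ice_g_gmt4.nc", "rb")
ds = xr.open_dataset(f, engine="scipy")

# Wrapping the scipy-backed array in dask does not make the reads chunked,
# because the file itself has no internal chunks.
da = ds["z"]  # hypothetical variable name
chunked = dsa.from_array(da, chunks=(1000, 1000))

# Each chunk task reads the full underlying array before slicing out its
# piece, so computing many chunks in parallel holds many full copies of the
# data at once -- this shows up as "unmanaged memory" in the Dask dashboard.
one_chunk = chunked[:1000, :1000].compute()
```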

I would retry this exercise with a NetCDF4 file with appropriately configured internal chunks and see if you do any better.
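Roughly, that could look like the sketch below. The file names, variable name "z", dimension names, and chunk sizes are all assumptions to adapt to your data:

```python
import gcsfs
import xarray as xr

# One-time conversion (e.g. on a machine where the file is local):
ds = xr.open_dataset("ETOPO1_Ice_g_gmt4.nc", engine="scipy")
encoding = {"z": {"chunksizes": (1000, 1000), "zlib": True, "complevel": 1}}
ds.to_netcdf("ETOPO1_Ice_g_gmt4_nc4.nc", engine="netcdf4", encoding=encoding)

# After uploading to GCS, the NetCDF4/HDF5 file can be opened lazily through
# fsspec with the h5netcdf engine; if the dask chunks align with the internal
# chunks, each task reads only the bytes it needs.
fs = gcsfs.GCSFileSystem()
f = fs.open("gs://my-bucket/ETOPO1_Ice_g_gmt4_nc4.nc", "rb")
ds_lazy = xr.open_dataset(f, engine="h5netcdf", chunks={"y": 1000, "x": 1000})
```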
