High unmanaged memory warning with Dask when measuring NetCDF read throughput from Google Cloud Storage

Ok, I think the answer is straightforward. The file ETOPO1_Ice_g_gmt4.nc is a NetCDF3 file. NetCDF3 files do not support internal chunking of the data: every data variable is stored as flat binary in C order. The only way to read such a file over the network with xarray is engine='scipy', and that engine does not support lazy loading of data from fsspec filesystems, as documented in this issue:
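For reference, here is a minimal sketch of the read pattern in question. The bucket path and credential settings are made up; the point is just that with engine='scipy' the data are not loaded lazily:

```python
import gcsfs
import xarray as xr

# Hypothetical bucket path; adjust credentials as appropriate
fs = gcsfs.GCSFileSystem()
f = fs.open("gs://my-bucket/ETOPO1_Ice_g_gmt4.nc", "rb")

# engine="scipy" is the only xarray engine that can read NetCDF3 from a
# file-like object, but it does not load the data lazily: touching a
# variable pulls the whole array over the network into memory.
ds = xr.open_dataset(f, engine="scipy")
print(ds)
```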

The chunks that get created when you call dsa.from_array(da) are therefore spurious and not helpful: every time you compute a single chunk, the entire array has to be read. Since you have 70 chunks, you can end up using roughly 70x the memory of the original array.
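To make that concrete, here is a sketch of (what I assume is) the pattern that produced the warning. The variable name "z" and the chunk sizes are assumptions about your file:

```python
import dask.array as dsa
import gcsfs
import xarray as xr

fs = gcsfs.GCSFileSystem()
f = fs.open("gs://my-bucket/ETOPO1_Ice_g_gmt4.nc", "rb")
ds = xr.open_dataset(f, engine="scipy")

# Wrapping the scipy-backed array in dask does not make the reads chunked,
# because the file itself has no internal chunks.
da = ds["z"]  # hypothetical variable name
chunked = dsa.from_array(da, chunks=(1000, 1000))

# Each chunk task reads the full underlying array before slicing out its
# piece, so computing many chunks in parallel holds many full copies of the
# data at once -- this shows up as "unmanaged memory" in the Dask dashboard.
one_chunk = chunked[:1000, :1000].compute()
```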

I would retry this exercise with a NetCDF4 file with appropriately configured internal chunks and see if you do any better.
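Roughly, that could look like the sketch below. The file names, variable name "z", dimension names, and chunk sizes are all assumptions to adapt to your data:

```python
import gcsfs
import xarray as xr

# One-time conversion (e.g. on a machine where the file is local):
ds = xr.open_dataset("ETOPO1_Ice_g_gmt4.nc", engine="scipy")
encoding = {"z": {"chunksizes": (1000, 1000), "zlib": True, "complevel": 1}}
ds.to_netcdf("ETOPO1_Ice_g_gmt4_nc4.nc", engine="netcdf4", encoding=encoding)

# After uploading to GCS, the NetCDF4/HDF5 file can be opened lazily through
# fsspec with the h5netcdf engine; if the dask chunks align with the internal
# chunks, each task reads only the bytes it needs.
fs = gcsfs.GCSFileSystem()
f = fs.open("gs://my-bucket/ETOPO1_Ice_g_gmt4_nc4.nc", "rb")
ds_lazy = xr.open_dataset(f, engine="h5netcdf", chunks={"y": 1000, "x": 1000})
```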
