High unmanaged memory warning with Dask when measuring NetCDF read throughput from Google Cloud Storage

Following the method that @rabernat presented in his 2020 cloud storage read throughput demonstration, I am attempting to benchmark the speedups achieved by cloud-native storage formats compared with their “classical” counterparts (CSV, NetCDF, grib2 → Parquet, Zarr, etc.). So far everything is working except measuring throughput for NetCDF and monolithic Parquet files; this post focuses on the issues I am having with NetCDF.

For background, I am running a local Dask cluster on a VM with ~30 GB of usable RAM and 8 cores. Using the intake library in Python lets me point to NetCDF data in Google Cloud Storage without passing security credentials around the cluster (which will matter when I eventually scale up the number of parallel reads).
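For reference, my setup looks roughly like this (a simplified sketch; the cluster sizing matches the VM described above):

```python
import intake
from dask.distributed import Client, LocalCluster

# Local cluster on the VM: 8 cores, ~30 GB of usable RAM
cluster = LocalCluster(n_workers=1, threads_per_worker=8, memory_limit="28GB")
client = Client(cluster)

# intake resolves the GCS credentials itself, so nothing needs
# to be shipped to the workers
data = intake.open_netcdf("gs://cloud-data-benchmarks/ETOPO1_Ice_g_gmt4.nc").to_dask()
```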

When I attempt to read the entire dataset by storing it into a null storage target, I get a high unmanaged memory warning. Oddly enough, this does not happen when I load the same data in CSV format. Monitoring memory usage while the read cell executes, I see it climb to ~20 GB before Dask raises the warning. I have already tried changing the chunk size, but no matter how small or large the chunks are, the warning persists.
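By "null storage target" I mean the pattern from the original benchmark: a mapping that discards every write, so the data is fully read but never retained. Roughly (a sketch; the variable name z for the ETOPO1 elevation data is an assumption):

```python
import dask
import dask.array as dsa

class DevNullStore:
    """Discards everything written to it, forcing a full read
    of the data without keeping any of it in memory."""
    def __setitem__(self, key, value):
        pass

da = dsa.from_array(data["z"])  # elevation variable (name assumed)
future = dsa.store(da, DevNullStore(), lock=False, compute=False)
dask.compute(future, retries=5)  # <- the line that triggers the warning
```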

From previous testing, I know the warning is triggered on the dask.compute(future, retries=5) line. I have spent quite a while trying to find a solution, but even my mentors haven't been able to pin down the specifics of the issue. Any guidance on a possible fix would be greatly appreciated, and please let me know if more information is needed. You can find my full benchmarking notebook in my GitHub repository.

EDIT: The specific test data I am using is from the ETOPO1 Global Relief Model. I am an intern for a parallel processing company but study Aerospace Engineering, so I’m still a newbie to the world of big climate data and processing.


Hi @jgreen, thanks for the interesting question!

I wanted to investigate, but I found that the file (gs://cloud-data-benchmarks/ETOPO1_Ice_g_gmt4.nc) is not public. Could you make it public so others can access it and run your code?

@rabernat Sorry about that! Unfortunately, the data lives in my company's Google bucket, which requires a private token that I am not allowed to distribute. I will ask my supervisors about creating a separate bucket that you can access to run the notebook and let you know if/when it is ready.

Also, it is worth noting that when I increased the number of workers from 1 to 5, the NetCDF section of the notebook ran just fine.

Only the NetCDF file should be public now. If you change

data = intake.open_netcdf('gs://cloud-data-benchmarks/ETOPO1_Ice_g_gmt4.nc').to_dask()

to

data = intake.open_netcdf('https://storage.googleapis.com/cloud-data-benchmarks/ETOPO1_Ice_g_gmt4.nc').to_dask()

the NetCDF section of the notebook should function correctly. Thanks again for taking a look at this issue!

Ok, I think the answer is straightforward. ETOPO1_Ice_g_gmt4.nc is a netCDF3 file. NetCDF3 files do not support internal chunking of the data; all data variables are stored as simple flat binary blobs in C order. The only way to read such a file over the network is with engine='scipy', and that engine does not support lazy loading of data from fsspec filesystems, as documented in this issue.
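(You can check the format yourself from the file's magic bytes; a quick sketch using fsspec:)

```python
import fsspec

url = "https://storage.googleapis.com/cloud-data-benchmarks/ETOPO1_Ice_g_gmt4.nc"
with fsspec.open(url, "rb") as f:
    print(f.read(4))
# netCDF3 "classic" files start with b"CDF\x01" (b"CDF\x02" for
# the 64-bit offset variant); netCDF4/HDF5 files start with b"\x89HDF"
```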

The chunks that get created when you call dsa.from_array(da) are spurious and not helpful: every time you compute a single chunk, the entire array has to be read. Since you have 70 chunks, you end up using 70x the memory of the original array.

I would retry this exercise with a NetCDF4 file with appropriately configured internal chunks and see if you do any better.
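The conversion could look something like this (an untested sketch; the chunk sizes would need tuning for your access pattern, and the variable name z is an assumption):

```python
import xarray as xr

# Open a local copy of the netCDF3 file and rewrite it as
# netCDF4/HDF5 with internal chunking and compression
ds = xr.open_dataset("ETOPO1_Ice_g_gmt4.nc")
ds.to_netcdf(
    "ETOPO1_Ice_g_gmt4_nc4.nc",
    format="NETCDF4",
    engine="netcdf4",
    encoding={"z": {"chunksizes": (1000, 1000), "zlib": True}},
)
```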


Converting to NetCDF4 and rerunning worked wonderfully. I really appreciate the help!
