Following the method that @rabernat presented in his cloud storage read throughput demonstration from 2020, I am attempting to benchmark the speedups achieved by cloud-native storage formats compared with their “classical” counterparts (CSV, NetCDF, GRIB2 → Parquet, Zarr, etc.). So far, everything is working except for measuring throughput on NetCDF and monolithic Parquet files. This post focuses on the issues I am having with NetCDF.
For background, I am running a local cluster on a VM with ~30 GB of usable RAM and 8 cores. The intake library in Python lets me point to NetCDF data in Google Cloud Storage without passing security credentials around the cluster (which will matter when I eventually scale up the number of parallel reads).
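For reference, the access pattern looks roughly like this. This is a simplified sketch assuming intake-xarray is installed; the authentication setting is a placeholder for what the notebook actually does:

```python
import intake

# Point intake-xarray at the NetCDF object in GCS; the idea is that each
# worker resolves its own credentials rather than having them shipped
# around the cluster.
source = intake.open_netcdf(
    "gs://cloud-data-benchmarks/ETOPO1_Ice_g_gmt4.nc",
    chunks={},  # let xarray/dask pick the chunking
    storage_options={"token": "cloud"},  # placeholder auth setting
)
ds = source.to_dask()  # lazy xarray.Dataset backed by dask arrays
```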
When I attempt to read the entire dataset by storing it into a null storage target, I get a high unmanaged memory warning from Dask. Oddly enough, this does not happen when I load the same data in CSV format. Monitoring memory usage while the read cell executes, I see it climb to ~20 GB before the warning is raised. I have already tried changing the chunk size, but no matter how small or large the chunks are, the warning persists.
From previous testing, I know that the error occurs on the dask.compute(future, retries=5) line. I have spent quite a while trying to find a solution, but even my mentors haven’t been able to nail down the specifics of the issue. Any guidance on a possible fix would be greatly appreciated, and please let me know if more information is needed. You can find my entire benchmarking notebook in my GitHub repository.
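For context, the null-storage-target read looks roughly like this in my notebook (following the 2020 demonstration); the variable name is illustrative:

```python
import dask
import dask.array as dsa


class DevNullStore:
    """Storage target that discards everything written to it,
    so only read throughput is measured."""

    def __setitem__(self, key, value):
        pass


# `data` is the dask array backing the NetCDF variable opened above.
data = ds["z"].data  # variable name is illustrative
null_store = DevNullStore()
future = dsa.store(data, null_store, lock=False, compute=False)
dask.compute(future, retries=5)  # this is the line that triggers the warning
```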
EDIT: The specific test data I am using is from the ETOPO1 Global Relief Model. I am an intern for a parallel processing company but study Aerospace Engineering, so I’m still a newbie to the world of big climate data and processing.
I wanted to investigate, but this file (gs://cloud-data-benchmarks/ETOPO1_Ice_g_gmt4.nc) is not public. Could you make it public so others can access it and run your code?
@rabernat Sorry about this! Unfortunately, it is in my company’s Google bucket, which requires a private token that I am not allowed to distribute. I will ask my supervisors about creating a separate bucket that you can access to run this notebook and will let you know if/when it is ready.
Also, it is worth noting that when I increased the number of workers from 1 to 5, the NetCDF section of the notebook ran just fine.
Ok, I think the answer is straightforward. This file, ETOPO1_Ice_g_gmt4.nc, is a netCDF3 file. NetCDF3 files do not support internal chunking of the data; all data variables are stored as flat binary data in C order. The only way to read one over the internet is with engine='scipy', and that engine does not support lazy loading of data from fsspec filesystems, as documented in this issue:
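You can verify the format yourself by peeking at the file’s magic bytes (a quick check, assuming you have read access to the bucket):

```python
import fsspec

# netCDF3 classic / 64-bit-offset files start with b"CDF\x01" / b"CDF\x02";
# netCDF4 files are HDF5 containers and start with b"\x89HDF\r\n\x1a\n".
with fsspec.open("gs://cloud-data-benchmarks/ETOPO1_Ice_g_gmt4.nc", "rb") as f:
    print(f.read(8))
```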
The chunks that get created when you call dsa.from_array(da) are spurious and not helpful. Every time you compute a single chunk, the entire array has to be read, so with 70 chunks you end up using 70x the memory of the original array.
I would retry this exercise with a netCDF4 file that has appropriately configured internal chunks and see if you do any better.
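For example, something along these lines would produce a netCDF4 copy with internal chunks (the variable name and chunk sizes are guesses; pick chunks that match your access pattern):

```python
import xarray as xr

# Open a local copy of the netCDF3 file and rewrite it as netCDF4/HDF5
# with internal chunking and light compression.
ds = xr.open_dataset("ETOPO1_Ice_g_gmt4.nc")
encoding = {"z": {"chunksizes": (1800, 2700), "zlib": True, "complevel": 1}}
ds.to_netcdf(
    "ETOPO1_Ice_g_gmt4_nc4.nc",
    format="NETCDF4",
    engine="netcdf4",
    encoding=encoding,
)
```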