Technical question about reading file by chunks and Xarray/Dask/NetCDF

geynard · August 25, 2022, 2:19pm

Hi everyone,

Not sure if this is more of a Dask, Xarray or NetCDF question. It’s about how much data is loaded in memory when accessing a value or a subset of values in an Xarray DataArray. The questions occurred when preparing the FOSS4G material, and we did not find an easy answer by looking at various online docs.

So here it is: when loading a single value or a small subset of values in a Xarray.DataArray from a NetCDF file stored with internal native HDF5 chunks on an object store, how much data is first read and loaded into memory:

The entire NetCDF file,
Only one or several entire Dask array chunks necessary to load these values.
Above but a bit more precise: every native NetCDF file chunks that are needed to load the above dask array chunks.
Only the NetCDF file chunks needed to load the selected values, no matter the dask array chunks used.

So basically, when calling something like (LTS will be a 2 dimensional DataSet):

LTS = xr.open_dataset(fs.open(s3path), chunks='auto')
test = LTS['min'].sel(lat=slice(80.,70.), lon=slice(170.,180))
test.load()

Do we load whole NetCDF file, only needed dask.array chunks, or only the portion of the file we want to read?

I was thinking the answer was 3., but my experiments here suggest that this is 4. (which is really great if true!).

Bonus question, what about using Numpy arrays? I guess it’s always 4. (and not 1.).

Best,

dcherian · August 25, 2022, 4:20pm

This is my understanding, so could be wrong, but it’s good you bring it up. We need to write it down somewhere.

Item 3 is correct AFAICT.

Selections or subsetting doesn’t propagate through a dask chunking step (e.g. dask issue). So you’re always reading a multiple of dask chunks. Those reads then bring in the necessary file format chunks.

If you didn’t use dask and just open_dataset you get xarray’s lazy array handling. This will only read as many file format chunks as necessary for the subset you request.

This actually came up yesterday (cc @scotthq). In theory

you should be able to pass an “area of interest” to the preprocess function as part of open_mfdataset with chunks=False,
subset the xarray lazy array to the area of interest in preprocess (double check that you don’t get a dask array in preprocess, that’s key)
then chunk the subsetted data and return a dask-ified Dataset from preprocess.

This gets you minimal data reads and dask-based parallelism for the rest of the computation. I’m pretty sure I’ve seen people do this (do heavy subsetting in preprocess instead of later) and have it work much better. Though I think a lot of the advantage of doing this was negated by this dask PR (previously dask would slice, create a view ot the whole array, so you didn’t get much memory savings from subsetting. This is not true anymore).

rabernat · August 26, 2022, 1:40pm

Let’s fix this at the source. It’s an obvious and straightforward performance enhancement.

geynard · August 29, 2022, 12:15pm

Yes this is what I thought. Thanks a lot for the quick answer.

However, this is not what I’m observing, it seems there are other optimizations somewhere, or my tests are not correct. Does anybody has an idea on how to really check what’s happening? Should I monitor incoming network traffic or disk IO? I’m currently only relying on the Dask Dashboard, and I don’t see entire chunks loaded (or at least staying) into worker’s memory when accessing only a small subset of values. And accessing this small subset is always taking the same time, no matter the Dask/Xarray chunk size I use.

This is definitely something that should at least be written down somewhere (maybe it is in some doc on Dask side? Didn’t find it).

tinaok · August 30, 2022, 12:52pm

What about using strace on the jupyter lab instance, & dask worker instances? Also, the test data we used is compressed. (We have several data array, each 2.5G in total, but all of them in one netcdf, and the netcdf file was about 600M?). So we probably need to estimate the data transfer size with the effect of compression? May be it is easier for us to test with bigger dataset like the example of Kerchunk here (and/or smaller dask worker & dask option without spill etc)

geynard · August 30, 2022, 1:41pm

That’s a really good remark!!

Topic		Replies	Views
Unclear behavior of NetCDF4 files loaded with intake-xarray Data	6	687	June 17, 2022
Xarray loading data locally when Dask is distributed Data	3	511	February 24, 2022
Struggling with large dataset loading/reading using xarray Science	39	15214	February 16, 2023
Avoid metadata reads when loading many similar NetCDF files Data	4	536	August 3, 2023
Strange issue with .compute() of Dask array values Science	15	912	October 17, 2020

Technical question about reading file by chunks and Xarray/Dask/NetCDF

Related topics