Not sure if this is more of a Dask, Xarray or NetCDF question. It’s about how much data is loaded into memory when accessing a value or a subset of values in an Xarray DataArray. The question came up while preparing the FOSS4G material, and we did not find an easy answer in the various online docs.
So here it is: when loading a single value or a small subset of values from an xarray.DataArray backed by a NetCDF file stored with internal native HDF5 chunks on an object store, how much data is first read and loaded into memory:
1. The entire NetCDF file.
2. Only the one or several entire Dask array chunks necessary to load these values.
3. Same as above but a bit more precise: all the native NetCDF file chunks that are needed to load the above dask array chunks.
4. Only the NetCDF file chunks needed to load the selected values, no matter which dask array chunks are used.
So basically, when calling something like the following (LTS will be a two-dimensional Dataset):
LTS = xr.open_dataset(fs.open(s3path), chunks='auto')
test = LTS['min'].sel(lat=slice(80.,70.), lon=slice(170.,180))
Do we load the whole NetCDF file, only the needed dask.array chunks, or only the portion of the file we want to read?
I was thinking the answer was 3., but my experiments here suggest that this is 4. (which is really great if true!).
Bonus question: what about using NumPy arrays? I guess it’s always 4. (and not 1.).
This is my understanding, so could be wrong, but it’s good you bring it up. We need to write it down somewhere.
Item 3 is correct AFAICT.
Selections or subsetting don’t propagate through a dask chunking step (e.g. dask issue), so you’re always reading a whole number of dask chunks. Those reads then bring in the necessary file format chunks.
If you don’t use dask and just call open_dataset, you get xarray’s lazy array handling. This will only read as many file format chunks as necessary for the subset you request.
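To make the difference between items 3 and 4 concrete, here is a minimal sketch that just counts file chunks for both cases. The array shape, chunk sizes and selection are made up for illustration (they are not the layout of the dataset discussed here), and the code only does index arithmetic, it does not touch xarray or dask.

```python
from itertools import product


def touched_chunks(selection, chunk_shape):
    """Return the set of chunk indices that overlap a selection.

    selection   -- one (start, stop) index pair per dimension, stop exclusive
    chunk_shape -- chunk length per dimension
    """
    ranges = [
        range(start // size, (stop - 1) // size + 1)
        for (start, stop), size in zip(selection, chunk_shape)
    ]
    return set(product(*ranges))


def chunk_bounds(index, chunk_shape, array_shape):
    """Index range (start, stop) covered by one chunk, clipped to the array."""
    return [
        (i * size, min((i + 1) * size, n))
        for i, size, n in zip(index, chunk_shape, array_shape)
    ]


def file_chunks_via_dask(selection, dask_chunks, file_chunks, array_shape):
    """File chunks read when every touched dask chunk is loaded whole (item 3)."""
    needed = set()
    for idx in touched_chunks(selection, dask_chunks):
        bounds = chunk_bounds(idx, dask_chunks, array_shape)
        needed |= touched_chunks(bounds, file_chunks)
    return needed


# Hypothetical layout, chosen only for illustration.
array_shape = (1800, 3600)   # (lat, lon)
file_chunks = (100, 100)     # native HDF5 chunk shape
dask_chunks = (900, 1200)    # dask chunk shape
selection = [(100, 200), (3500, 3600)]  # index ranges of the requested subset

direct = touched_chunks(selection, file_chunks)  # item 4
via_dask = file_chunks_via_dask(selection, dask_chunks, file_chunks, array_shape)  # item 3
print(len(direct), len(via_dask))  # 1 108
```

With these made-up numbers the lazy, non-dask path would need one file chunk, while loading the enclosing dask chunk pulls in 108 of them. The real ratio depends entirely on how the dask chunks line up with the file chunks.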
This actually came up yesterday (cc @scotthq). In theory
- you should be able to pass an “area of interest” to the preprocess function as part of
- subset the xarray lazy array to the area of interest in preprocess (double check that you don’t get a dask array in preprocess, that’s key)
- then chunk the subsetted data and return a dask-ified Dataset from
This gets you minimal data reads and dask-based parallelism for the rest of the computation. I’m pretty sure I’ve seen people do this (do heavy subsetting in preprocess instead of later) and have it work much better. Though I think a lot of the advantage of doing this was negated by this dask PR (previously, when you sliced, dask would create a view of the whole array, so you didn’t get much memory savings from subsetting; this is not true anymore).
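The "subset first, chunk afterwards" idea can be sketched roughly as below. This is only an illustration under assumptions: the helper names `make_preprocess` and `open_subset` are hypothetical, the `lat`/`lon` dimension names are taken from the question, and I am assuming `preprocess` refers to the `preprocess` argument of `xarray.open_mfdataset`, which is not spelled out above.

```python
def make_preprocess(lat_slice, lon_slice, chunks):
    """Build a callable that subsets to an area of interest, then chunks.

    The order matters: .sel() runs on whatever array type the dataset holds,
    and only the already-subsetted result is handed to dask via .chunk().
    """
    def preprocess(ds):
        subset = ds.sel(lat=lat_slice, lon=lon_slice)
        return subset.chunk(chunks)
    return preprocess


def open_subset(fileobj, lat_slice, lon_slice, chunks):
    """Single-file variant: open lazily without dask, subset, then chunk."""
    import xarray as xr  # imported here so make_preprocess has no dependencies

    # No `chunks=` argument, so variables stay xarray lazy arrays (not dask).
    ds = xr.open_dataset(fileobj)
    return make_preprocess(lat_slice, lon_slice, chunks)(ds)
```

For the example in the question this would be called as something like `open_subset(fs.open(s3path), slice(80., 70.), slice(170., 180.), {"lat": 100, "lon": 100})`, where the chunk sizes are arbitrary. When passing `make_preprocess(...)` to `open_mfdataset` instead, it is worth checking inside the callable that `ds[name].chunks is None`, because if the variables are already dask-backed at that point the subsetting no longer limits what each chunk reads.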
Let’s fix this at the source. It’s an obvious and straightforward performance enhancement.
Yes this is what I thought. Thanks a lot for the quick answer.
However, this is not what I’m observing. It seems there are other optimizations somewhere, or my tests are not correct. Does anybody have an idea of how to really check what’s happening? Should I monitor incoming network traffic or disk IO? I’m currently only relying on the Dask Dashboard, and I don’t see entire chunks loaded into (or at least staying in) the workers’ memory when accessing only a small subset of values. And accessing this small subset always takes the same time, no matter which Dask/Xarray chunk size I use.
This is definitely something that should at least be written down somewhere (maybe it is in some doc on the Dask side? Didn’t find it).
What about using strace on the JupyterLab instance and the dask worker instances? Also, the test data we used is compressed. (We have several data arrays, each 2.5G in total, but all of them are in one NetCDF file, and that file was about 600M?) So we probably need to account for compression when estimating the data transfer size? Maybe it is easier for us to test with a bigger dataset like the Kerchunk example here (and/or smaller dask workers and dask options without spill, etc.).
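A lighter alternative to strace could be to count reads at the file-object level. The sketch below wraps any binary file-like object and tallies the bytes requested through it. `CountingReader` is a hypothetical helper, not part of xarray, dask or fsspec, and it is only exercised here on an in-memory buffer. Whether a given backend routes all of its reads through such a wrapper is an assumption that would need checking.

```python
import io


class CountingReader(io.RawIOBase):
    """Wrap a binary file-like object and record every read made through it."""

    def __init__(self, raw):
        super().__init__()
        self._raw = raw
        self.n_reads = 0      # number of read calls
        self.bytes_read = 0   # total bytes actually returned

    def readable(self):
        return True

    def seekable(self):
        return True

    def seek(self, offset, whence=io.SEEK_SET):
        return self._raw.seek(offset, whence)

    def tell(self):
        return self._raw.tell()

    def readinto(self, buffer):
        # RawIOBase.read() is built on readinto(), so both paths are counted.
        view = memoryview(buffer).cast("B")
        data = self._raw.read(len(view))
        view[:len(data)] = data
        self.n_reads += 1
        self.bytes_read += len(data)
        return len(data)

    def close(self):
        self._raw.close()
        super().close()


# Self-check on an in-memory buffer standing in for a NetCDF file.
reader = CountingReader(io.BytesIO(bytes(1000)))
reader.seek(100)
reader.read(50)
reader.seek(900)
reader.read(200)  # only 100 bytes remain after offset 900
print(reader.n_reads, reader.bytes_read)  # 2 150
```

The idea would be to pass `CountingReader(fs.open(s3path))` to `xr.open_dataset` and compare `bytes_read` before and after the `.sel(...)` is computed, though I have not tested that combination. Two caveats: the count is in compressed bytes as stored in the file, and a remote file object may do its own read-ahead caching, so the bytes requested here can differ from what actually goes over the network.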
That’s a really good remark!!