Any suggestions for efficiently operating over windows of data?

TL;DR on efficiently slicing random windows from a reduced xarray dataset, and how I ended up doing it:

  • build a dataset with dims (“i”, “y”, “x”) where “i” indexes the valid elements (its length is the number of valid chips)
  • drop down to job.data.to_delayed().ravel(), but be careful to apply da.overlap.overlap first if you are worried about “chips” near chunk boundaries
  • count the valid “chips” with a first pass over only the data needed to determine validity. For us this takes ~30 s on a cluster over a single band, using ~np.isnan() as the valid-pixel test.
  • use those counts to provide deterministic shapes for a second pass that extracts the data: apply a delayed function over the delayed chunks from the xr dataset and iterate over each valid chip in plain numpy code (we literally slice the windows/chips one by one)
  • turn these delayed objects back into dask arrays with da.from_delayed (providing the shapes from the first pass), then da.concatenate them if needed
  • and you will have a lazy “chip” dask array you can wrap in an xarray DataArray/Dataset. A rough sketch of the whole recipe follows this list.
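
In case it helps, here is a rough, self-contained sketch of the recipe above for a single 2D band, with NaN marking invalid pixels and 15x15 chips. The names (`ds`, `band`, `count_valid`, `extract_chips`) and the toy input are made up for illustration; our real code runs over many bands and much larger arrays.

```python
import dask
import dask.array as da
import numpy as np
import xarray as xr

CHIP = 15
HALF = CHIP // 2  # halo depth so chips near chunk edges can reach into neighbours

def count_valid(block):
    # Pass 1: number of valid (non-NaN) chip centres in this chunk.
    return int((~np.isnan(block)).sum())

def extract_chips(block, n_chips):
    # Pass 2: block comes from the overlapped array, so its outer HALF rows/cols
    # are halo. Slice one CHIP x CHIP window per valid interior pixel.
    out = np.empty((n_chips, CHIP, CHIP), dtype=block.dtype)
    interior = block[HALF:-HALF, HALF:-HALF]
    ys, xs = np.nonzero(~np.isnan(interior))
    for k, (y, x) in enumerate(zip(ys, xs)):
        out[k] = block[y:y + CHIP, x:x + CHIP]
    return out

# Toy input: a mostly-NaN 2D band, chunked.
data = np.full((200, 300), np.nan)
data[np.random.default_rng(0).random((200, 300)) < 0.01] = 1.0
ds = xr.Dataset({"band": (("y", "x"), data)}).chunk({"y": 100, "x": 100})

band = ds["band"].data                                   # 2D dask array
halo = da.overlap.overlap(band, depth={0: HALF, 1: HALF},
                          boundary={0: np.nan, 1: np.nan})

plain_chunks = band.to_delayed().ravel()
halo_chunks = halo.to_delayed().ravel()

# Pass 1: cheap count over the un-overlapped chunks gives deterministic shapes.
counts = dask.compute(*[dask.delayed(count_valid)(c) for c in plain_chunks])

# Pass 2: delayed extraction, rebuilt into a lazy dask array of chips.
pieces = [
    da.from_delayed(dask.delayed(extract_chips)(c, n),
                    shape=(n, CHIP, CHIP), dtype=band.dtype)
    for c, n in zip(halo_chunks, counts) if n > 0
]
chips = da.concatenate(pieces, axis=0)
chips_da = xr.DataArray(chips, dims=("i", "y", "x"), name="chips")
```

The key point is that pass 1 counts on the un-overlapped chunks while pass 2 slices from the overlapped chunks, so the counts line up exactly with the shapes handed to da.from_delayed.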

Calling compute on this can take significant resources when there are many valid pixels. This is a good stopping point for us, so we trigger computation by persisting to zarr for follow-up ML operations; that keeps ML iteration fast and avoids dragging around the full task graph required to build the dataset. Our largest chip dataset is roughly (20e6, 64, 15, 15), extracted from (2 years, 64 variables, 50000 x, 50000 y), i.e. 20 million valid chips with 64 variables and 15x15 chips. They are sparse! It takes about 5 minutes with 32 workers, 4 threads, and 16 GB. We then build tfrecords, flatten to xgboost with dask, split + balance on coords, etc.
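
The zarr checkpoint itself is a one-liner; here is a minimal sketch assuming `chips_da` from the code above and a placeholder store path. Rechunking along “i” first gives zarr the uniform chunk sizes it expects and keeps reads sensible for downstream ML.

```python
# Persist (this triggers the computation), then reopen lazily later without
# the extraction task graph attached. Path and chunk size are placeholders.
chips_ds = chips_da.to_dataset().chunk({"i": 4096})
chips_ds.to_zarr("chips.zarr", mode="w")

reloaded = xr.open_zarr("chips.zarr")
```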