Any suggestions for efficiently operating over windows of data?

TL;DR on efficiently slicing random windows from a reduced xarray dataset, and how I ended up doing it:

  • build a dataset with dims (“i”, “y”, “x”) where “i” indexes the valid elements (its length is the number of valid chips)
  • drop down to job.data.to_delayed().ravel(), but be careful to apply da.overlap.overlap first if you are worried about “chips” near chunk boundaries
  • count the valid “chips” with a first pass over only the data needed to determine validity. For us this takes ~30 s on a cluster over a single band, using ~np.isnan() as the valid-pixel test.
  • use those counts to provide deterministic shapes for a second pass that extracts the data: apply a delayed function over the delayed chunks from the xr dataset and iterate over each valid chip in plain numpy code (we literally slice the windows/chips one by one)
  • turn these delayed objects back into dask arrays with da.from_delayed (providing the shapes from the first pass), then da.concatenate them if needed
  • and you will have a lazy “chip” dask array you can wrap in an xarray DataArray/Dataset. A rough sketch of the whole recipe follows this list.
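
In case it helps, here is a rough, self-contained sketch of the recipe above for a single 2D band, with NaN marking invalid pixels and 15x15 chips. The names (`ds`, `band`, `count_valid`, `extract_chips`) and the toy input are made up for illustration; our real code runs over many bands and much larger arrays.

```python
import dask
import dask.array as da
import numpy as np
import xarray as xr

CHIP = 15
HALF = CHIP // 2  # halo depth so chips near chunk edges can reach into neighbours

def count_valid(block):
    # Pass 1: number of valid (non-NaN) chip centres in this chunk.
    return int((~np.isnan(block)).sum())

def extract_chips(block, n_chips):
    # Pass 2: block comes from the overlapped array, so its outer HALF rows/cols
    # are halo. Slice one CHIP x CHIP window per valid interior pixel.
    out = np.empty((n_chips, CHIP, CHIP), dtype=block.dtype)
    interior = block[HALF:-HALF, HALF:-HALF]
    ys, xs = np.nonzero(~np.isnan(interior))
    for k, (y, x) in enumerate(zip(ys, xs)):
        out[k] = block[y:y + CHIP, x:x + CHIP]
    return out

# Toy input: a mostly-NaN 2D band, chunked.
data = np.full((200, 300), np.nan)
data[np.random.default_rng(0).random((200, 300)) < 0.01] = 1.0
ds = xr.Dataset({"band": (("y", "x"), data)}).chunk({"y": 100, "x": 100})

band = ds["band"].data                                   # 2D dask array
halo = da.overlap.overlap(band, depth={0: HALF, 1: HALF},
                          boundary={0: np.nan, 1: np.nan})

plain_chunks = band.to_delayed().ravel()
halo_chunks = halo.to_delayed().ravel()

# Pass 1: cheap count over the un-overlapped chunks gives deterministic shapes.
counts = dask.compute(*[dask.delayed(count_valid)(c) for c in plain_chunks])

# Pass 2: delayed extraction, rebuilt into a lazy dask array of chips.
pieces = [
    da.from_delayed(dask.delayed(extract_chips)(c, n),
                    shape=(n, CHIP, CHIP), dtype=band.dtype)
    for c, n in zip(halo_chunks, counts) if n > 0
]
chips = da.concatenate(pieces, axis=0)
chips_da = xr.DataArray(chips, dims=("i", "y", "x"), name="chips")
```

The key point is that pass 1 counts on the un-overlapped chunks while pass 2 slices from the overlapped chunks, so the counts line up exactly with the shapes handed to da.from_delayed.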

Calling compute on this can take significant resources when there are many valid pixels. This is a good stopping point for us, so we trigger computation by persisting to zarr for follow-up ML operations; that keeps ML iteration fast and avoids dragging around the full task graph required to build the dataset. Our largest chip dataset is roughly (20e6, 64, 15, 15), extracted from (2 years, 64 variables, 50000 x, 50000 y), i.e. 20 million valid chips with 64 variables and 15x15 chips. They are sparse! It takes about 5 minutes with 32 workers, 4 threads, and 16 GB. We then build tfrecords, flatten to xgboost with dask, split + balance on coords, etc.
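
The zarr checkpoint itself is a one-liner; here is a minimal sketch assuming `chips_da` from the code above and a placeholder store path. Rechunking along “i” first gives zarr the uniform chunk sizes it expects and keeps reads sensible for downstream ML.

```python
# Persist (this triggers the computation), then reopen lazily later without
# the extraction task graph attached. Path and chunk size are placeholders.
chips_ds = chips_da.to_dataset().chunk({"i": 4096})
chips_ds.to_zarr("chips.zarr", mode="w")

reloaded = xr.open_zarr("chips.zarr")
```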