Thanks @keewis for the idea! I have not yet been able to implement it on my end - I was struggling to get the `categories` variable into the correct format, although I did have `skimage.measure.label` working on test 2D arrays (without a time dimension).
I did end up with an approach that works for me, based on xarray's `cumsum`. It was inspired by the Stack Overflow question "Convert cumsum() output to binary array in xarray".
First, for value 1, calculate the cumulative sum, but reset it each time a 0 is found:

```python
cumsum = cube.cumsum(dim='time') - cube.cumsum(dim='time').where(cube == 0).ffill(dim='time').fillna(0)
```
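To see what this reset does, here is the same trick replayed in plain NumPy on a made-up 1-D series (`np.maximum.accumulate` stands in for the `ffill`, which works because the running total never decreases):

```python
import numpy as np

# made-up 1-D series of 0/1 flags
cube = np.array([1, 1, 0, 1, 1, 1, 0, 1])

total = np.cumsum(cube)                    # plain running sum of 1s
# running sum at the most recent 0, carried forward (ffill equivalent)
at_zero = np.maximum.accumulate(np.where(cube == 0, total, 0))
cumsum = total - at_zero                   # restarts after every 0
print(cumsum)  # [1 2 0 1 2 3 0 1]
```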
Next, find the groups of 1s that meet the condition (with `thresh = 3`, that is at least 3 consecutive observations of 1, skipping NaNs; all observations that are part of such a group will be kept):
```python
grps1 = xr.full_like(cumsum, fill_value=0)
grps1 = xr.where(cumsum >= thresh, 1, grps1)
grps1 = xr.where((cumsum > 0) & (cumsum < thresh), np.nan, grps1)
grps1 = grps1.bfill(dim='time')
```
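The marking logic can be checked in plain NumPy on a made-up reset-cumsum series (the small `bfill` helper below is only a stand-in for xarray's `.bfill`):

```python
import numpy as np

def bfill(a):
    """Fill NaNs from the next valid value to the right (like .bfill)."""
    b = a[::-1]
    idx = np.maximum.accumulate(np.where(~np.isnan(b), np.arange(b.size), 0))
    return b[idx][::-1]

thresh = 3
# made-up reset-cumsum series: a run of two 1s, a run of three, a trailing 1
cumsum = np.array([1, 2, 0, 1, 2, 3, 0, 1], dtype=float)

grps1 = np.where(cumsum >= thresh, 1.0,          # run reached thresh: keep
                 np.where(cumsum > 0, np.nan,    # mid-run: decide via bfill
                          0.0))                  # outside any run: drop
grps1 = bfill(grps1)
print(grps1.tolist())  # [0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 0.0, nan]
```

The short run is backfilled with 0 from the reset that follows it, the run that reaches `thresh` is backfilled with 1, and a run still open at the end of the series stays NaN (undecided).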
Then, flip the input cube and repeat this process to find the groups of 0s (which after flipping have value 1, so the same cumsum logic applies):
```python
cube_flip = xr.where(cube == 1, 0, cube)
cube_flip = xr.where(cube == 0, 1, cube_flip)
# Repeat the steps above on cube_flip to obtain grps0
```
Finally, create a cleaned cube based on these groups, and re-insert the NaNs from the original cube to get the desired result (outliers removed, but otherwise the original observations/NaNs remain intact):
```python
cube_clean = xr.where(grps1 == 1, 1, np.nan)
cube_clean = xr.where(grps0 == 1, 0, cube_clean)
cube_clean = xr.where(cube.isnull(), np.nan, cube_clean)
```
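As a sanity check, the whole recipe can be replayed end-to-end in plain NumPy on one made-up pixel (the array, `thresh = 3`, and the `bfill`/`mark_groups` helpers are illustrative, not part of the actual xarray code):

```python
import numpy as np

def bfill(a):
    """Fill NaNs from the next valid value to the right (like .bfill)."""
    b = a[::-1]
    idx = np.maximum.accumulate(np.where(~np.isnan(b), np.arange(b.size), 0))
    return b[idx][::-1]

def mark_groups(cube, thresh):
    # reset-cumsum: running count of 1s, restarting at every 0 (NaNs skipped)
    total = np.nancumsum(cube)
    cumsum = total - np.maximum.accumulate(np.where(cube == 0, total, 0))
    # keep runs that reach thresh, drop shorter ones, backfill the undecided
    return bfill(np.where(cumsum >= thresh, 1.0,
                          np.where(cumsum > 0, np.nan, 0.0)))

nan = np.nan
# made-up pixel: long runs, one lone 1, one lone 0, one missing observation
cube = np.array([1, 1, 1, 1, 0, 0, 0, 1, 0, 1, 1, 1, nan, 0, 0, 0])
thresh = 3

grps1 = mark_groups(cube, thresh)
cube_flip = np.where(cube == 1, 0.0, cube)
cube_flip = np.where(cube == 0, 1.0, cube_flip)
grps0 = mark_groups(cube_flip, thresh)

clean = np.where(grps1 == 1, 1.0, nan)
clean = np.where(grps0 == 1, 0.0, clean)
clean = np.where(np.isnan(cube), nan, clean)
print(clean)
# the lone 1 (index 7) and lone 0 (index 8) become NaN, while the original
# NaN (index 12) and the longer runs of 1s and 0s are left untouched
```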
This is not as short and clean as `skimage.measure.label` could potentially be, and it does not scale well if there are many groups to clean, but it is fully dask-compliant, so it runs pretty quickly when the data is later loaded into memory.