Xarray time-series, how to remove local outliers?

Hello,

I am using geospatial packages from the open source Python ecosystem (e.g., stackstac, xarray, dask) to do some work and am stuck on a step, so I thought I would make a post here. Please let me know if there is a better place to share these sorts of questions, thanks!

So, at this point in my workflow I have an xarray DataArray representing a tile in a larger area (77 timesteps, 2001 x 2001 pixels, float32 numpy.ndarray). It is a classification consisting of 1s (presence of a ground feature), 0s (absence of that feature, but still a ground observation), and NaNs (no ground observation, e.g., clouds).

I am interested in understanding things like how long the feature exists on the ground, when it arrives, when it leaves, etc. However, before I can calculate this, I need to clean the time-series cube. Take this single-pixel time-series, for example (with NaNs dropped for visibility):

[image: plot of the single-pixel time-series]

The underlying time-series for this pixel looks like this (last few values shown):

 '2018-06-15': 0.0,
 '2018-06-18': 0.0,
 '2018-06-20': nan,
 '2018-06-21': nan,
 '2018-06-23': 0.0,
 '2018-06-25': nan,
 '2018-06-28': nan,
 '2018-06-30': 0.0,
 '2018-07-08': nan,
 '2018-07-10': 0.0,
 '2018-07-15': 1.0,
 '2018-07-18': nan,
 '2018-07-23': 0.0,
 '2018-07-25': nan,
 '2018-07-28': nan,
 '2018-07-30': nan

The single-date spike to 1 near the right edge of the time-series is a clear error in the classification that I want to correct before creating my outputs, leaving the other values as they are.

To be specific, I am looking for ways to correct (i.e., set to NaN or to the opposite value) these types of outliers, i.e., where one or maybe two 0s/1s are surrounded in time by the other value. Note that I want to keep all other values as they are: I cannot simply drop NaNs, since each time-step has NaNs in different spots, and I need to maintain the observed start and end dates of the correctly classified portions for my later calculations. So essentially, I want to look at each 0/1, compare it with its non-NaN neighbors, and set it to NaN if those neighbors are the opposite value.
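For example (made-up toy values, just to illustrate the rule I have in mind):

import numpy as np

# a lone 1 whose non-NaN neighbors are all 0s should be corrected
before = np.array([0.0, np.nan, 0.0, 1.0, np.nan, 0.0, 0.0])
after = np.array([0.0, np.nan, 0.0, np.nan, np.nan, 0.0, 0.0])  # only the spike changes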

I have been playing around with various xarray options (e.g., rolling, interp, resample) but have not gotten the output I want. Maybe there is some way to find local outliers that I can use with xarray? I generally struggle with local time-series operations like this in xarray.

Thank you for your time!


scipy.ndimage.label has been used for something like this before (e.g. here). However, scipy.ndimage.label only works on boolean masks, with 0/False being the background and anything else the foreground. Since you have more than one category to label, you might need a different function.
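For reference, a quick illustration of that limitation on a small made-up mask:

import numpy as np
from scipy import ndimage

mask = np.array([0, 1, 1, 0, 1, 0, 0, 1, 1, 1], dtype=bool)
labels, n_features = ndimage.label(mask)
# labels -> [0, 1, 1, 0, 2, 0, 0, 3, 3, 3], n_features -> 3
# all runs of 0s collapse into the background, so their lengths can't be counted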

skimage.measure.label could work:

from skimage.measure import label
import flox
import xarray as xr

def _consecutive_groups(categories):
    # label runs of consecutive equal values; -1 (the filled NaNs) is the background
    labels = label(categories, background=-1, connectivity=1)

    # count the size of each labelled run
    counts, _ = flox.groupby_reduce(labels, labels, func="count")
    counts[0] = 0  # background values
    return counts[labels]

def consecutive_groups(categories):
    return xr.apply_ufunc(
        _consecutive_groups,
        categories.fillna(-1).astype("int"),
        input_core_dims=[["time"]],
        output_core_dims=[["time"]],
        vectorize=True,
        dask="parallelized",
        output_dtypes=["uint16"],
    )

That will give you the length for each group, and you could also return the computed groups should you need that.

After that, a simple ds.where(group_lengths > 1) should mask out all the length-1 groups.
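As a quick end-to-end check on a small made-up series (using consecutive_groups as defined above):

import numpy as np
import pandas as pd
import xarray as xr

cube = xr.DataArray(
    [0.0, 0.0, 0.0, 1.0, np.nan, 0.0, 0.0],
    coords={"time": pd.date_range("2018-06-15", periods=7, freq="2D")},
    dims="time",
)

group_lengths = consecutive_groups(cube)
cleaned = cube.where(group_lengths > 1)
# cleaned -> [0, 0, 0, nan, nan, 0, 0]: the lone 1 is masked, everything else survives

One thing to be aware of: the NaNs are labelled as background and therefore split runs, so a 0 or 1 separated from its peers only by NaNs forms its own (length-1) group.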


Thanks @keewis for the idea! I have not yet been able to implement it on my end; I was struggling to get the categories variable into the correct format, although I did have skimage.measure.label working on test 2D arrays (without a time dimension).

I did end up with an approach that works for me, based on xarray's cumsum. It was inspired by python - Convert cumsum() output to binary array in xarray - Stack Overflow.

First, for value 1, calculate a cumulative sum along time, resetting it each time a 0 is found:

cumsum = cube.cumsum(dim='time') - cube.cumsum(dim='time').where(cube == 0).ffill(dim='time').fillna(0)
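To see what this does, take a made-up toy series:

import numpy as np
import pandas as pd
import xarray as xr

cube = xr.DataArray(
    [1.0, 1.0, np.nan, 1.0, 0.0, 1.0, 1.0],
    coords={'time': pd.date_range('2018-06-15', periods=7, freq='2D')},
    dims='time',
)
cumsum = cube.cumsum(dim='time') - cube.cumsum(dim='time').where(cube == 0).ffill(dim='time').fillna(0)
# cumsum -> [1, 2, 2, 3, 0, 1, 2]
# i.e., a running count of 1s that carries through NaNs and resets at each 0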

Next, find groups of value 1 that meet the condition (here thresh = 3, i.e., at least three consecutive 1-observations, skipping NaNs); all observations that are part of such a group will be kept:

thresh = 3  # minimum run length to keep

grps1 = xr.full_like(cumsum, fill_value=0)
grps1 = xr.where(cumsum >= thresh, 1, grps1)  # runs that reached the threshold
grps1 = xr.where((cumsum > 0) & (cumsum < thresh), np.nan, grps1)  # still undecided
grps1 = grps1.bfill(dim='time')  # backfill so each run inherits the decision made at its end

Then, flip the input cube and repeat this process (i.e., find groups of value 0s - which for cumsum have been set to value 1):

# swap the 0s and 1s, leaving NaNs untouched
cube_flip = xr.where(cube == 1, 0, cube)
cube_flip = xr.where(cube == 0, 1, cube_flip)

# Next, repeat above, with grps0
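Spelled out, that repeat is the same block as above with cube_flip and grps0 substituted in (using cumsum0 to avoid overwriting the first cumulative sum):

cumsum0 = cube_flip.cumsum(dim='time') - cube_flip.cumsum(dim='time').where(cube_flip == 0).ffill(dim='time').fillna(0)

grps0 = xr.full_like(cumsum0, fill_value=0)
grps0 = xr.where(cumsum0 >= thresh, 1, grps0)
grps0 = xr.where((cumsum0 > 0) & (cumsum0 < thresh), np.nan, grps0)
grps0 = grps0.bfill(dim='time')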

Finally, create a cleaned cube based on these groups, and replace NaNs from the original cube to get the desired result (outliers removed, but otherwise original observations/NaNs remain intact):

# keep confirmed 1-groups, then confirmed 0-groups, then restore the original NaNs
cube_clean = xr.where(grps1 == 1, 1, np.nan)
cube_clean = xr.where(grps0 == 1, 0, cube_clean)
cube_clean = xr.where(cube.isnull(), np.nan, cube_clean)

This is not as short and clean as the skimage.measure.label approach could be, and it does not scale well if there are many groups to clean, but it is all dask-compliant, so it runs pretty quickly when the data need to be loaded into memory later.

Note that this requires the time dimension not to be chunked, which might mean rechunking if you have a different chunking scheme.
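Something like this before applying the function should do it:

cube = cube.chunk({"time": -1})  # -1 collapses the time dimension into a single chunk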

If that’s not it, can you post a repr of the array you want to analyze? With that, it might be easier to help.

As for your own attempt: if it works properly, then you can definitely just use that, as well.