Datasets variables percentile classification over timesteps

Hi everyone,

Situation description

Here is the situation I’m dealing with:

  • I have several tif files that represent Land Surface Temperature over a specific area (the area is fixed)
  • each tif file corresponds to one timestep
  • only one variable/band exists in each tif file (Land Surface Temperature).


At the end of the day, I would like to be able to:

  • for each pixel
  • classify which of that timestep’s percentile intervals it belongs to (ex: [5%, 25%, 50%, 75%, 90%])
  • depending on the season (WINTER, SUMMER)

In other words:
during summer of year Y, the pixel X was:

  • 10% part of [0, 5%[ “percentile interval”,
  • 15% part of [5, 25%[ “percentile interval”,
  • 20% part of [25, 50%[ “percentile interval”,
  • 25% part of [50, 75%[ “percentile interval”,
  • 20% part of [75, 90%[ “percentile interval”,
  • 10% No Data (meaning NaN)

Outputs expected

At the moment, I’ve used xr.concat to concatenate all the tif files, adding the time dimension to the already existing x, y and band ones.

As an intermediate step, I would expect a similarly “shaped” dataset, with class labels instead of LST values.

This dataset would then be used as input for a dashboard.

I’ve been looking around for ways of achieving this but still haven’t found a good, efficient solution.
I have been struggling with ‘nan’ values that I am not able to ‘catch’ (they match neither the string ‘nan’ nor np.nan, …).
Looking towards you guys, I’m also anticipating that it might be wise to select a performance-oriented solution, since the process will most probably have to handle much more data in the future.
I’m sure that somewhere, out there, smart people would have done similar things.
I’m hoping I could benefit from your experience.

Thanks in advance for your time.


I would try something like DataArray.resample(time='QS-DEC').quantile([0, .05, .25, .5, .75, .9, 1]) to group by season and compute the values corresponding to the percentile intervals.

Then something like numpy.histogram to count the number of values falling into each bin. If you need to know which class each value belongs to, numpy.digitize will do that.
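A minimal runnable sketch of those two steps, with synthetic data standing in for the real LST stack (the array shape, dates and values are made up for illustration):

```python
import numpy as np
import pandas as pd
import xarray as xr

# Synthetic stand-in for the LST stack: dims (time, y, x)
time = pd.date_range("2020-01-01", periods=365, freq="D")
rng = np.random.default_rng(0)
lst = xr.DataArray(
    rng.normal(290, 10, size=(365, 4, 5)),
    dims=("time", "y", "x"),
    coords={"time": time},
    name="lst",
)

# One set of percentile bounds per season ('QS-DEC' makes quarters start
# in December, i.e. meteorological seasons); dim=... reduces over all
# dimensions so each season gets scalar bounds
bounds = lst.resample(time="QS-DEC").quantile(
    [0, 0.05, 0.25, 0.5, 0.75, 0.9, 1], dim=...
)

# Count the values of one season (here JJA 2020) falling into each bin
season_vals = lst.sel(time=slice("2020-06-01", "2020-08-31")).values.ravel()
season_bins = bounds.sel(time="2020-06-01").values
counts, _ = np.histogram(season_vals, bins=season_bins)

# np.digitize instead tells you which bin each individual value belongs to
classes = np.digitize(season_vals, season_bins)
```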



The xhistogram package may be useful for the latter step.

Building a sound timeseries

A quick add-on to the input situation: the tif rasters do not natively have any time dimension.
I will have to parse the tif filenames to extract the dates.

You might have a smart way to add this date information as a time dimension while merging all those tif files at the same time (am I expecting too much magic :smiley: ?).
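For reference, here is the kind of thing I’m picturing, assuming a hypothetical filename pattern like `LST_20200615.tif` (the regex and filenames are made up; in practice each array would come from opening the tif, e.g. with rioxarray, rather than the dummy zeros used here):

```python
import re
from datetime import datetime

import numpy as np
import xarray as xr

def date_from_filename(fname):
    """Extract the acquisition date embedded in a tif filename (YYYYMMDD)."""
    match = re.search(r"(\d{8})", fname)
    return datetime.strptime(match.group(1), "%Y%m%d")

filenames = ["LST_20200615.tif", "LST_20200716.tif"]

# Each array would really come from opening the tif file; dummy zero
# arrays stand in here so the sketch is self-contained
arrays = []
for fname in filenames:
    da = xr.DataArray(np.zeros((1, 3, 4)), dims=("band", "y", "x"))
    # expand_dims both adds the 'time' dimension and sets its coordinate
    arrays.append(da.expand_dims(time=[date_from_filename(fname)]))

stack = xr.concat(arrays, dim="time")
```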

xhistogram percentile classification…per timestep

I haven’t dived that deep into the xhistogram package yet, but the bins fed to xhistogram are timestep-dependent: each timestep has its own land surface temperature distribution.

So I don’t want to apply xhistogram to the whole “timeseries”, but rather to each timestep’s LST dataarray.
If there is a way to do this in one step, it would be great…

At the moment, the only solution I’m picturing is some kind of loop… which is rarely efficient performance-wise (correct me if I’m wrong in this particular case).
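To make the idea concrete, this is the kind of loop I’m picturing (synthetic data stands in for the real stack; each timestep is digitized against its own percentile bounds):

```python
import numpy as np
import xarray as xr

# Synthetic stand-in for a small LST stack: dims (time, y, x)
rng = np.random.default_rng(1)
lst = xr.DataArray(
    rng.normal(290, 10, size=(3, 4, 5)),
    dims=("time", "y", "x"),
    coords={"time": np.arange(3)},
)

perc_list = [5, 25, 50, 75, 90]

classified = []
for t in lst.time.values:
    da = lst.sel(time=t)
    bins = np.nanpercentile(da, perc_list)  # this timestep's own bounds
    classified.append(xr.apply_ufunc(np.digitize, da, kwargs={"bins": bins}))

# Reassemble: same shape as the input, class indices instead of LST values
lst_classified = xr.concat(classified, dim="time")
```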


Just as an update on my situation, I’ve been dealing with a simpler situation.
I started from the dataarray corresponding to one timestep only.

```python
import numpy as np
import xarray as xr

def np_array_percentile(np_array, perc_list):
    """From a numpy array (or a dataarray), compute the percentile bounds
    values corresponding to 'perc_list' (the required percentile bounds).
    Return a dictionary mapping each percentile bound to its value."""
    perc_sorted = sorted(perc_list)
    return {perc: np.nanpercentile(np_array, perc) for perc in perc_sorted}

# --- get Land Surface Temperature percentile values from 'ds_t0_band_data' dataarray
extract_lst_percentile_dict = np_array_percentile(
    ds_t0_band_data, perc_list=[5, 10, 25, 50, 75, 90, 95]
)

# --- Set input data, and expected percentile bounds values
lst = ds_t0_band_data
class_bins = list(extract_lst_percentile_dict.values())

# --- Associate the percentile interval index to each dataarray "pixel"
lst_classified = xr.apply_ufunc(
    np.digitize,  # func to run across the input array
    lst,          # func arg 1 (the array that needs to be classified)
    class_bins,   # func arg 2 (the classification bins)
)
```

It’s progress, but still isn’t what I’m looking for, since I would rather have this process applied to a dataset containing several timesteps, knowing that each timestep’s percentile bounds must be computed independently of the others before even thinking about aggregating.

I’ll report back if I make any progress.

In the meantime, feel free to drop any thoughts that might help me, or others to connect hints, and lead to the expected process.

Thanks again for your time.

Nice progress @mathieuc31!

You might consider wrapping the whole thing inside xarray.apply_ufunc, which would allow you to easily parallelize over the time dimension.

Thanks @rabernat ! All cheerings are very much appreciated :sweat_smile: !

I have to say that I’m not yet comfortable with xarray.apply_ufunc and its whole ‘vectorization’ power.

When you say…

consider wrapping the whole thing inside xarray.apply_ufunc

…do you mean: getting rid of the whole ‘percentile bounds values determination’ step, by using a lambda function instead of the xarray.apply_ufunc ‘class_bins’ argument, for example ?

Or is there a way to use xarray.apply_ufunc to do just what I was looking for (applying this whole process, not only more elegantly to one timestep, but to a whole dataset/stack of timesteps) ?