Efficient data selection without for loops

Hello, I am looking for an efficient way to extract data using multiple slices stored in a list. The easiest way to show this is with an example -

import numpy as np
import xarray as xr

# Create the data array
data = xr.DataArray(
    np.random.rand(5, 6),
    dims=["time", "variable"],
    coords={
        "time": np.arange(5),
        "variable": np.arange(6),
    },
)

# Define slices
slices = [slice(0, 2), slice(1, 3), slice(2, 4)]  # equal-length slices, each shifted by one step

# Select the data for each slice
sliced_data = [data.isel(time=slc).assign_coords(time=np.arange(slc.stop - slc.start)) for slc in slices]

# Concatenate along the new dimension 'window_dim'
result = xr.concat(sliced_data, dim='window_dim')

print(result)

OUTPUT:
<xarray.DataArray (window_dim: 3, time: 2, variable: 6)>
array([[[0.33547378, 0.67330893, 0.69904389, 0.88787631, 0.26807342,
         0.07760665],
        [0.78355031, 0.6135081 , 0.75868513, 0.16590802, 0.71739294,
         0.42383822]],

       [[0.78355031, 0.6135081 , 0.75868513, 0.16590802, 0.71739294,
         0.42383822],
        [0.01768034, 0.5773279 , 0.09635795, 0.0637734 , 0.63216361,
         0.78761642]],

       [[0.01768034, 0.5773279 , 0.09635795, 0.0637734 , 0.63216361,
         0.78761642],
        [0.4377235 , 0.42413106, 0.16612197, 0.1085243 , 0.35388582,
         0.47942606]]])
Coordinates:
  * time      (time) int64 0 1
  * variable  (variable) int64 0 1 2 3 4 5
Dimensions without coordinates: window_dim

I am wondering how to do this without the for loop or xr.concat(). I have tried using rolling() on the data like so -

# total_steps is defined elsewhere for the real dataset; 'y' and 'x' are
# extra dims of that dataset, not of the toy array above
rolling_data = data.rolling(time=len(data.time) - total_steps, center=False).construct('window_dim')
data2 = rolling_data.transpose('window_dim', 'time', 'variable', 'y', 'x').isel(time=slice(len(data.time) - total_steps - 1, None))

But when I select a certain index (or slice of indices) and convert to numpy before passing it to my machine learning model, it takes forever. Hence my idea of using rolling indices to extract the data, but as you can see that requires the for loop, which is not the best idea.
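For reference, here is a runnable version of that rolling approach on the toy array, assuming total_steps = 3 (so the window length is 5 - 3 = 2, matching the slices above) and omitting the 'y'/'x' dims:

total_steps = 3  # assumed here so the window length matches the slices above
window = len(data.time) - total_steps  # 2

rolling_data = data.rolling(time=window, center=False).construct('window_dim')
# The first window - 1 positions along 'time' hold NaN-padded, incomplete windows
data2 = rolling_data.isel(time=slice(window - 1, None)).transpose('time', 'window_dim', 'variable')
print(data2.sizes)  # time: 4, window_dim: 2, variable: 6

Note that rolling yields every complete window (four here, one more than the three slices), so you would still subset the windows you actually want.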


Something similar has been asked before (see Extract small data cubes around observation points in datacube, except that was slicing in label space instead of index space), but the answer is still the same: there’s no built-in way to do this, so one way or another you’ll have to loop manually. However, instead of indexing and then concatenating xarray objects, we can just as well slice a range object, construct an integer index, and only then index the xarray object:

def slice_indices(size, slices):
    # Apply each slice to range(size), then stack into a 2-D integer index array
    range_ = range(size)
    return np.array([range_[slice_] for slice_ in slices])

# Vectorized indexing: the 2-D indexer replaces the 'time' dim with ('window_dim', 'time')
dims = ["window_dim", "time"]
data.isel(time=xr.Variable(dims, slice_indices(data.sizes["time"], slices)))
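On the toy array this produces the same values as the concat-based result from the question; the only difference is that the original time labels are kept as a 2-D coordinate instead of being reset to 0 and 1. A quick sanity check:

vectorized = data.isel(time=xr.Variable(dims, slice_indices(data.sizes["time"], slices)))
print(vectorized.dims)  # ('window_dim', 'time', 'variable')
np.testing.assert_array_equal(vectorized.values, result.values)  # identical values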

For more advanced slicing, you might also be interested in xbatcher.
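For example, a minimal sketch with xbatcher's BatchGenerator (assuming xbatcher is installed), where input_overlap controls how many steps consecutive windows share:

import xbatcher

# Length-2 windows along 'time' that overlap by one step, like the slices above
bgen = xbatcher.BatchGenerator(data, input_dims={"time": 2}, input_overlap={"time": 1})
for batch in bgen:
    print(batch.sizes)  # each batch: time: 2, variable: 6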


This is helpful. I didn’t think you could index using an xarray object with multiple dimensions!

when I select a certain index (or slice of indices) and convert to numpy before passing it to my machine learning model, it takes forever.

Are you using dask? If not, I would try transposing to the appropriate order before calling rolling.construct, as sketched below.
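A minimal sketch of that idea on the toy array (the transpose is a no-op here, but with real multi-dimensional data you would put the dims in the order your model consumes before constructing the windows, so the eventual numpy conversion reads memory in order):

window = 2  # assumed window length

arranged = data.transpose('time', 'variable')  # no-op for the toy array; matters for real data
windows = arranged.rolling(time=window, center=False).construct('window_dim')
valid = windows.isel(time=slice(window - 1, None))  # drop the NaN-padded windows

batch = valid.isel(time=0).values  # the copy from the strided view to numpy happens here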