Efficient data selection without for loops

Hello, I am looking for an efficient way to extract data using multiple slices stored in a list. The easiest way to show this is with an example -

import numpy as np
import xarray as xr

# Create the data array
data = xr.DataArray(
    np.random.rand(5, 6),
    dims=["time", "variable"],
    coords={
        "time": np.arange(5),
        "variable": np.arange(6),
    },
)

# Define slices
slices = [slice(0, 2), slice(1, 3), slice(2, 4)]  # equal-length slices, each shifted by one step

# Select the data for each slice
sliced_data = [data.isel(time=slc).assign_coords(time=np.arange(slc.stop - slc.start)) for slc in slices]

# Concatenate along the new dimension 'window_dim'
result = xr.concat(sliced_data, dim='window_dim')

print(result)

OUTPUT:
<xarray.DataArray (window_dim: 3, time: 2, variable: 6)>
array([[[0.33547378, 0.67330893, 0.69904389, 0.88787631, 0.26807342,
         0.07760665],
        [0.78355031, 0.6135081 , 0.75868513, 0.16590802, 0.71739294,
         0.42383822]],

       [[0.78355031, 0.6135081 , 0.75868513, 0.16590802, 0.71739294,
         0.42383822],
        [0.01768034, 0.5773279 , 0.09635795, 0.0637734 , 0.63216361,
         0.78761642]],

       [[0.01768034, 0.5773279 , 0.09635795, 0.0637734 , 0.63216361,
         0.78761642],
        [0.4377235 , 0.42413106, 0.16612197, 0.1085243 , 0.35388582,
         0.47942606]]])
Coordinates:
  * time      (time) int64 0 1
  * variable  (variable) int64 0 1 2 3 4 5
Dimensions without coordinates: window_dim

I am wondering how to do this without the for loop or xr.concat(). I have tried using rolling() on the data like so -

# total_steps is defined elsewhere for the real dataset; 'y' and 'x' are
# extra dims of that dataset, not of the toy array above
rolling_data = data.rolling(time=len(data.time) - total_steps, center=False).construct('window_dim')
data2 = rolling_data.transpose('window_dim', 'time', 'variable', 'y', 'x').isel(time=slice(len(data.time) - total_steps - 1, None))

But when I select a certain index (or slice of indices) and convert to numpy before passing it to my machine learning model, it takes forever. Hence my idea of using rolling indices to extract the data, but as you can see that requires the for loop, which is not the best idea.
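For reference, here is a runnable version of that rolling approach on the toy array, assuming total_steps = 3 (so the window length is 5 - 3 = 2, matching the slices above) and omitting the 'y'/'x' dims:

total_steps = 3  # assumed here so the window length matches the slices above
window = len(data.time) - total_steps  # 2

rolling_data = data.rolling(time=window, center=False).construct('window_dim')
# The first window - 1 positions along 'time' hold NaN-padded, incomplete windows
data2 = rolling_data.isel(time=slice(window - 1, None)).transpose('time', 'window_dim', 'variable')
print(data2.sizes)  # time: 4, window_dim: 2, variable: 6

Note that rolling yields every complete window (four here, one more than the three slices), so you would still subset the windows you actually want.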


Something similar has been asked before (see Extract small data cubes around observation points in datacube, except that was slicing in label space instead of index space), but the answer is still the same: there’s no built-in way to do this, so one way or another you’ll have to loop manually. However, instead of indexing and then concatenating xarray objects, we can just as well slice a range object, construct an integer index, and only then index the xarray object:

def slice_indices(size, slices):
    # Apply each slice to range(size), then stack into a 2-D integer index array
    range_ = range(size)
    return np.array([range_[slice_] for slice_ in slices])

# Vectorized indexing: the 2-D indexer replaces the 'time' dim with ('window_dim', 'time')
dims = ["window_dim", "time"]
data.isel(time=xr.Variable(dims, slice_indices(data.sizes["time"], slices)))
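On the toy array this produces the same values as the concat-based result from the question; the only difference is that the original time labels are kept as a 2-D coordinate instead of being reset to 0 and 1. A quick sanity check:

vectorized = data.isel(time=xr.Variable(dims, slice_indices(data.sizes["time"], slices)))
print(vectorized.dims)  # ('window_dim', 'time', 'variable')
np.testing.assert_array_equal(vectorized.values, result.values)  # identical values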

For more advanced slicing, you might also be interested in xbatcher.
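For example, a minimal sketch with xbatcher's BatchGenerator (assuming xbatcher is installed), where input_overlap controls how many steps consecutive windows share:

import xbatcher

# Length-2 windows along 'time' that overlap by one step, like the slices above
bgen = xbatcher.BatchGenerator(data, input_dims={"time": 2}, input_overlap={"time": 1})
for batch in bgen:
    print(batch.sizes)  # each batch: time: 2, variable: 6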


This is helpful. I didn’t think you could index using an xarray object with multiple dimensions!

when I select a certain index (or slice of indices) and convert to numpy before passing it to my machine learning model, it takes forever.

Are you using dask? If not, I would try transposing to the appropriate order before calling rolling.construct, as sketched below.
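A minimal sketch of that idea on the toy array (the transpose is a no-op here, but with real multi-dimensional data you would put the dims in the order your model consumes before constructing the windows, so the eventual numpy conversion reads memory in order):

window = 2  # assumed window length

arranged = data.transpose('time', 'variable')  # no-op for the toy array; matters for real data
windows = arranged.rolling(time=window, center=False).construct('window_dim')
valid = windows.isel(time=slice(window - 1, None))  # drop the NaN-padded windows

batch = valid.isel(time=0).values  # the copy from the strided view to numpy happens here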