ZARR: Selecting Data Based on Bounding Box Coordinates with TIME Dimensions Only dataset

For reference, I posted the same question on Stack Overflow (dask - ZARR: Inefficiency in Selecting Data Based on Bounding Box Coordinates with TIME Dimensions Only dataset - Stack Overflow).

I have the following dataset which is a trajectory type dataset.
LATITUDE and LONGITUDE are defined as coordinates in xarray BUT they aren’t dimensions! All of the issues I’ve come across on Stack Overflow and forums have LATITUDE and LONGITUDE as dimensions.

Here’s a brief overview of the dataset.

Dimensions:
TIME: 120385119

Coordinates:
DEPTH (TIME) float64 dask.array<chunksize=(10000,), meta=np.ndarray>
LATITUDE (TIME) float64 dask.array<chunksize=(10000,), meta=np.ndarray>
LONGITUDE (TIME) float64 dask.array<chunksize=(10000,), meta=np.ndarray>
TIME (TIME) datetime64[ns] 2008-11-25T00:54:26.443568128 .....

Data variables: (61)

Indexes: (1)

Attributes: (43)

My issue is that this dataset, currently stored on S3 as a ZARR dataset, is horrendously slow at selecting data based on a geographical bounding box. Selecting based on TIME is fine though since it’s a dimension.

In a perfect world, I’d like to achieve this:

%%time
import xarray as xr
import fsspec

# remote zarr dataset
url = 's3://imos-data-lab-optimised/zarr/target.zarr'
ds = xr.open_zarr(fsspec.get_mapper(url, anon=True), consolidated=True, chunks={'TIME': 10000})  # probably important to open with chunks so the whole dataset doesn't end up in RAM

subset = ds[['PSAL', 'TEMP', 'DEPTH']]
subset = subset.where((ds.LONGITUDE >= 151) & (ds.LONGITUDE <= 152) & (ds.LATITUDE >= -34) & (ds.LATITUDE <= -33), drop=True)
subset

Unfortunately, the current process is time-consuming since it involves downloading all the chunks for LATITUDE and LONGITUDE just to locate the required data, which is counterproductive. I’d like a more efficient approach, similar to setting an index on a column in PostgreSQL, for instance.

I’ve tried several approaches, including indexing, using a pandas multiindex, and attempting regridding (though it seems impractical due to the array’s size). However, I find myself going in circles, and I can’t believe no one else has encountered this issue.

I’m open to suggestions for possibly re-writing this zarr dataset, considering that the dataset is not static and grows with new trajectories and timestamps over time.

I’d start by looking at the chunks of your zarr store (look at the encoding of a variable, e.g. ds.DEPTH.encoding, or try opening with chunks={}). Are these integer multiples of your chosen chunksize? Chunks are the smallest unit that can be read from disk, so if you choose a chunksize that is not an integer multiple of the on-disk chunksize, dask / zarr will read the full chunk only to throw away part of what was read.
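
For example, a minimal sketch (reusing the URL from the first post) to check what the store was written with:

import fsspec
import xarray as xr

url = 's3://imos-data-lab-optimised/zarr/target.zarr'

# chunks={} tells xarray / dask to use the on-disk (encoded) chunking as-is
ds = xr.open_zarr(fsspec.get_mapper(url, anon=True), consolidated=True, chunks={})

print(ds.DEPTH.encoding.get("chunks"))  # chunk shape as written to the zarr store
print(ds.DEPTH.chunks)                  # dask chunks when opened with chunks={}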

In this case, I see that the chunksize for TIME is 2000000, which means that by choosing a chunksize of 10000 we only keep a tiny fraction of the chunk, making the reading very inefficient.
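
So something along these lines (a sketch; the exact multiple depends on how much memory you want each task to use) keeps the dask chunks aligned with the store:

# pick a chunksize that is an integer multiple of the on-disk chunksize (2000000 for TIME here)
ds = xr.open_zarr(fsspec.get_mapper(url, anon=True), consolidated=True, chunks={'TIME': 2000000})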

Though once that’s resolved you’re back to the where, which as you noticed is not the fastest, even if latitude and longitude are in memory.

Unfortunately, I believe the where is your only option for now (I might be wrong, though), since as far as I remember the (experimental) custom 2D indexes for xarray that exist don’t support range queries / slices, and I don’t think scipy’s/sklearn’s KDTree / BallTree implementations support it either (circle query aside) – which is what the 2D indexes I talked about would use. cc @benbovy

To load latitude and longitude into memory, try

ds.assign_coords(ds[["LATITUDE", "LONGITUDE"]].compute().coords)
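
A sketch of the full pattern, reusing the query from the first post (note that assign_coords returns a new object, so keep the result):

# load the 1D LATITUDE / LONGITUDE coordinates into memory once
ds = ds.assign_coords(ds[["LATITUDE", "LONGITUDE"]].compute().coords)

# the bounding-box where now runs against in-memory coordinates
subset = ds[['PSAL', 'TEMP', 'DEPTH']]
subset = subset.where(
    (ds.LONGITUDE >= 151) & (ds.LONGITUDE <= 152)
    & (ds.LATITUDE >= -34) & (ds.LATITUDE <= -33),
    drop=True,
)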

What’s distinct about your use case is that you are basically storing vector data, not raster. Another way of putting it is that you have a table, not a multidimensional array. So I’d consider using some of the tools for vector / tabular data. Zarr and Xarray have not been optimized for this use case (although they could be).

Try storing your data in geoparquet instead of Zarr and querying it with GeoPandas. You’ll still have to download all of the lats / lons to build a client-side index (there’s no way around this unless you put the data in a database), but it should be faster and more efficient.
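
A rough sketch of that workflow (the output path is hypothetical, and for a dataset of this size the conversion itself would probably need to be done in batches rather than in one go):

import geopandas as gpd

# build a point GeoDataFrame from the trajectory table and write it as (Geo)Parquet
df = ds[['PSAL', 'TEMP', 'DEPTH', 'LATITUDE', 'LONGITUDE']].to_dataframe()
gdf = gpd.GeoDataFrame(
    df,
    geometry=gpd.points_from_xy(df['LONGITUDE'], df['LATITUDE']),
    crs="EPSG:4326",
)
gdf.to_parquet("target.parquet")  # hypothetical local path

# later: read it back and select with the coordinate (bounding-box) indexer
gdf = gpd.read_parquet("target.parquet")
bbox_subset = gdf.cx[151:152, -34:-33]  # lon range, lat range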


You’ll still have to download all of the lats / lons to build a client-side index (there’s no way around this unless you put the data in a database)

There’s a bit of discussion around this topic on the geoparquet GitHub repo. Defining Bounding Boxes for Optimizing Spatial Queries · opengeospatial/geoparquet · Discussion #188 · GitHub (bounding boxes using row groups in a single parquet file) and Best Practices for Partitioning GeoParquet? · opengeospatial/geoparquet · Discussion #171 · GitHub (using a Parquet Dataset with multiple files) might be worth following.

Hi, I’ve been working with a dataset with a similar shape and I wanted to write a bit about my strategy for working with it. I don’t know if it will suit your case but perhaps it can help someone. Of course Xarray isn’t really intended for use with this kind of data, but I still find it very useful.

You mention regridding so I assume your coordinates are on a predefined grid? This is what the strategy relies upon. If not, then you could regrid and then try this.

Say for each grid point you have a latitude, a longitude, and a unique ID. I’m assuming you have some way to get (or create and then look up) the unique ID for each pair. Then you can have a long 1D numpy array for each (it’s important later that the grid_ids vector consists of consecutive integers starting from zero, i.e. it is equivalent to the index of grid_lats and grid_lons; you don’t really need it as a separate variable, but it helps readability):

>>> grid_ids = np.array([0, 1, 2, 3, 4, 5])
>>> grid_lats = np.array([20, 25, 20, 25, 20, 25])
>>> grid_lons = np.array([0, 2, 4, 6, 8, 10])

Now you can easily get the grid_ids that correspond to your bounding box, which will be the same as the indices returned by a np.where:

>>> latmin, lonmin, latmax, lonmax = (23, 1, 32, 9)
>>> bbox_ids = np.where((grid_lats >= latmin)
...                     & (grid_lats <= latmax)
...                     & (grid_lons >= lonmin)
...                     & (grid_lons <= lonmax))[0]
>>> bbox_ids
array([1, 3])

At this point you’ll want to create a grid_point vector (holding each observation’s grid ID) and add it to your dataset as a coordinate along TIME (it exists already in my own situation), which I expect could be time-consuming, but you only need to do it once per dataset.
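
One way to build that coordinate, as a sketch, assuming the observed LATITUDE / LONGITUDE values lie exactly on the grid (the pandas MultiIndex lookup is just one option, not the only way):

import pandas as pd

# map each observation's (lat, lon) pair to its grid ID
# note: this loads the coordinate arrays, which is the one-time cost mentioned above;
# with floats you may want to round both sides to a common precision first
grid_index = pd.MultiIndex.from_arrays([grid_lats, grid_lons])
obs_index = pd.MultiIndex.from_arrays([ds["LATITUDE"].values, ds["LONGITUDE"].values])
grid_point = grid_index.get_indexer(obs_index)  # -1 where a pair is not on the grid

ds = ds.assign_coords(grid_point=("TIME", grid_point))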

Now that your dataset is indexed, and you have the IDs of the grid points you want to extract from the dataset, you can create a lookup vector which will make this extraction fast. This is a 1D boolean array with the same size as your grid_ids array, with value False if the grid point is outside the bounding box and True if it is within it:

>>> lookup_vector = np.zeros(len(grid_ids), dtype=bool)
>>> lookup_vector[bbox_ids] = True
>>> lookup_vector
array([False,  True, False,  True, False, False])

Now, to know whether any grid point is within the bounding box, you can index lookup_vector with the grid point’s ID. Finally, you can leverage this to do a fast selection from the dataset: indexing lookup_vector with your dataset’s new grid_point coordinate creates a boolean array with the same length as the time dimension that is True for every data point lying within the bounding box. This selection is very fast compared to a where:

subset = ds[['PSAL', 'TEMP', 'DEPTH']]
# boolean mask along TIME: True where the observation's grid point falls inside the bbox
subset = subset.isel(TIME=lookup_vector[ds["grid_point"].values])
subset

I think you can also use a regular sel there, but I’m not sure if there are any subtle differences when using a boolean array.

This is all a bit convoluted, I admit, but it really speeds up this operation for me. Hope it can help somebody else too.
