Hey, y’all! I have had some great success with cloud-based, Zarr-backed xarray datasets. Thanks for all the great work.
I am soliciting some advice on efficiently creating one xarray dataset from a much larger one. Specifically, I have a “spatial mosaic” dataset with a sparse variable. I would like to create a chip training dataset by slicing out a window around every point that is non-nan for a specific variable and writing the result to a cloud-based, Zarr-backed xarray dataset.
The idea is to find all non-nan points in a specific variable, slice a window about each point (including all other variables), reduce this set into a dataset of reasonable size, and then write it to a target dataset.
Here is the general idea in code:
import dask.array
import numpy as np
import xarray as xr
target_features = ['a','b']
times = [2019]
final_height = 10000
final_width = 10000
xy_chunksize = 1000
dtype = np.float32
##### Create a dummy dataset to demonstrate the source data
# dummies for a
a_data = dask.array.zeros(
    (len(times), final_width, final_height),
    chunks=[1, xy_chunksize, xy_chunksize],
    dtype=dtype,
)
# random nan for b
b = np.arange(final_height * final_width).astype(np.float32)
idx = np.random.randint(0, final_width * final_height, size=int(1e6))
b[idx] = np.nan
b_data = np.repeat(b, len(times)).reshape((-1, final_width, final_height))
spatial_dset = xr.Dataset(
    data_vars={
        'a': (["time", "x", "y"], a_data),
        'b': (["time", "x", "y"], b_data),
    },
    coords={
        "lon": (["x"], np.arange(final_width)),
        "lat": (["y"], np.arange(final_height)),
        "time": ("time", times),
    },
)
# create the expected chip datasets
chip_width = 3
n = int(np.isnan(spatial_dset.b).sum())
data_variables = list(spatial_dset.data_vars)
# establish the shape of each feature
dummies = dask.array.zeros(
    (n, chip_width, chip_width), chunks=[n, chip_width, chip_width], dtype=dtype
)
# establish the shape of the lat and lon coordinates
lon_dummies = dask.array.zeros((n, chip_width), dtype="float64")
lat_dummies = dask.array.zeros((n, chip_width), dtype="float64")
time_dummies = dask.array.zeros((n), dtype="<M8[ns]")
data_vars = {
    nm: (["i", "x", "y"], dummies) for nm in data_variables
}
chip_target = xr.Dataset(
    data_vars=data_vars,
    coords={
        "lon": (["i", "x"], lon_dummies),
        "lat": (["i", "y"], lat_dummies),
        "i": range(n),
        "time": ("i", time_dummies),
    },
)
# write this target (compute=False defers writing the dask-backed data)
chip_target.to_zarr("gs://...", compute=False)
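For reference, the way I picture filling this pre-allocated store afterwards is with region writes along i rather than appends, roughly like this (the batch and the slice bounds here are hypothetical, just to show the call I have in mind):

# hypothetical: a computed batch of chips laid out like chip_target.isel(i=slice(0, 1000))
chip_batch = chip_target.isel(i=slice(0, 1000))
chip_batch.to_zarr(
    "gs://...",  # same store as above
    region={"i": slice(0, 1000)},  # write only this slab of the i dimension
)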
This example uses nan to mark the sparse points of interest, but in my actual use case the points of interest are the non-nan values. Anyway, since the sparse points are spread uniformly over the entire spatial dataset, I could leverage something like map_blocks with a function like
def slice_non_nan(block_dset):
    # positional indices of the points of interest within this block
    # (non-nan in my real data; the dummy data above uses nan instead)
    _, x_idx, y_idx = np.where(~np.isnan(block_dset['b'].values))
    half = chip_width // 2
    if len(x_idx) > 0:
        chips = []
        for x, y in zip(x_idx, y_idx):
            chip = block_dset.isel(
                x=slice(x - half, x + half + 1),
                y=slice(y - half, y + half + 1),
            )
            chips.append(chip)
        block_chip_dset = xr.concat(chips, dim='i')
        block_chip_dset.to_zarr(
            'gs://path_to_target',
            append_dim='i',
            synchronizer=synchronizer,  # zarr synchronizer set up elsewhere
        )
and simply call
spatial_dset.map_blocks(slice_non_nan).compute()
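I realize this probably will not work exactly as written, because map_blocks expects the mapped function to return a Dataset or DataArray; the workaround I had in mind is to tack a throwaway return value onto the function, something like:

def slice_non_nan(block_dset):
    # ... same body as above, writing this block's chips as a side effect ...
    # then return something tiny purely to satisfy map_blocks
    return block_dset.isel(x=slice(0, 0), y=slice(0, 0))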
Is there a more reasonable approach to this? My actual spatial_dset has dims frozen({'time': 2, 'x': 48248, 'y': 48050}) with 42 float32 variables, which is about 770 GB. The feature of interest has about 10M non-nan examples that are uniformly spread out. From what I understand, the map_blocks route is a bit of a hack, since the mapped function is supposed to return a dataset rather than write as a side effect (hence the throwaway return above). Any help on making this more efficient, or links to relevant existing functions, would be greatly appreciated!
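For what it's worth, one alternative I have been wondering about is skipping map_blocks entirely and using xarray's vectorized (pointwise) indexing to pull a chip out around every point in one lazy operation, then writing the result out in batches. A rough, untested sketch of what I mean (boundary handling is crude, the batched writes are omitted, and the variable names are just for illustration):

half = chip_width // 2

# positional indices of the points of interest
# (first time step only, to keep the sketch simple)
mask = np.isnan(spatial_dset["b"].isel(time=0).values)  # ~np.isnan in my real data
x_idx, y_idx = np.nonzero(mask)

# drop points whose window would hang off the edge of the mosaic
keep = (
    (x_idx >= half) & (x_idx < spatial_dset.sizes["x"] - half)
    & (y_idx >= half) & (y_idx < spatial_dset.sizes["y"] - half)
)
x_idx, y_idx = x_idx[keep], y_idx[keep]

# one row of window offsets per point: shape (n_points, chip_width)
offsets = np.arange(-half, half + 1)
x_indexer = xr.DataArray(x_idx[:, None] + offsets, dims=["i", "x"])
y_indexer = xr.DataArray(y_idx[:, None] + offsets, dims=["i", "y"])

# vectorized indexing: every variable gains an "i" dimension holding one
# chip_width x chip_width window per point; stays lazy for dask-backed variables
chips = spatial_dset.isel(x=x_indexer, y=y_indexer)

If something like that is workable, I imagine iterating over slabs of i and writing each one into the pre-allocated chip target with region writes as sketched above.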