To be honest, I am new to dask, xarray, and raster data altogether. So, here I am soliciting more advice from this great community.
My goal is to build a large region of ‘merged’ Landsat data in one zarr-backed xarray dataset on GCS. I have a list of tiff files on GCS. I am starting with about 20 GB of Landsat data, which is about 25 zoom-10 tiles in my case, and we will probably stay under 100 GB targets for a while.
I currently take a geometry, find its bounds, build a shapely polygon, construct the tiles at a specific zoom that intersect that geometry, check GCS to see which of those tiles exist (i.e. have already been downloaded from our raster source), and prepare that as the list for ingest (roughly as in the sketch below).
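To make the discovery step concrete, it looks roughly like this (simplified; the bucket layout, file naming, and helper name are made up for illustration, and I only intersect the bounding box of the geometry rather than the exact polygon):

```python
import gcsfs
import mercantile
from shapely.geometry import shape


def discover_tiles(geojson_geom, zoom=10, bucket="my-bucket/landsat"):
    """Find tiles at the given zoom that intersect the geometry and exist on GCS."""
    geom = shape(geojson_geom)
    west, south, east, north = geom.bounds
    fs = gcsfs.GCSFileSystem()
    existing = []
    for tile in mercantile.tiles(west, south, east, north, zooms=zoom):
        # placeholder layout: <bucket>/<z>/<x>/<y>.tif
        path = f"{bucket}/{tile.z}/{tile.x}/{tile.y}.tif"
        if fs.exists(path):
            existing.append(path)
    return existing
```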
I then read the metadata for each tiff from GCS with rasterio, compute the overall bounds by taking the min/max over the set, and estimate the final width and height of the target by summing the unique heights per row and unique widths per column (assuming at least one tiff is present for each row and column). Finally, I use np.linspace with those bounds and the estimated height and width to build the coordinates of the target zarr-backed xarray dataset, and push just the metadata with compute=False, along the lines of the sketch below.
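In code, the target construction is roughly this (the variable name, dtype, chunk sizes, bounds, and store path below are placeholders rather than what we actually use):

```python
import dask.array as da
import numpy as np
import xarray as xr

# placeholder bounds and shape; in practice these come from the min/max over
# the tiff metadata and the summed unique widths/heights described above
west, south, east, north = -120.0, 35.0, -118.0, 37.0
est_height, est_width = 20_480, 20_480

lats = np.linspace(north, south, est_height)  # descending, to match the tiffs
lons = np.linspace(west, east, est_width)

target = xr.Dataset(
    {
        "band_data": (
            ("lat", "lon"),
            da.zeros((est_height, est_width), chunks=(4096, 4096), dtype="uint16"),
        )
    },
    coords={"lat": lats, "lon": lons},
)

# compute=False writes the metadata and coordinate arrays but defers the
# (all-zero) data writes; each tile is later written into its own region
target.to_zarr("gs://my-bucket/target.zarr", compute=False)
```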
The target looks like this:
Next, I iterate over each tiff file and do something like this to work out which region of the target it should be written into:
dset_example = dset_example.astype(self.dtype)
# get logical coordinates from lat/lon slice coordinates
dlats = dset_example.lat
tlats = self.target.lat
lat_selection = np.array((tlats <= dlats.max()) & (tlats >= dlats.min()))
lat_region = np.argwhere(lat_selection).flatten()
dlons = dset_example.lon
tlons = self.target.lon
lon_selection = np.array((tlons <= dlons.max()) & (tlons >= dlons.min()))
lon_region = np.argwhere(lon_selection).flatten()
# build slice
lon_slice = min(lon_region), max(lon_region)
lat_slice = min(lat_region), max(lat_region)
lat_diff = max(lat_region) - min(lat_region)
lon_diff = max(lon_region) - min(lon_region)
# some ops before writing
dset_example = dset_example.sel(x=slice(0, lat_diff), y=slice(0, lon_diff))
# construct the slice dict
slice_dict = dict(x=slice(*lat_slice), y=slice(*lon_slice))
if "time" in dset_example.dims:
    slice_dict.update({"time": slice(0, dset_example.dims["time"])})
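For completeness, that slice dict then feeds a region write back into the same store, roughly like this (again using the placeholder store path from above; note that the region keys have to line up with the target store's dimension names, and I drop the coordinate variables so the write only touches the data variable, since the target's coordinates were already written up front):

```python
# write this tile's pixels into the matching region of the existing zarr store
(
    dset_example
    .drop_vars(["lat", "lon"], errors="ignore")
    .to_zarr("gs://my-bucket/target.zarr", region=slice_dict, mode="r+")
)
```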
Everything seems to be working for the most part, but there is some missing data where tiles intersect. The empty top corners are simply tiles that we do not have, which is not the issue. Instead, the thin lines of missing data along the seams where adjacent tiles meet are the concern.
Any thoughts on this approach? Does stackstac automatically handle all of this? I know @RichardScottOZ seems to do a lot of this, and we seem to have similar goals of building these kinds of target datasets for training and inference. Any pointers or tools I should look into? I feel like I am reinventing the wheel a bit.