Writing to lat lon regions with to_zarr(region=)

To be honest, I am new to dask, xarray, and raster data altogether. So, here I am soliciting more advice from this great community.

My goal is to build a large region of ‘merged’ landsat data in one zarr backed (on gcs) xarray dataset. I have a list of tiff files on gcs. I am starting with about 20gb of landsat data, which is about 25 zoom 10 tiles in my case. We will probably stay under 100gb targets for a while.

I currently take a geo, find the bounds, build a shapely polygon, construct tiles with a specific zoom that intersect this geo, look up to gcs to see which ones exist (have been downloaded from our raster source) and prepare that as a list for ingest. I look at the meta data from gcs with rasterio, discover the bounds by taking min/max over the set, and estimate the final width and height (for the xarray dataset by summing the unique heights for each row and unique widths for each column assuming there is at least one tiff example present for each row and column). At the end of this, I use np.linspace with the bounds and estimated height and width to build the target xarray zarr dataset and push meta data with compute=False.

The target looks like this:

Next, I iterate over each tiff file and do something like this to estimate the target region:

dset_example = dset_example.astype(self.dtype)

# get logical coordinates from lat/lon slice coordinates
dlats = dset_example.lat
tlats = self.target.lat
lat_selection = np.array((tlats <= dlats.max()) & (tlats >= dlats.min()))
lat_region = np.argwhere(lat_selection).flatten()

dlons = dset_example.lon
tlons = self.target.lon
lon_selection = np.array((tlons <= dlons.max()) & (tlons >= dlons.min()))
lon_region = np.argwhere(lon_selection).flatten()

# build slice
lon_slice = min(lon_region), max(lon_region)
lat_slice = min(lat_region), max(lat_region)
lat_diff = max(lat_region) - min(lat_region)
lon_diff = max(lon_region) - min(lon_region)

# some ops before writing
dset_example = dset_example.sel(x=slice(0, lat_diff), y=slice(0, lon_diff))

# construct the slice dict
slice_dict = dict(x=slice(*lat_slice), y=slice(*lon_slice))
if "time" in dset_example.dims:
    slice_dict.update({"time": slice(0, dset_example.dims["time"])})     

Everything seems to be working for the most part, but there is some missing data where tiles intersect. The top corners are simply tiles that we do not have, which is not the issue. Instead, the thin lines are the issue of concern

Any thoughts on this approach? Does stackstac automatically do all this? I know @RichardScottOZ seems to do a lot of this and we seem to have similar goals of building these types of target datasets for training and inference. Any pointers or tools to look into? I feel like I am reinventing the wheel a bit.

1 Like

First thought is this could be checked:
np.linspace with the bounds and estimated height and width

not off by 1 or something?

Instead of trying to create a mosaic, I am now focusing on just adding a coordinate for the specific tile, by padding all tile to the maximum height and width and attempting to append to a remote datastore with the to_zarr(region=) approach.

This is what the datasets look like now:

I ultimately would like the final target to looks something like the concatenated dataset above where I build the individual datasets in multiple_datasets in separate processes and write out to the target.

@rabernat do you have any suggestions on how to index the target dsets so that I can write to it from separate processes?

From what I gather, I believe I need to figure out all the times and tiles in advance to build out the target dataset so that the full shape is specified (I think append is not supported to remote datastores). However, I am not sure how to handle the multiple coordinates with the region. Any advice on how to accomplish this or are there any obvious issues with this approach?

I continue to keep getting ReadOnlyError: object is read-only errors while trying all combinations of indexing and slicing that I can think of. Looks like the read only error was coming from passing only a gcs string to the to_zarr function instead of a fssmap.

The overall idea is to first create a dataset with the desired shape and chunks, but without writing any data. Then you make sure to align your distributed writes with the chunks, so there are no conflicts between processes and no locks required.

Looks like you may have hit this issue:

1 Like

@rabernat that is what I am discovering.

Being limited to slice() in the region argument implies that one must use integers to write out to slices only and not fancy indexing. My current target coords are below. I use dask for tasks at the tile level for all available years, which might be a subset of the target. When I write to the target, I iterate over the available years and write to a specific tile. I can not use slice because I might be missing a year. For example, the implication is that I need to know that tile 376_521_10 is the 1st element in the target (can easily be represented as a slice) and times [2016-01-01', '2015-01-01'] have an integer index of [0,2] (which can not be represented as a slice). I could parallelize at the year-tile level, but some tiffs I am “ingesting” stack yearly info in the bands, which is annoying.

Anyways, does anyone know if there are plans to support fancyish indexing? For example, to_zarr(region=dict(time=[0,1,2,4]))? And if that is supported, maybe supporting to_zarr(region=dict(tile=['376_521_10'], time=['2016-01-01','2015-01-01']))

Does anyone know of a way to write target datasets with compte=False by taking advantage of dask distributed?

It takes roughly 30 seconds for each of my two targets and I do have a dask cluster I could leverage.

Any known ways to speed that up?

It might be a bit difficult to learn from, but pangeo-forge-recipes does use compute=False at pangeo-forge-recipes/xarray_zarr.py at 6194c72343d82e39b09edcafbe7085b1d9464787 · pangeo-forge/pangeo-forge-recipes · GitHub. The idea is to pre-allocate the zarr metadata for your output dataset. Then it’s able to write individual chunks in parallel (in pangeo-forge that’s done in store_chunk: pangeo-forge-recipes/xarray_zarr.py at 6194c72343d82e39b09edcafbe7085b1d9464787 · pangeo-forge/pangeo-forge-recipes · GitHub)

1 Like

When I call target.to_zarr(store, compute=False) it takes 30 seconds and runs locally and I have to do this twice for two separate targets. I am just looking for a possible way to speed that up.

Is there a way to submit this to the cluster itself to speed up writing the pre-allocated metadata?

Edit: I have looked at the recipes and honestly struggle to follow along. I am still pretty new to xarray and dask.

When you call target.to_zarr(store, compute=False), Xarray will write the overall dataset structure and the dimension coordinates. This should not be a data-intensive step. The amount of time it takes will depend on many factors, including:

  • The number of variables in the dataset
  • The amount of metadata
  • The size of the dimension coordinates
  • The quality of the internet connection between client and target

To understand why it takes 30s, I recommend profiling. Specifically snakeviz is our tool of choice.

If you would like more help here to optimize performance, I recommend you post the following information here to the forum:

  • The complete code you are running (e.g. link to gist)
  • The full Xarray repr of the dataset you’re trying to write (print(target))
  • Detailed information about the target Zarr dataset (open the store with Zarr and use the Zarr .info methods)
  • The snakeviz profile
1 Like

I appreciate it. I was simply seeing if there might be a way to speed up writing target metadata. I’ll keep that in mind and when asking for specific help optimizing a specific thing I’ll follow those guidelines!

btw, speeding this up is not really that big of a deal, just trying to get a lay of the land since I struggle to find docs around all potential use cases of .to_zarr(compute=False). For example, what I didn’t realize we could .to_zarr(..., mode="a", compute=False). I gather that is delaying the chunk write and not the target meta, but does that require that an initial target’s metadata is written first?

Thanks Ryan, likely going to run into a similar problem at some stage.