Writing to lat lon regions with to_zarr(region=)

Leonard_Strnad · December 16, 2021, 5:47pm

To be honest, I am new to dask, xarray, and raster data altogether. So, here I am soliciting more advice from this great community.

My goal is to build a large region of ‘merged’ landsat data in one zarr backed (on gcs) xarray dataset. I have a list of tiff files on gcs. I am starting with about 20gb of landsat data, which is about 25 zoom 10 tiles in my case. We will probably stay under 100gb targets for a while.

I currently take a geo, find the bounds, build a shapely polygon, construct tiles with a specific zoom that intersect this geo, look up to gcs to see which ones exist (have been downloaded from our raster source) and prepare that as a list for ingest. I look at the meta data from gcs with rasterio, discover the bounds by taking min/max over the set, and estimate the final width and height (for the xarray dataset by summing the unique heights for each row and unique widths for each column assuming there is at least one tiff example present for each row and column). At the end of this, I use np.linspace with the bounds and estimated height and width to build the target xarray zarr dataset and push meta data with compute=False.

The target looks like this:

Next, I iterate over each tiff file and do something like this to estimate the target region:

dset_example = dset_example.astype(self.dtype)

# get logical coordinates from lat/lon slice coordinates
dlats = dset_example.lat
tlats = self.target.lat
lat_selection = np.array((tlats <= dlats.max()) & (tlats >= dlats.min()))
lat_region = np.argwhere(lat_selection).flatten()

dlons = dset_example.lon
tlons = self.target.lon
lon_selection = np.array((tlons <= dlons.max()) & (tlons >= dlons.min()))
lon_region = np.argwhere(lon_selection).flatten()

# build slice
lon_slice = min(lon_region), max(lon_region)
lat_slice = min(lat_region), max(lat_region)
lat_diff = max(lat_region) - min(lat_region)
lon_diff = max(lon_region) - min(lon_region)

# some ops before writing
dset_example = dset_example.sel(x=slice(0, lat_diff), y=slice(0, lon_diff))

# construct the slice dict
slice_dict = dict(x=slice(*lat_slice), y=slice(*lon_slice))
if "time" in dset_example.dims:
    slice_dict.update({"time": slice(0, dset_example.dims["time"])})

Everything seems to be working for the most part, but there is some missing data where tiles intersect. The top corners are simply tiles that we do not have, which is not the issue. Instead, the thin lines are the issue of concern

Any thoughts on this approach? Does stackstac automatically do all this? I know @RichardScottOZ seems to do a lot of this and we seem to have similar goals of building these types of target datasets for training and inference. Any pointers or tools to look into? I feel like I am reinventing the wheel a bit.

RichardScottOZ · December 16, 2021, 9:05pm

First thought is this could be checked:
np.linspace with the bounds and estimated height and width

not off by 1 or something?

Leonard_Strnad · January 4, 2022, 8:18pm

Instead of trying to create a mosaic, I am now focusing on just adding a coordinate for the specific tile, by padding all tile to the maximum height and width and attempting to append to a remote datastore with the to_zarr(region=) approach.

This is what the datasets look like now:

I ultimately would like the final target to looks something like the concatenated dataset above where I build the individual datasets in multiple_datasets in separate processes and write out to the target.

@rabernat do you have any suggestions on how to index the target dsets so that I can write to it from separate processes?

From what I gather, I believe I need to figure out all the times and tiles in advance to build out the target dataset so that the full shape is specified (I think append is not supported to remote datastores). However, I am not sure how to handle the multiple coordinates with the region. Any advice on how to accomplish this or are there any obvious issues with this approach?

I continue to keep getting ReadOnlyError: object is read-only errors while trying all combinations of indexing and slicing that I can think of. Looks like the read only error was coming from passing only a gcs string to the to_zarr function instead of a fssmap.

rabernat · January 5, 2022, 2:56pm

The overall idea is to first create a dataset with the desired shape and chunks, but without writing any data. Then you make sure to align your distributed writes with the chunks, so there are no conflicts between processes and no locks required.

Looks like you may have hit this issue:

github.com/zarr-developers/zarr-python

Using consolidate_metadata with FSStore result in a read-only store

opened 09:49PM - 20 Dec 21 UTC

closed 10:44AM - 06 Jan 22 UTC

jhamman

bug

#### Minimal, reproducible code sample, a copy-pastable example if possible `…``python import xarray as xr from zarr.convenience import consolidate_metadata ds = xr.tutorial.open_dataset('air_temperature') ds.to_zarr('file://foo.zarr', mode='w', consolidated=False) consolidate_metadata('file://foo.zarr') ``` raises: ```python --------------------------------------------------------------------------- ReadOnlyError Traceback (most recent call last) /tmp/ipykernel_63/1636373769.py in <module> ----> 1 consolidate_metadata('file://foo.zarr') /srv/conda/envs/notebook/lib/python3.9/site-packages/zarr/convenience.py in consolidate_metadata(store, metadata_key) 1126 } 1127 } -> 1128 store[metadata_key] = json_dumps(out) 1129 return open_consolidated(store, metadata_key=metadata_key) 1130 /srv/conda/envs/notebook/lib/python3.9/site-packages/zarr/storage.py in __setitem__(self, key, value) 1134 def __setitem__(self, key, value): 1135 if self.mode == 'r': -> 1136 raise ReadOnlyError() 1137 key = self._normalize_key(key) 1138 path = self.dir_path(key) ReadOnlyError: object is read-only ``` #### Problem description When using a fsspec compatible string as the store argument, zarr's `consolidate_metadata` function returns a read-only store, resulting in the error shown above. I believe the expected behavior is to have generate an `FSStore` object with write privileges (so the consolidated meatadata key can be written). #### Version and installation information Please provide the following: * Value of ``zarr.__version__``: 2.10.3 * Value of ``numcodecs.__version__``: 0.9.1 * Version of Python interpreter: 3.9 * Operating system (Linux/Windows/Mac): Linux and OSX * How Zarr was installed: conda

Leonard_Strnad · January 5, 2022, 4:02pm

@rabernat that is what I am discovering.

Being limited to slice() in the region argument implies that one must use integers to write out to slices only and not fancy indexing. My current target coords are below. I use dask for tasks at the tile level for all available years, which might be a subset of the target. When I write to the target, I iterate over the available years and write to a specific tile. I can not use slice because I might be missing a year. For example, the implication is that I need to know that tile 376_521_10 is the 1st element in the target (can easily be represented as a slice) and times [2016-01-01', '2015-01-01'] have an integer index of [0,2] (which can not be represented as a slice). I could parallelize at the year-tile level, but some tiffs I am “ingesting” stack yearly info in the bands, which is annoying.

Anyways, does anyone know if there are plans to support fancyish indexing? For example, to_zarr(region=dict(time=[0,1,2,4]))? And if that is supported, maybe supporting to_zarr(region=dict(tile=['376_521_10'], time=['2016-01-01','2015-01-01']))

Leonard_Strnad · January 13, 2022, 7:44pm

Does anyone know of a way to write target datasets with compte=False by taking advantage of dask distributed?

It takes roughly 30 seconds for each of my two targets and I do have a dask cluster I could leverage.

Any known ways to speed that up?

TomAugspurger · January 13, 2022, 8:07pm

It might be a bit difficult to learn from, but pangeo-forge-recipes does use compute=False at pangeo-forge-recipes/xarray_zarr.py at 6194c72343d82e39b09edcafbe7085b1d9464787 · pangeo-forge/pangeo-forge-recipes · GitHub. The idea is to pre-allocate the zarr metadata for your output dataset. Then it’s able to write individual chunks in parallel (in pangeo-forge that’s done in store_chunk: pangeo-forge-recipes/xarray_zarr.py at 6194c72343d82e39b09edcafbe7085b1d9464787 · pangeo-forge/pangeo-forge-recipes · GitHub)

Leonard_Strnad · January 13, 2022, 9:37pm

When I call target.to_zarr(store, compute=False) it takes 30 seconds and runs locally and I have to do this twice for two separate targets. I am just looking for a possible way to speed that up.

Is there a way to submit this to the cluster itself to speed up writing the pre-allocated metadata?

Edit: I have looked at the recipes and honestly struggle to follow along. I am still pretty new to xarray and dask.

rabernat · January 14, 2022, 2:47pm

When you call target.to_zarr(store, compute=False), Xarray will write the overall dataset structure and the dimension coordinates. This should not be a data-intensive step. The amount of time it takes will depend on many factors, including:

The number of variables in the dataset
The amount of metadata
The size of the dimension coordinates
The quality of the internet connection between client and target

To understand why it takes 30s, I recommend profiling. Specifically snakeviz is our tool of choice.

If you would like more help here to optimize performance, I recommend you post the following information here to the forum:

The complete code you are running (e.g. link to gist)
The full Xarray repr of the dataset you’re trying to write (print(target))
Detailed information about the target Zarr dataset (open the store with Zarr and use the Zarr .info methods)
The snakeviz profile

Leonard_Strnad · January 14, 2022, 5:34pm

I appreciate it. I was simply seeing if there might be a way to speed up writing target metadata. I’ll keep that in mind and when asking for specific help optimizing a specific thing I’ll follow those guidelines!

btw, speeding this up is not really that big of a deal, just trying to get a lay of the land since I struggle to find docs around all potential use cases of .to_zarr(compute=False). For example, what I didn’t realize we could .to_zarr(..., mode="a", compute=False). I gather that is delaying the chunk write and not the target meta, but does that require that an initial target’s metadata is written first?

RichardScottOZ · January 15, 2022, 12:08am

Thanks Ryan, likely going to run into a similar problem at some stage.

Topic		Replies	Views
Synchronizer for Zarr + Dask on Kubernetes Data	10	1877	January 16, 2024
Advice on writing many slices from one remote zarr xarray to another Data	4	578	January 15, 2022
Rate limited with `.to_zarr(region=)` Data	3	584	March 4, 2022
Using to_zarr(region=) and extending the time dimension? Data	10	2499	June 22, 2022
Map_blocks and to_zarr(region=) Data	6	1224	May 26, 2022

Writing to lat lon regions with to_zarr(region=)

Related topics