I’m working on the development/architecture of a high-resolution multi-dimensional gridded dataset that would cover the continental US. The dataset would have three dimensions, with cells roughly 1 m square: about 4.6e6 in x and 2.6e6 in y, plus a third dimension of size ~24. That's on the order of 10^14 float32 values.
In some initial tests with xarray / Dask / Zarr on a small, isolated area, I’ve run into issues with the size of the Dask task graph, which overwhelms a sizable scheduler even without any complex calculations. This led me to think that a single CONUS-scale Zarr at this resolution might be cumbersome.
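To make the scale concrete, here is some rough arithmetic (the 1024-cell chunk edge is a hypothetical choice, just for illustration): even with fairly large chunks, a single CONUS-scale store implies on the order of ten million chunks, and Dask typically creates at least one task per chunk just to read the data.

```python
import math

# Approximate CONUS grid at 1 m resolution (from the figures above)
nx, ny, nz = 4_600_000, 2_600_000, 24
chunk = 1024  # hypothetical chunk edge length in x and y; z in one chunk

# Number of chunks in a single monolithic Zarr store
n_chunks = math.ceil(nx / chunk) * math.ceil(ny / chunk)
print(f"{n_chunks:,} chunks")  # ~11.4 million chunks -> at least that many tasks

# Uncompressed size of the full float32 cube
bytes_total = nx * ny * nz * 4
print(f"{bytes_total / 1e15:.2f} PB uncompressed")  # ~1.15 PB
```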
I started looking into creating a zarr-of-zarrs via Kerchunk, dividing the CONUS-scale zarr spatially into many smaller zarr stores, similar to e.g., USGS 7.5-minute quads. It seems like this would make ongoing maintenance of the overall dataset (e.g., partial updates) easier for a variety of reasons.
However, I’m struggling to figure out how to get Kerchunk to combine multiple zarr stores correctly in both the x and y dimensions. Are there any existing examples of this? And perhaps more importantly, is a zarr-of-zarrs even a good idea?
Test [pseudo]code:
import xarray as xr
from kerchunk.zarr import ZarrToZarr
from kerchunk.combine import MultiZarrToZarr

# open a single smaller area to inspect it
xr.open_zarr("./test-zarr/29.zarr")
# index numbers corresponding to different adjacent subareas
# (these 9 squares should combine 3x3 to form a single larger area)
indexes = [29, 30, 31, 38, 39, 40, 49, 50, 51]
# spatially:
# 29 38 49
# 30 39 50
# 31 40 51
# build a kerchunk reference dict for each subarea
z2zs = []
for i in indexes:
    zarr_path = f"./test-zarr/{i}.zarr"
    z2z = ZarrToZarr(zarr_path).translate()
    z2zs.append(z2z)
# combine all 9 reference dicts into one
mzz = MultiZarrToZarr(
    z2zs,
    concat_dims=["y", "x"],
    identical_dims=["band"],
)
ref = mzz.translate()
# open the combined references as a single dataset
backend_kwargs = {
    "consolidated": False,
    "storage_options": {
        "fo": ref,
    },
}
ds = xr.open_dataset("reference://", engine="zarr", backend_kwargs=backend_kwargs)
ds
Looking at the kerchunk’d x/y dimensions above, both are 3 times larger than they should be: all 9 subareas were concatenated along both the x and y dimensions, instead of being arranged into a 3x3 grid. My next step would be to perform multiple concatenations, one dimension at a time: combine the stores within each column, then combine the resulting columns together. But at this point I figure Pangeo knows far better than me.
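For reference, the two-stage approach I have in mind would look roughly like this. This is a sketch, not tested at scale: it assumes the column-major index layout shown above (each run of 3 consecutive indexes is one column), and it assumes MultiZarrToZarr orders the inputs by their coordinate values along each concat dim. group_columns and combine_grid are hypothetical helper names.

```python
from typing import Dict, List


def group_columns(indexes: List[int], ncols: int) -> List[List[int]]:
    """Split the flat index list into spatial columns (column-major layout)."""
    nrows = len(indexes) // ncols
    return [indexes[c * nrows:(c + 1) * nrows] for c in range(ncols)]


def combine_grid(root: str, indexes: List[int], ncols: int) -> Dict:
    """Two-stage kerchunk combine: along y within each column, then along x."""
    # imports live inside the function so the sketch only needs kerchunk
    # when it is actually run against real stores
    from kerchunk.zarr import ZarrToZarr
    from kerchunk.combine import MultiZarrToZarr

    column_refs = []
    for col in group_columns(indexes, ncols):
        refs = [ZarrToZarr(f"{root}/{i}.zarr").translate() for i in col]
        # first pass: concatenate the stores in this column along y only
        column_refs.append(
            MultiZarrToZarr(
                refs, concat_dims=["y"], identical_dims=["band"]
            ).translate()
        )
    # second pass: concatenate the combined columns along x only
    return MultiZarrToZarr(
        column_refs, concat_dims=["x"], identical_dims=["band"]
    ).translate()


# e.g. combine_grid("./test-zarr", [29, 30, 31, 38, 39, 40, 49, 50, 51], 3)
```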
Edit: is a Kerchunk merge effectively what I’m asking for? Just found this comment: