Zarr-of-Zarrs / Multi-dimensional Kerchunking

I’m working on the development/architecture of a high-resolution multi-dimensional gridded dataset covering the continental US. The dataset would have three dimensions, with cells roughly 1 m square – hence sizes of about x: 4.6e6 and y: 2.6e6, plus a third dimension of size ~24. That works out to on the order of 10^14 float32 values.
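
As a quick back-of-the-envelope check of those numbers:

# rough sanity check of the dataset size described above
nx, ny, nband = 4.6e6, 2.6e6, 24
n_values = nx * ny * nband              # ≈ 2.9e14 float32 values
print(f"{n_values:.1e} values, ~{n_values * 4 / 1e15:.1f} PB uncompressed")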

In some initial tests with xarray / Dask / Zarr on a small, isolated area, I’ve run into issues with the size of the Dask task graph, which overwhelms a sizable scheduler even without any complex calculations. This led me to think that a single CONUS-scale zarr at this resolution might be cumbersome.

I started looking into creating a zarr-of-zarrs via Kerchunk, dividing the CONUS-scale zarr spatially into many smaller zarr stores, similar to e.g., USGS 7.5-minute quads. It seems like this would make ongoing maintenance of the overall dataset (e.g., partial updates) easier for a variety of reasons.

However, I’m struggling to figure out how to get Kerchunk to combine multiple zarr stores correctly in both the x and y dimensions. Are there any existing examples of this? And perhaps more importantly, is a zarr-of-zarrs even a good idea?

Test [pseudo]code:

import xarray as xr

from kerchunk.combine import MultiZarrToZarr
from kerchunk.zarr import ZarrToZarr

# open a single smaller area
xr.open_zarr("./test-zarr/29.zarr")

# index no's corresponding to different adjacent subareas
# (these 9 squares should combine 3x3 to form a single larger area)
indexes = [29, 30, 31, 38, 39, 40, 49, 50, 51]
# spatially:
# 29  38  49
# 30  39  50
# 31  40  51

# translate each subarea's Zarr metadata into kerchunk references
z2zs = []

for i in indexes:
    zarr_path = f"./test-zarr/{i}.zarr"
    z2z = ZarrToZarr(zarr_path).translate()
    z2zs.append(z2z)

# attempt to combine all nine reference sets across y and x in a single pass
mzz = MultiZarrToZarr(
    z2zs,
    concat_dims=["y", "x"],
    identical_dims=["band"],
)
ref = mzz.translate()

# open the combined references via fsspec's "reference://" filesystem
backend_kwargs = {
    "consolidated": False,
    "storage_options": {
        "fo": ref,
    }
}
ds = xr.open_dataset("reference://", engine="zarr", backend_kwargs=backend_kwargs)
ds

Looking at the kerchunk’d x/y dimensions above, they are 3 times larger than they should be, since all 9 subareas were concatenated in both the x and y dimensions, instead of forming a 3x3 square. My next step would be to perform multiple concatenations, one dimension at a time: combine the stores within each column first, and then combine the resulting columns. But at this point I figure Pangeo knows far better than me. :slight_smile:
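
For concreteness, an untested sketch of that two-step idea (it assumes each tile carries correct, distinct x/y coordinate values so the combines can place them):

from kerchunk.combine import MultiZarrToZarr
from kerchunk.zarr import ZarrToZarr

# untested: combine each column of tiles along y first, then the columns along x
columns = [[29, 30, 31], [38, 39, 40], [49, 50, 51]]

column_refs = []
for col in columns:
    refs = [ZarrToZarr(f"./test-zarr/{i}.zarr").translate() for i in col]
    # tiles within a column share the same x coordinates
    mzz_col = MultiZarrToZarr(refs, concat_dims=["y"], identical_dims=["band", "x"])
    column_refs.append(mzz_col.translate())

# after the first pass, each column reference spans the full y range
mzz_cols = MultiZarrToZarr(column_refs, concat_dims=["x"], identical_dims=["band", "y"])
ref_combined = mzz_cols.translate()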

Edit: is a Kerchunk merge effectively what I’m asking for? Just found this comment:

First a couple of quick answers:

  • yes, with kerchunk it is totally fine to combine datasets that are already zarr. It should make figuring out which chunk is where easier, but of course you end up duplicating some metadata (which should be small)

  • if your main problem is the size of the dask graph, then you may consider setting chunks={…} to be a multiple of the original chunk size. You may end up loading a little more of the data than you need, but if you are processing some significant window of the data, this is actually faster (due to concurrent reads) and you end up with fewer tasks. Which axis(es) to expand depends on your workflow, but you can try (4, 2048, 2048) (8x the size) as a starter. Partition sizes up to 100 MB and beyond are typical in dask workloads.

  • we can figure out what’s going wrong with your kerchunk combine, but getting the call just right can be tricky!

Thanks. I’ve experimented with increasing the chunk size, and that’s definitely helped, but I guess my reason for looking at Kerchunk in this way is the projected size of the CONUS-scale dataset, which would be on the order of ~100 TB – 1 PB. 100 MB chunks would mean ~1M chunks total, and larger chunk sizes would make certain use cases trickier. (For instance, working with smaller portions of the dataset would be a common use case.)
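
For reference, the coarser-chunk suggestion looks roughly like this (chunk sizes are purely illustrative; dimension names are taken from the test code above):

import xarray as xr

# ask dask for chunks that are a multiple of the on-disk Zarr chunks,
# trading slightly larger reads for far fewer tasks
ds = xr.open_zarr(
    "./test-zarr/29.zarr",
    chunks={"band": 4, "y": 2048, "x": 2048},
)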

Something I hadn’t thought of until now – I suppose I’ve been hoping that Kerchunk could somehow paper over a very large Dask task graph. But that doesn’t seem right… there are still the same number of actual chunks underneath.

Back to the combine – should it actually be working this way, with the x and y dimensions in a single combine? Or are multiple combine steps required, as I suspect?

1 PB. 100 MB chunks would mean ~1M chunks total, and larger chunk sizes would make certain use cases trickier.

Worth noting that, client-side, there should be layers that prescribe how to make tasks, not the tasks themselves. I believe this works, but there used to be issues. Then, when you slice, only the tasks required for your compute should materialise on the scheduler.
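
For example (a sketch with a hypothetical ./conus.zarr path; the point is only that slicing before computing keeps the materialised graph small):

import xarray as xr

# open the full dataset lazily
ds = xr.open_zarr("./conus.zarr")

# only the chunks overlapping this window should become tasks on the scheduler
window = ds.isel(x=slice(0, 4096), y=slice(0, 4096))
result = window.mean().compute()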

I suppose I’ve been hoping that Kerchunk could somehow paper over a very large Dask task graph

Sorry, no.

should it actually be working in this way,

Yes, combine ought to be able to cope with multiple dimensions in one pass, but I haven’t carefully read your invocation.

@thwllms - I’m curious… since you control the data generation process, why not just use one big Zarr array, which you read or write to in regions? Is it really necessary to create many smaller Zarr stores (i.e. tiles) and combine them using Kerchunk?

The great thing about Zarr is that it can accommodate arrays of arbitrary size (unlike, say, GeoTIFF). Kerchunk is great for working with legacy files, or when you don’t control the generation process, but it may not be needed for your use case.

@rabernat great point, I hadn’t considered reading/writing specific regions as a way to manage task-graph size challenges. That seems like the path of least resistance, and makes a zarr-of-zarrs feel a bit like a hat on a hat.

Awesome! There are lots of people who use Zarr this way for your specific use case: continental scale geospatial rasters. If you’re writing data from Xarray, the ds.to_zarr(..., region=...) pattern is really useful.
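
A minimal, self-contained sketch of that pattern, with toy sizes and made-up variable names:

import dask.array as da
import numpy as np
import xarray as xr

# 1) lazily initialise the full-extent store: metadata and coordinates are
#    written now, but the dask-backed data variable is not computed
nband, ny, nx = 3, 300, 300
template = xr.Dataset(
    {
        "values": (
            ("band", "y", "x"),
            da.zeros((nband, ny, nx), chunks=(1, 100, 100), dtype="float32"),
        )
    },
    coords={"band": np.arange(nband), "y": np.arange(ny), "x": np.arange(nx)},
)
template.to_zarr("./region-demo.zarr", compute=False)

# 2) later (possibly from another process), write one tile into its slot;
#    regions should line up with chunk boundaries for safe parallel writes
tile = xr.Dataset(
    {
        "values": (
            ("band", "y", "x"),
            np.ones((nband, 100, 100), dtype="float32"),
        )
    },
    coords={"band": np.arange(nband), "y": np.arange(100, 200), "x": np.arange(200, 300)},
)
# "band" has none of the region dims, so drop it before the region write
tile.drop_vars("band").to_zarr(
    "./region-demo.zarr",
    mode="r+",
    region={"y": slice(100, 200), "x": slice(200, 300)},
)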

We also know all too well the issues that can come along with generating large zarrs, and we just released zappend, a tool to facilitate such typical tasks. You may want to have a look and provide us with feedback.

Some features:

  • zappend performs transaction-based dataset appends to existing Zarr data cubes. If an error occurs during an append step, typically due to I/O problems or out-of-memory conditions, zappend will automatically roll back the operation, ensuring that the existing data cube maintains its structural integrity.

  • It is easy to use and offers high configurability regarding filesystems, data source types, data cube outlines, and encoding.

  • It comprises a command-line interface, a Python API, and comprehensible documentation.

https://bcdev.github.io/zappend/
https://anaconda.org/conda-forge/zappend
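
A rough sketch of the Python API, going by the documentation linked above (the slice files are hypothetical and exact argument names may differ):

# hypothetical usage; see the zappend docs for the actual configuration options
from zappend.api import zappend

zappend(
    ["slice-2023-01.nc", "slice-2023-02.nc"],  # hypothetical slice datasets
    target_dir="./target.zarr",
)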
