`.to_zarr(..., compute=False)` optimizations

I am looking for some tips to speed up .to_zarr(…, compute=False) when the dataset is large and there are many coordinate values.

import dask.array as da
import xarray as xr
import numpy as np

example = xr.Dataset(
    data_vars={
        "variables": (
            ("band", "x", "y"),
            da.random.random((3, 200_000, 200_000), chunks=(3, 1024, 1024)),
        )
    },
    coords={"x": np.arange(200_000), "y": np.arange(200_000)},
)

# takes over 2 minutes and slows down tremendously as the data gets even bigger.
# example.to_zarr("/tmp/test.zarr", compute=False)

Versions:

  • zarr==3.1.1
  • xarray==2025.7.1

The likely problem here is that the Dask graph for that array is huge: with (3, 1024, 1024) chunks over a (3, 200_000, 200_000) array there are roughly 196 × 196 ≈ 38,000 chunks, and a task graph that size bogs everything down.

Instead, use this pattern:

  • Use a single Dask chunk for the Xarray dataset
  • Specify the on-disk chunk size via encoding

Here’s an example of that from our Serverless Data Cube Demo:
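
A minimal sketch of that pattern, applied to the dataset from the original post (this is not the demo code itself, and the on-disk chunk sizes below are just illustrative):

import dask.array as da
import numpy as np
import xarray as xr

# A single Dask chunk keeps the task graph tiny; with compute=False only
# the metadata is written, so nothing is ever loaded into memory.
example = xr.Dataset(
    data_vars={
        "variables": (
            ("band", "x", "y"),
            da.random.random((3, 200_000, 200_000), chunks=-1),
        )
    },
    coords={"x": np.arange(200_000), "y": np.arange(200_000)},
)

# The on-disk Zarr chunking is specified via encoding rather than via Dask.
example.to_zarr(
    "/tmp/test.zarr",
    compute=False,
    mode="w",
    encoding={"variables": {"chunks": (3, 1024, 1024)}},
)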

An even better solution would be to stop overloading to_zarr and actually implement schema creation in Xarray. That has been discussed extensively here:


@rabernat thanks for sharing! Nice trick with the coordinate encoding optimization, very cool.

I think it is also worth mentioning xarray-contrib/rasterix (raster tools for xarray, on GitHub) for the case where the underlying data is raster data.

In addition to the chunks=-1 and encoding trick proposed above, you can effectively eliminate the need to write the x and y coords altogether and instead compute them on the fly, which reduces the bytes read from and written to cloud storage.

import dask.array as da
import numpy as np
import rasterix
import rioxarray  # noqa: F401  (registers the .rio accessor used below)
import xarray as xr

w = 2**20
h = 2**20

example = xr.Dataset(
    data_vars={
        "variables": (
            ("band", "x", "y"),
            da.random.random((128, w, h), chunks=-1),
        )
    },
    coords={"x": np.linspace(0, 5, w), "y": np.linspace(0, 5, h)},
)

# rasterix can read the GeoTransform from odc or rioxarray's spatial_ref convention
# so we make sure to write the transform
example = example.rio.write_transform(transform=example.rio.transform(recalc=True))

store1 = "/tmp/test1.zarr"
example.to_zarr(
    store1,
    compute=False,
    mode="w",
    zarr_format=3,
    consolidated=True,
)

store2 = "/tmp/test2.zarr"
example.drop(("x", "y")).to_zarr(
    store2,
    compute=False,
    mode="w",
    zarr_format=3,
    consolidated=True,
)

rt = xr.open_zarr(store1)
rt_wo_xy = xr.open_zarr(store2)
if "spatial_ref" in rt_wo_xy.data_vars:
    # sometimes this coord gets written as a variable
    rt_wo_xy = rt_wo_xy.set_coords("spatial_ref")
rt_wo_xy = rt_wo_xy.pipe(rasterix.assign_index)

# note the exact equality here that we don't get with compression tricks
np.testing.assert_equal(rt_wo_xy["x"].values, rt["x"].values)
np.testing.assert_equal(rt_wo_xy["y"].values, rt["y"].values)

A motivating use case is when x and y start to get large and we don't want to use the compression trick in the code above (and accept its slight precision errors).

This runs quickly locally, but as soon as you have to write and read many MB of coordinate data to/from cloud storage, storing the x and y coords becomes noticeably slower.

The caveat is that any consumer of this data must compute the x/y coords from the GeoTransform, and rasterix makes this very convenient (along with other niceties!).


I agree that the rasterix / affine coordinate approach is overall a much more elegant solution.

However, reading a few MB of coordinates should be no problem. Good compression and encoding choices can usually take this down to ~1MB even for very large domains. Make sure the coordinates are not chunked.
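
For example, a minimal sketch of that encoding for the dataset from the original post (the single-chunk sizes here are illustrative and simply match that example):

# Store each coordinate as one unchunked Zarr array so it can be fetched
# in a single small request when the store lives in the cloud.
encoding = {
    "x": {"chunks": (200_000,)},
    "y": {"chunks": (200_000,)},
}
example.to_zarr("/tmp/test.zarr", compute=False, encoding=encoding, mode="w")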
