`.to_zarr(..., compute=False)` optimizations

I am looking for some tips to speed up .to_zarr(…, compute=False) when the dataset is large and there are many coordinate values.

import dask.array as da
import xarray as xr
import numpy as np

example = xr.Dataset(
    data_vars={
        "variables": (
            ("band", "x", "y"),
            da.random.random((3, 200_000, 200_000), chunks=(3, 1024, 1024)),
        )
    },
    coords={"x": np.arange(200_000), "y": np.arange(200_000)},
)

# takes over 2 minutes and slows down tremendously as the data gets even bigger.
# example.to_zarr("/tmp/test.zarr", compute=False)

Versions:

  • zarr==3.1.1
  • xarray==2025.7.1

The likely problem here is that the Dask graph for that array is huge: with (3, 1024, 1024) chunks over a (3, 200_000, 200_000) array there are roughly 196 × 196 ≈ 38,000 chunks, and a task graph that size bogs everything down.

Instead, use this pattern:

  • Use a single Dask chunk for the Xarray dataset
  • Specify the on-disk chunk size via encoding

Here’s an example of that from our Serverless Data Cube Demo:
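
A minimal sketch of that pattern, applied to the dataset from the original post (this is not the demo code itself, and the on-disk chunk sizes below are just illustrative):

import dask.array as da
import numpy as np
import xarray as xr

# A single Dask chunk keeps the task graph tiny; with compute=False only
# the metadata is written, so nothing is ever loaded into memory.
example = xr.Dataset(
    data_vars={
        "variables": (
            ("band", "x", "y"),
            da.random.random((3, 200_000, 200_000), chunks=-1),
        )
    },
    coords={"x": np.arange(200_000), "y": np.arange(200_000)},
)

# The on-disk Zarr chunking is specified via encoding rather than via Dask.
example.to_zarr(
    "/tmp/test.zarr",
    compute=False,
    mode="w",
    encoding={"variables": {"chunks": (3, 1024, 1024)}},
)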

An even better solution would be to stop overloading to_zarr and actually implement schema creation in Xarray. That has been discussed extensively here:


@rabernat thanks for sharing! Nice trick with the coordinate encoding optimization, very cool.

I think it is also worth mentioning xarray-contrib/rasterix (raster tools for xarray, on GitHub) for the case where the underlying data is raster data.

In addition to the chunks=-1 and encoding trick proposed above, you can effectively eliminate the need to write the x and y coords altogether and instead compute them on the fly, which reduces the bytes read from and written to cloud storage.

import dask.array as da
import numpy as np
import rasterix
import rioxarray  # noqa: F401  (registers the .rio accessor used below)
import xarray as xr

w = 2**20
h = 2**20

example = xr.Dataset(
    data_vars={
        "variables": (
            ("band", "x", "y"),
            da.random.random((128, w, h), chunks=-1),
        )
    },
    coords={"x": np.linspace(0, 5, w), "y": np.linspace(0, 5, h)},
)

# rasterix can read the GeoTransform from odc or rioxarray's spatial_ref convention
# so we make sure to write the transform
example = example.rio.write_transform(transform=example.rio.transform(recalc=True))

store1 = "/tmp/test1.zarr"
example.to_zarr(
    store1,
    compute=False,
    mode="w",
    zarr_format=3,
    consolidated=True,
)

store2 = "/tmp/test2.zarr"
example.drop(("x", "y")).to_zarr(
    store2,
    compute=False,
    mode="w",
    zarr_format=3,
    consolidated=True,
)

rt = xr.open_zarr(store1)
rt_wo_xy = xr.open_zarr(store2)
if "spatial_ref" in rt_wo_xy.data_vars:
    # sometimes this coord gets written as a variable
    rt_wo_xy = rt_wo_xy.set_coords("spatial_ref")
rt_wo_xy = rt_wo_xy.pipe(rasterix.assign_index)

# note the exact equality here that we don't get with compression tricks
np.testing.assert_equal(rt_wo_xy["x"].values, rt["x"].values)
np.testing.assert_equal(rt_wo_xy["y"].values, rt["y"].values)

A motivating use case is when x and y start to get large and we don't want to use the compression trick in the code above (and accept its slight precision errors).

This runs quickly locally, but as soon as you have to write and read many MB of coordinate data to/from cloud storage, storing the x and y coords becomes noticeably slower.

The caveat is that any consumer of this data must compute the x/y coords from the GeoTransform, and rasterix makes this very convenient (along with other niceties!).


I agree that the rasterix / affine coordinate approach is overall a much more elegant solution.

However, reading a few MB of coordinates should be no problem. Good compression and encoding choices can usually take this down to ~1MB even for very large domains. Make sure the coordinates are not chunked.
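
For example, a minimal sketch of that encoding for the dataset from the original post (the single-chunk sizes here are illustrative and simply match that example):

# Store each coordinate as one unchunked Zarr array so it can be fetched
# in a single small request when the store lives in the cloud.
encoding = {
    "x": {"chunks": (200_000,)},
    "y": {"chunks": (200_000,)},
}
example.to_zarr("/tmp/test.zarr", compute=False, encoding=encoding, mode="w")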
