Extremely slow rechunking of Zarr store with xarray

Thanks for the question. We’ll do our best to help. I’d like to first offer some feedback on how to make your post more easily “answerable.”

  1. Include code / data examples that will allow us to reproduce your problem.
  2. Your post contains several related but distinct questions. You might get a better response by just asking one question at a time.

Let me try to enumerate the distinct questions:

2x longer than what? How are you measuring “access time”? I don’t understand what you are asking here, so I’m not going to try to answer this part unless you can provide some clarification.

This sounds like it could be a bug in Xarray. Xarray should respect the chunks attribute of the variable’s encoding when writing to Zarr. To check this, I wrote a short test:

import xarray as xr
import numpy as np
import zarr

# Build a small dataset and request a chunk size of 10 via the encoding.
ds = xr.DataArray(
    np.arange(100),
    name='foo',
    dims='x'
).to_dataset()
ds['foo'].encoding = {'chunks': 10}
ds.to_zarr('test.zarr', mode='w')

# Reopen the store with bare Zarr and confirm the chunking was honored.
zgroup = zarr.open('test.zarr')
assert zgroup['foo'].chunks == (10,)

This verifies that Xarray is using the chunks specified in the encoding when writing the Zarr array. Without seeing your code, it’s hard to know what is going wrong in your case.

It sounds like you want to create a Zarr array that is contiguous in time but chunked in the other dimensions (x, y), and then append to your array in time. This will never work well in Zarr. The reason is fundamental: whenever you make a write that touches a Zarr chunk, the entire chunk has to be rewritten. Zarr does not support “partial chunk writes.” (Note that other storage formats like TileDB might work better here.) So every time you append to an array that is chunked this way, you basically have to rewrite the entire file. If your ingestion process requires you to append to a Zarr array along the time dimension, you are definitely better off chunking the array in time from the beginning.

However, it also sounds like you want to do timeseries analysis at each point. This requires an access pattern that is orthogonal to your write pattern. :man_facepalming: This was the same issue discussed in this epic thread.

The best solution we can offer is the rechunker package, which you are already using.

This should definitely not take so long. On a fast machine, you should be able to rechunk 5.4 GB of data in just a few seconds. If you provide more details of how you are invoking rechunker, maybe we can help you improve this.

This is probably not a good idea. You should think hard about your access patterns and choose chunks that are optimized for your use case. There is no universal “good” chunking scheme. Everything depends on how you will access the data. Just remember the main rule: there are no partial reads / writes to Zarr chunks. If your operation touches a single item in a chunk, the whole chunk needs to be read / written.
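You can see why this rule matters with some back-of-the-envelope arithmetic. The shapes below are hypothetical, but the pattern is typical: reading one point’s full timeseries from a time-chunked array touches thousands of chunks, while the same read from a space-chunked array touches one:

```python
import math

def chunks_touched(chunks, selection_shape):
    # Number of chunks an origin-aligned read of `selection_shape` must
    # touch: ceil(selection / chunk_size) along each axis, multiplied.
    return math.prod(math.ceil(s / c) for s, c in zip(selection_shape, chunks))

# Hypothetical array: (time=8760, y=720, x=1440).
point_series = (8760, 1, 1)  # full timeseries at a single (y, x) point

# Chunked in time (good for appending, terrible for timeseries access):
time_chunked = chunks_touched((1, 720, 1440), point_series)
# Chunked in space (one read per timeseries):
space_chunked = chunks_touched((8760, 10, 10), point_series)

print(time_chunked)   # 8760 chunks for a single timeseries
print(space_chunked)  # 1 chunk
```

The same arithmetic applies to writes, which is why the append pattern and the timeseries-access pattern pull the chunking in opposite directions.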

As a final question, let me ask this: why Zarr? Did you try a more mature format like HDF5 before deciding that you needed Zarr? How did that go?
