How to efficiently overwrite an existing zarr archive with a reordered time axis? (Updated Question)

I am having the following problem:

I am using xarray.open_mfdataset(['archive1', 'archive2'], engine='zarr') to open multiple zarr archives at once. During the creation of archive2, data was appended to the archive in a non-chronological way. This means my time axis is neither monotonically increasing nor decreasing, which leads to an error when using open_mfdataset.
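For reference, the failing call looks like this (the archive names are placeholders; xarray refuses to combine because the global time index would not be monotonic):

import xarray as xr

# archive2's time axis is not monotonic, so combining by coordinates
# fails with a ValueError about non-monotonic global indexes.
ds = xr.open_mfdataset(['archive1', 'archive2'], engine='zarr')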

For small to medium-sized archives, it was possible to fix this by opening the archive with xarray.open_dataset, applying sortby(), and storing the result as a copy. Afterwards I removed the old archive and renamed the new, fixed version.
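What I did for the smaller archives looks roughly like this (a sketch; the paths are placeholders):

import shutil
import xarray as xr

# Open the unsorted archive, sort along time, and write a fixed copy.
ds = xr.open_dataset("archive2", engine="zarr")
ds.sortby("time").to_zarr("archive2_fixed", mode="w")

# Replace the old archive with the fixed copy.
shutil.rmtree("archive2")
shutil.move("archive2_fixed", "archive2")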

But now I am facing this issue with a huge zarr archive, and the machine runs out of memory (even though it has 128 GB of RAM).
So is there another way to reorder the time axis of a zarr archive?

Can you please clarify: are whole chunks out of time order, or are individual coordinate values within each chunk also out of order?

The dataset is chunked along the time axis. So, e.g., we have dimensions (96, 1000, 1000) and a chunk size of (1, 100, 100) for 96 timesteps.
Does this help?

If you have exactly one timestamp per chunk, then kerchunk can process this for you without any need for sorting or otherwise rewriting the data.

https://fsspec.github.io/kerchunk/
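For context, a kerchunk reference set is essentially a mapping from logical chunk keys to byte ranges in files on disk. With exactly one timestamp per chunk, "sorting" the time axis reduces to permuting the time index in those keys; no data needs to move. A minimal sketch of the idea in Python (the variable name, paths, offsets, and sizes are made up for illustration, and the metadata entries would have to be filled in from the real store):

import fsspec
import xarray as xr

# Kerchunk-style references: logical chunk key -> [file, offset, length].
# Logical time chunk 0 below points at on-disk chunk 5, and vice versa.
refs = {
    "air/0.0.0": ["archive2/air/5.0.0", 0, 40000],
    "air/5.0.0": ["archive2/air/0.0.0", 0, 40000],
    # ... remaining chunk keys plus the .zgroup/.zarray/.zattrs entries ...
}

# A complete reference set can then be opened like a normal zarr store:
fs = fsspec.filesystem("reference", fo=refs)
ds = xr.open_dataset(fs.get_mapper(""), engine="zarr", consolidated=False)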

Thanks for this hint; I will check it out and get back to you asap.

I like Martin’s suggestion. Kerchunk allows you to create a translation layer between the chunks seen by the user and the chunks on disk.

You might also be able to use Xarray’s lazy indexing capabilities to accomplish the same thing. It would work something like this (untested pseudocode):

import xarray as xr

ds = xr.open_dataset(zarr_store, engine="zarr")  # by default no dask
correct_time_order = ...  # figure this out somehow, e.g. an argsort of the time coordinate
ds_sorted = ds.isel(time=correct_time_order)
ds_sorted_chunked = ds_sorted.chunk({'time': desired_time_chunks})

I’m not 100% sure that zarr indexing in xarray supports this, but it’s worth a try.

I just confirmed that the idea above works. Here is some code to reproduce the solution I proposed.

import xarray as xr
import numpy as np

ds = xr.tutorial.open_dataset("air_temperature")
ds.chunk({"time": 1}).to_zarr("air_temperature.zarr", mode="w")
ds2 = xr.open_dataset("air_temperature.zarr", engine="zarr")
assert not ds2.air.variable._in_memory  # data have not been loaded yet
time_order = np.arange(ds2.dims['time'])
np.random.shuffle(time_order)  # random permutation standing in for the unsorted time order
ds_sorted = ds2.isel(time=time_order)
assert not ds_sorted.air.variable._in_memory  # data still have not been loaded

# this triggers evaluation
xr.testing.assert_equal(
    ds.air.isel(time=time_order[0]),
    ds_sorted.air.isel(time=0)
)

In general, ds.sortby('time') works as well; that is my actual solution. But I was not able to write the resulting dataset back to disk or overwrite the existing archive.

Using isel with the correct order is a really nice alternative, but would that mean I can no longer use open_mfdataset(), or am I wrong?

Actually, I realize that my question should be changed to:

How to efficiently overwrite an existing zarr archive with a reordered time axis?

I would probably avoid trying to overwrite the dataset in place. Just write a new one, delete the old one, and rename it. Does that work?

Can you call ds.to_zarr on your reordered dataset?

That’s the way I do it, but I run out of memory during this process because my dataset is very large.

It works for smaller datasets but not for large ones.

I thought it would be possible to adapt the mapping in the JSON files, which should be really fast and simple.

I’m just confused why that is happening. Are you sure you are following this sequence precisely:

  1. Open the original dataset without any chunks, i.e. chunks=None
  2. Do the reordering
  3. Then call .chunk with the desired time chunks
  4. Finally call .to_zarr (see the sketch below)
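In code, that sequence might look like this (untested sketch; the store paths and chunk size are placeholders):

import xarray as xr

# 1. Open the original store lazily, without dask (chunks=None is the
#    default for open_dataset, so nothing is chunked here).
ds = xr.open_dataset("archive2", engine="zarr")

# 2. Reorder along time; this stays lazy thanks to xarray's lazy indexing.
ds_sorted = ds.sortby("time")

# 3. Chunk the reordered dataset with the desired time chunking.
ds_sorted = ds_sorted.chunk({"time": 1})

# 4. Stream the result chunk by chunk to a new store.
ds_sorted.to_zarr("archive2_fixed", mode="w")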

The kerchunk solution is a fine one, but at the same time it should be possible to make the xarray-based one work. If not, there is a bug somewhere that needs to be tracked down.


Dear @rabernat, today I had the pleasure of using rechunker for the first time, and it works really well so far.
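For reference, my rechunker call looks roughly like this (a sketch; the paths, variable names, chunk sizes, and memory limit are placeholders):

import zarr
from rechunker import rechunk

# Placeholder layout: a store with one variable "air" over (time, lat, lon).
source = zarr.open("archive2", mode="r")
target_chunks = {
    "air": (24, 100, 100),  # new chunking for the data variable
    "time": None,           # None = copy the coordinate arrays unchanged
    "lat": None,
    "lon": None,
}
plan = rechunk(source, target_chunks, max_mem="2GB",
               target_store="archive2_rechunked.zarr",
               temp_store="rechunker_tmp.zarr")
plan.execute()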

Do you think it would be possible to expand the functionality of rechunker to also reorder, e.g., a time axis?

This would solve the issue I mentioned above.

Best regards
Daniel


I have a question. I have a combined zarr store named bigzarrno5.zarr; it has the dimensions valid_time, longitude, latitude, and pressure_level, and it contains days 1, 2, 3, 4, and 6. The valid_time coordinate is datetime64, e.g. ‘2020-11-01T00:00:00.000000000’, and each day has 24 hours. Now I have a day5.zarr; it has the same dimensions as bigzarrno5.zarr except for valid_time, because its time runs from 2020-11-05T00:00:00.000000000 to 2020-11-05T23:00:00.000000000. I want to insert day5.zarr into bigzarrno5.zarr without rewriting the whole zarr. Because zarr stores data in chunks, I think I just need to store day5.zarr’s chunks in the proper location and update the metadata; that means I would not need to change the original chunks.
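For context, plain appending with xarray would look like this sketch, but it writes the new chunks at the end, so valid_time would be out of order again (the reordering problem discussed above):

import xarray as xr

day5 = xr.open_dataset("day5.zarr", engine="zarr")

# This writes only day 5's chunks, but appends them AFTER day 6,
# so valid_time becomes non-monotonic again.
day5.to_zarr("bigzarrno5.zarr", append_dim="valid_time")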
Thank you so much