Using to_zarr(region=) and extending the time dimension?

We’ve got a collection of 300,000+ hourly NetCDF files that constitute a 40-year dataset on the USGS HPC system.

We used a SLURM job array to submit a bunch of rechunker jobs, creating a large collection of 6-day-long Zarr datasets.

We would now like to use another SLURM job array to insert these 6-day Zarr datasets into the proper time regions, creating a single 40-year Zarr dataset.

We can read one of the 6-day Zarr datasets, then create a new empty Zarr dataset with compute=False, but the .zarray and .zmetadata files won’t have the right shape along time.

We could then just edit those metadata files and change all the shape values for time, but there must be a cleaner way, right?

If there are no overlaps between chunks and they are all sequential, I think you can do this completely just by cleverly renaming files, without touching Python. Imagine you have many sequential data arrays, one per piece:

```
timestep_000/array1/.zarray
timestep_000/array1/0
timestep_001/array1/.zarray
timestep_001/array1/0
```
You can concatenate these into a single array just by cleverly renaming the chunks so that their indices count up across the whole sequence:

```
mkdir -p combined/array1
mv timestep_000/array1/0 combined/array1/0
mv timestep_001/array1/0 combined/array1/1
```

This can be done with a bash script.

If you only have to do this one time, I see nothing wrong with manually editing the .zarray files. That’s the beauty of Zarr! It’s just json files and binary chunks!
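A concrete, runnable sketch of the trick (the array names, sizes, and values here are made up, and the pieces are written by hand as uncompressed Zarr v2 arrays so the file layout is explicit):

```python
import json
import os
import shutil

import numpy as np

def write_piece(path, values):
    """Write a tiny single-chunk, uncompressed Zarr v2 array by hand."""
    os.makedirs(path, exist_ok=True)
    meta = {
        "zarr_format": 2,
        "shape": [len(values)],
        "chunks": [len(values)],
        "dtype": "<i4",
        "compressor": None,
        "fill_value": 0,
        "order": "C",
        "filters": None,
    }
    with open(os.path.join(path, ".zarray"), "w") as f:
        json.dump(meta, f)
    # With no compressor, a chunk file is just the raw array bytes.
    with open(os.path.join(path, "0"), "wb") as f:
        f.write(np.asarray(values, dtype="<i4").tobytes())

write_piece("timestep_000/array1", [0, 1, 2])
write_piece("timestep_001/array1", [3, 4, 5])

# "Cleverly rename" the chunk files: piece i becomes chunk i of the combined array.
os.makedirs("combined/array1", exist_ok=True)
shutil.move("timestep_000/array1/0", "combined/array1/0")
shutil.move("timestep_001/array1/0", "combined/array1/1")

# Hand-edit the metadata: only the shape changes.
with open("timestep_000/array1/.zarray") as f:
    meta = json.load(f)
meta["shape"] = [6]
with open("combined/array1/.zarray", "w") as f:
    json.dump(meta, f)

# Read the combined store back chunk by chunk.
combined = np.concatenate([
    np.frombuffer(open(f"combined/array1/{i}", "rb").read(), dtype="<i4")
    for i in range(2)
])
print(combined)  # [0 1 2 3 4 5]
```

The two moves here (renaming chunk files and bumping the shape in .zarray) are exactly what a bash version of this would do at scale.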


You can either use region= or extend along a single dimension with to_zarr(), but you can’t do both at the same time.

You can of course edit the Zarr metadata yourself, but personally I like to stick with Xarray for this sort of thing:

  1. Create a lazy Xarray dataset with Dask the size of the entire result. I would typically do this with some combination of indexing, xarray.zeros_like and xarray.concat/expand_dims on a single time slice, like in this example from Xarray-Beam.
  2. Write the Zarr metadata using to_zarr() with compute=False.
  3. Write each chunk using region= from separate processes.

These notes may also be helpful. Beam is just a convenient way to map over many tasks – the same pattern also works for other distributed engines.


What a great community! I love both of these answers. @rabernat’s is cool because it reminds us how we can easily manipulate Zarr datasets by hand, and @shoyer’s is cool because it uses a clean xarray approach and provides an efficient way to create the template dataset we then fill. We ended up going with the second approach for transparency (the first approach would have been faster).

Here’s the resulting notebook we used, for those who are interested in seeing the whole workflow. (The notebook shows the approach working for a few datasets; we are using a pure Python version of this with a SLURM job array to create the whole dataset in parallel.)


If I’m understanding correctly, you have already transformed many NetCDFs into many Zarrs. The big downside of solution 2 from a performance perspective is that you have to read and write all the data again. It sounds like you have a huge dataset, so this might not be desirable. Solution 1 avoids any I/O and just manipulates file names.

I suppose the true best solution is to back up in your workflow to the point where you are creating that “large collection of 6-day-long Zarr datasets” and figure out how to put them directly into the final target array. Rechunker doesn’t currently support that, but it could. A related issue is here:


@rabernat, agreed! The data is already transformed and could just be manipulated into a consolidated dataset using bash. No need to use xarray or dask.

But we will be doing this workflow again, and next time we will do as you suggest, inserting each dataset as it’s created so we can submit a single SLURM script (with a job array) to run the whole workflow.

You can, of course, kerchunk over many Zarrs and not have to edit anything :slight_smile:

@martindurant, I was unaware that we could use kerchunk on many zarrs!
Is there an example of how to do that?

There is no such example workflow, no, but I wrote this to make testing easier.
