Using to_zarr(region=) and extending the time dimension?

We’ve got a collection (300,000+) of hourly NetCDF files that constitute a 40-year-long dataset on the USGS HPC system.

We used a SLURM jobarray to submit a bunch of rechunker jobs, creating a large collection of 6-day-long Zarr datasets.

We would now like to use another SLURM jobarray to insert these 6-day Zarr datasets into the proper time regions, creating a single 40 year Zarr dataset.

We can read one of the 6-day Zarr datasets and then create a new empty Zarr dataset with compute=False, but the .zarray and .zmetadata files wouldn’t have the right shape for time.

We could then just edit those metadata files and change all the shape values for time, but there must be a cleaner way, right?

If there are no overlaps between chunks and they are all sequential, I think you can do this entirely by cleverly renaming files, without touching Python. Imagine you have many sequential data arrays:

timestep_000
    array1/.zarray
    array1/0.0.0.0
timestep_001
    array1/.zarray
    array1/0.0.0.0

You can concatenate these into a single array just by cleverly renaming the chunks:

mv timestep_000/array1/0.0.0.0 combined/array1/0.0.0.0
mv timestep_001/array1/0.0.0.0 combined/array1/1.0.0.0

This can be done with a bash script.
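The same renaming can also be scripted in Python if that’s easier to adapt. A rough sketch only – the timestep_*/array1/combined names follow the layout above, and time is assumed to be the first chunk index:

import shutil
from pathlib import Path

pieces = sorted(Path(".").glob("timestep_*"))
dest = Path("combined/array1")
dest.mkdir(parents=True, exist_ok=True)

offset = 0  # running offset along the time dimension, in chunks
for piece in pieces:
    time_indices = set()
    for chunk in (piece / "array1").glob("[0-9]*"):
        first, rest = chunk.name.split(".", 1)   # "1.0.0.0" -> ("1", "0.0.0")
        time_indices.add(int(first))
        shutil.move(str(chunk), str(dest / f"{int(first) + offset}.{rest}"))
    offset += max(time_indices) + 1

You still need a combined/array1/.zarray whose time shape covers all the pieces, which brings us to editing the metadata by hand.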

If you only have to do this one time, I see nothing wrong with manually editing the .zarray files. That’s the beauty of Zarr! It’s just JSON files and binary chunks!
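For example, if each piece contributes 144 hourly steps (6 days), stretching the time axis of the combined array is just a small JSON edit. The path and counts here are illustrative:

import json
from pathlib import Path

steps_per_piece = 6 * 24        # hourly data, 6-day pieces
n_pieces = 2435                 # illustrative; use however many pieces you have

p = Path("combined/array1/.zarray")
meta = json.loads(p.read_text())
meta["shape"][0] = n_pieces * steps_per_piece   # time assumed to be the first axis
p.write_text(json.dumps(meta, indent=2))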

2 Likes

You can either use region= or extend along a single dimension with to_zarr(), but you can’t do both at the same time.

You can of course edit the Zarr metadata yourself, but personally I like to stick with Xarray for this sort of thing:

  1. Create a lazy, Dask-backed Xarray dataset the size of the entire result. I would typically do this with some combination of indexing, xarray.zeros_like, and xarray.concat/expand_dims on a single time slice, like in this example from Xarray-Beam.
  2. Write the Zarr metadata using to_zarr() with compute=False
  3. Write each chunk using region= from separate processes (a minimal sketch of all three steps follows below).
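A minimal sketch of those three steps – the store names, start date, and hourly/6-day sizes are assumptions standing in for the real workflow:

import pandas as pd
import xarray as xr

# 1. Build a lazy, full-size template from a single 6-day piece.
sample = xr.open_zarr("6day_000.zarr")
n_pieces = 2435                         # however many 6-day pieces exist
steps = sample.sizes["time"]            # 144 for hourly, 6-day pieces
full_time = pd.date_range("1980-10-01", periods=n_pieces * steps, freq="H")

template = (
    xr.zeros_like(sample.isel(time=0, drop=True))
    .expand_dims(time=full_time)
    .chunk({"time": steps})
)

# 2. Write only the metadata (plus any eagerly loaded coordinates) for the full store.
template.to_zarr("combined.zarr", compute=False, consolidated=True)

# 3. In each SLURM task, write one piece into its slot along time.
i = 3                                   # this task's index
piece = xr.open_zarr(f"6day_{i:03d}.zarr")
# Drop variables without a time dimension; they don't belong in a region write.
drop = [v for v in piece.variables if "time" not in piece[v].dims]
piece.drop_vars(drop).to_zarr(
    "combined.zarr",
    region={"time": slice(i * steps, i * steps + piece.sizes["time"])},
)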

These notes may also be helpful. Beam is just a convenient way to map over many tasks – the same pattern also works for other distributed engines.

5 Likes

What a great community! I love both of these answers. @rabernat’s is cool because it reminds us how we can easily manipulate Zarr datasets by hand, and @shoyer’s is cool because it uses a clean xarray approach and provides an efficient way to create the template dataset we then fill. We ended up going with the second approach for transparency (the first approach would have been faster).

Here’s the resulting notebook we used, for those who are interested in seeing the whole workflow (the notebook here shows the approach working for a few datasets – we are using a pure Python version of this with a SLURM jobarray to create the whole dataset in parallel).

3 Likes

If I’m understanding correctly, you have already transformed many NetCDFs into many Zarrs. The big downside of solution 2 from a performance perspective is that you have to read and write all the data again. It sounds like you have a huge dataset, so this might not be desirable. Solution 1 avoids any I/O and just manipulates file names.

I suppose the true best solution is to back up in your workflow to the point where you are creating that “large collection of 6-day-long Zarr datasets” and figure out how to put them directly into the final target array. Rechunker doesn’t currently support that, but it could. A related issue is here:

1 Like

@rabernat, agreed! The data is already transformed and could just be manipulated into a consolidated dataset using bash. No need to use xarray or Dask.

But we will be doing this workflow again, and next time we will be doing as you suggest, inserting each dataset as it’s created so we can submit a single SLURM script (with jobarray) to run the whole workflow.

You can, of course, kerchunk over many Zarrs and not have to edit anything :slight_smile:

@martindurant, I was unaware that we could use kerchunk on many zarrs!
Is there an example of how to do that?

There is no such example workflow, no, but I wrote this to make testing easier.

1 Like

I have this sort of problem currently.

If I understand correctly, @shoyer, append will work if all your variables are the same?

From what I’ve read, I think region= will work with different variables if you have made a skeleton first.

e.g. I have (model, y, x): variables = 1300
(model, y, x): variables = 1090
(model, y, x): variables = 760
etc.

the variables here being analyses of features and model results

whereas an old version just had model results, so the variables were all the same, hence append was fine.

I’ve been thinking a bit about a different variant of extending the time dimension by region writes. I want to grow a dataset that is being measured continuously over the years. I want readers to only read what’s completely written, and I want to be able to retry writing if something fails.

It’s possible to do interesting things by using an in-memory store for the metadata and the real store only for the data chunks: you can virtually resize your zarr in one process, just to be able to write out the chunks where they need to go. Up to that point, readers will ignore the additional chunks and the writer can do whatever is necessary to region-write some new chunks. After everything is finished, one just has to copy over the in-memory metadata objects to the on-disk store to “commit” the write.
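A rough sketch of that staging trick with zarr-python 2.x – SplitStore is a hypothetical helper written for this example, not a zarr API, and the array name and sizes are made up:

import zarr
from collections.abc import MutableMapping

disk = zarr.DirectoryStore("combined.zarr")
meta = zarr.MemoryStore()

# Stage a private copy of the metadata keys in memory.
for key in list(disk):
    if key.rsplit("/", 1)[-1].startswith(".z"):
        meta[key] = disk[key]

class SplitStore(MutableMapping):
    # Hypothetical helper: metadata keys go to the in-memory store, chunk keys to disk.
    def __init__(self, meta, data):
        self.meta, self.data = meta, data
    def _pick(self, key):
        return self.meta if key.rsplit("/", 1)[-1].startswith(".z") else self.data
    def __getitem__(self, key): return self._pick(key)[key]
    def __setitem__(self, key, value): self._pick(key)[key] = value
    def __delitem__(self, key): del self._pick(key)[key]
    def __iter__(self):
        yield from self.meta
        yield from (k for k in self.data if not k.rsplit("/", 1)[-1].startswith(".z"))
    def __len__(self): return sum(1 for _ in self)

store = SplitStore(meta, disk)
arr = zarr.open_array(store=store, path="tas", mode="r+")   # "tas" is illustrative
arr.resize(arr.shape[0] + 144, *arr.shape[1:])   # virtual resize: only the in-memory metadata changes
arr[-144:] = 0.0                                 # new chunks (real data in practice) land directly on disk

# Readers of combined.zarr still see the old shape. To commit, copy the staged
# metadata over the on-disk metadata (and re-consolidate if you use .zmetadata).
for key in list(meta):
    disk[key] = meta[key]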

1 Like