We’ve got a collection of 300,000+ hourly NetCDF files that constitutes a 40-year-long dataset on the USGS HPC system.
We used a SLURM jobarray to submit a bunch of rechunker jobs, creating a large collection of 6-day-long Zarr datasets.
We would now like to use another SLURM jobarray to insert these 6-day Zarr datasets into the proper time regions, creating a single 40 year Zarr dataset.
We can read one of the 6-day Zarr datasets and then create a new empty Zarr dataset with compute=False, but the .zarray and .zmetadata files won’t have the right shape for time.
We could then just edit those metadata files and change all the shape values for time, but there must be a cleaner way, right?
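Roughly, the dead end looks like this (store names are illustrative):

```python
# Illustration of the problem described above: writing one 6-day piece with
# compute=False only creates metadata sized for that piece's own time length,
# not the full 40-year time axis.
import xarray as xr

piece = xr.open_zarr("piece_0000.zarr")   # one 6-day (144-step) dataset

# This writes .zarray/.zmetadata with a 144-step time dimension...
piece.to_zarr("combined.zarr", compute=False, consolidated=True)

# ...so the full ~40-year dataset won't fit without changing the time shape
# in the metadata first.
```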
If there are no overlaps between chunks and they are all sequential, I think you can do this completely by just cleverly renaming files, without touching Python. Imagine you have many sequential data arrays, each in its own 6-day store: the chunk files only need to be copied into the combined store with their time chunk index shifted, and then the time shape fixed in the metadata (see the sketch below).
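For example, a minimal sketch, assuming every piece has identical chunking (24-hour chunks, 6 time chunks per 6-day store), time as the first dimension, and a single illustrative variable:

```python
# Combine sequential 6-day Zarr stores by renaming/copying their chunk files.
# Store names, the variable name, and the chunk sizes are assumptions.
import json
import shutil
from pathlib import Path

stores = sorted(Path("pieces").glob("piece_*.zarr"))   # 6-day stores, in time order
target = Path("combined.zarr")
var = "temperature"
chunks_per_store = 6            # six 24-hour chunks per 6-day store

# Start from a copy of the first piece (its metadata plus its chunks).
shutil.copytree(stores[0], target)

for i, store in enumerate(stores[1:], start=1):
    offset = i * chunks_per_store
    for src in (store / var).iterdir():
        if src.name.startswith("."):                   # skip .zarray / .zattrs
            continue
        t, *rest = src.name.split(".")
        dst = target / var / ".".join([str(int(t) + offset), *rest])  # shift time chunk index
        shutil.copy2(src, dst)

# The copied .zarray still says 144 time steps, so fix the shape by hand.
zarray = target / var / ".zarray"
meta = json.loads(zarray.read_text())
meta["shape"][0] = len(stores) * chunks_per_store * 24   # 24 hourly steps per chunk
zarray.write_text(json.dumps(meta, indent=4))
```

In a real store the same index shift applies to every time-dependent variable (including the time coordinate), and the consolidated .zmetadata needs to be rebuilt afterwards.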
If you only have to do this one time, I see nothing wrong with manually editing the .zarray files. That’s the beauty of Zarr! It’s just json files and binary chunks!
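A rough sketch of that manual edit (the store path and the new time length are illustrative, and the _ARRAY_DIMENSIONS attribute is how xarray records dimension names):

```python
# Bump the time length in each time-dependent variable's .zarray,
# then refresh the consolidated metadata.
import json
from pathlib import Path

import zarr

store = Path("combined.zarr")
new_time_length = 350_640            # illustrative: ~40 years of hourly steps

for var_dir in store.iterdir():
    zattrs, zarray = var_dir / ".zattrs", var_dir / ".zarray"
    if not (zattrs.exists() and zarray.exists()):
        continue
    dims = json.loads(zattrs.read_text()).get("_ARRAY_DIMENSIONS", [])
    if "time" in dims:
        meta = json.loads(zarray.read_text())
        meta["shape"][dims.index("time")] = new_time_length
        zarray.write_text(json.dumps(meta, indent=4))

# Rebuild .zmetadata so consolidated readers see the new shapes.
zarr.consolidate_metadata(str(store))
```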
You can either write into a region with region= or extend along a single dimension with append_dim= in to_zarr(), but you can’t do both at the same time.
You can of course edit the Zarr metadata yourself, but personally I like to stick with Xarray for this sort of thing:
Create a lazy, Dask-backed Xarray dataset the size of the entire result. I would typically do this with some combination of indexing, xarray.zeros_like and xarray.concat/expand_dims on a single time slice, like in this example from Xarray-Beam.
Write the Zarr metadata using to_zarr() with compute=False
Write each chunk using region= from separate processes.
These notes may also be helpful. Beam is just a convenient way to map over many tasks – the same pattern also works for other distributed engines.
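Here’s a rough sketch of those three steps (the piece names, single-variable layout, and 144-step piece length are illustrative assumptions, not the exact code from any real workflow):

```python
# Three-step pattern: build a full-size lazy template, write only its metadata,
# then region-write each piece from a separate task.
import pandas as pd
import xarray as xr

pieces = [f"piece_{i:04d}.zarr" for i in range(10)]   # 6-day Zarr stores, in time order
steps_per_piece = 144                                  # 6 days of hourly data
target = "combined.zarr"

# 1. Build a lazy, full-size template from a single time slice of one piece.
sample = xr.open_zarr(pieces[0])
time = pd.date_range(sample.time.values[0],
                     periods=len(pieces) * steps_per_piece, freq="1H")
template = (
    xr.zeros_like(sample.isel(time=0, drop=True))   # one lazy slice; values don't matter
    .expand_dims(time=time)                          # broadcast it over the full time axis
    .chunk({"time": steps_per_piece})
)

# 2. Write only the metadata for the full-size dataset (no chunk data).
template.to_zarr(target, compute=False, consolidated=True)

# 3. Write each piece into its own time region (one call per process/task).
def write_piece(i):
    # One dask chunk per target zarr chunk along time keeps the region write safe.
    ds = xr.open_zarr(pieces[i]).chunk({"time": steps_per_piece})
    # Variables without a time dimension can't be region-written, so drop them.
    ds = ds.drop_vars([v for v in ds.variables if "time" not in ds[v].dims])
    start = i * steps_per_piece
    ds.to_zarr(target, region={"time": slice(start, start + steps_per_piece)})

write_piece(0)   # in practice, indexed by the SLURM job-array task ID
```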
What a great community! I love both of these answers. @rabernat’s is cool because it reminds us how we can easily manipulate Zarr datasets by hand, and @shoyer’s is cool because it uses a clean xarray approach and provides an efficient way to create the template dataset we then fill. We ended up going with the 2nd approach for transparency (the first approach would have been faster).
Here’s the resulting notebook we used, for those who are interested in seeing the whole workflow. (The notebook shows the approach working for a few datasets; we are using a pure Python version of this with a SLURM jobarray to create the whole dataset in parallel.)
If I’m understanding correctly, you have already transformed many NetCDFs to many Zarrs. The big downside of solution 2 from a performance perspective is that you have to read and write all the data again. It sounds like you have a huge dataset, so this might not be desirable. Solution 1 avoids any I/O and just manipulates file names.
I suppose the true best solution is to back up in your workflow to the point where you are creating that “large collection of 6-day-long Zarr datasets” and figure out how to put them directly into the final target array. Rechunker doesn’t currently support that, but it could. A related issue is here:
@rabernat, agreed! The data is already transformed and could just be manipulated into a consolidated dataset using bash. No need to use xarray or dask.
But we will be doing this workflow again, and next time we will be doing as you suggest, inserting each dataset as it’s created so we can submit a single SLURM script (with jobarray) to run the whole workflow.
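For that combined workflow, one job-array task might look roughly like the sketch below (the file layout, chunk sizes, and target name are assumptions, the rechunking step is simplified to a plain .chunk() call, and the full-size metadata is assumed to have been written already with compute=False):

```python
# One SLURM job-array task: read one 6-day batch of NetCDF files and write it
# straight into its time region of the final store.
import glob
import os

import xarray as xr

task_id = int(os.environ["SLURM_ARRAY_TASK_ID"])
steps_per_piece = 144                                  # 6 days of hourly files
target = "combined.zarr"

files = sorted(glob.glob("netcdf/*.nc"))
batch = files[task_id * steps_per_piece:(task_id + 1) * steps_per_piece]

# One dask chunk per target zarr chunk along time keeps the region write safe.
ds = xr.open_mfdataset(batch, combine="by_coords").chunk({"time": steps_per_piece})

# Variables without a time dimension can't be region-written, so drop them.
ds = ds.drop_vars([v for v in ds.variables if "time" not in ds[v].dims])

start = task_id * steps_per_piece
ds.to_zarr(target, region={"time": slice(start, start + steps_per_piece)})
```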
I’ve been thinking a bit about a different variant of extending the time dimension via region writes. I want to grow a dataset that is being measured continuously over the years. I want readers to only read what’s completely written, and I want to be able to retry writing if something fails.
It’s possible to do interesting things by using an in-memory store for the metadata and the real store only for the data chunks: you can virtually resize your zarr in one process, just to be able to write out the chunks where they need to go. Up to that point, readers will ignore the additional chunks and the writer can do whatever is necessary to region-write some new chunks. After everything is finished, one just has to copy over the in-memory metadata objects to the on-disk store to “commit” the write.
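For example, with zarr-python 2.x (where a store is just a MutableMapping), a small routing store could hold the metadata in memory and send chunk data to the real store. Everything below, the SplitStore class, the store paths, the array name, and the 144-step extension, is a hypothetical sketch, not an existing zarr feature:

```python
# "Virtual resize" trick: metadata lives in memory until commit(), data chunks
# go straight to the real store, so readers of the on-disk store keep seeing
# the old shape until the write is committed.
from collections.abc import MutableMapping

import zarr

METADATA_SUFFIXES = (".zarray", ".zgroup", ".zattrs", ".zmetadata")


class SplitStore(MutableMapping):
    """Keep metadata keys in memory; send chunk keys to the real store."""

    def __init__(self, chunk_store):
        self.chunks = chunk_store
        # Seed the in-memory view with the current on-disk metadata.
        self.meta = {k: chunk_store[k] for k in chunk_store
                     if k.endswith(METADATA_SUFFIXES)}

    def _is_meta(self, key):
        return key.endswith(METADATA_SUFFIXES)

    def __getitem__(self, key):
        return self.meta[key] if self._is_meta(key) else self.chunks[key]

    def __setitem__(self, key, value):
        if self._is_meta(key):
            self.meta[key] = value        # resizes stay virtual
        else:
            self.chunks[key] = value      # data chunks land on disk immediately

    def __delitem__(self, key):
        if self._is_meta(key):
            del self.meta[key]
        else:
            del self.chunks[key]

    def __iter__(self):
        yield from self.meta
        yield from (k for k in self.chunks if not self._is_meta(k))

    def __len__(self):
        return sum(1 for _ in self)

    def commit(self):
        """Copy the virtual metadata to the real store, making the new shape visible."""
        for key, value in self.meta.items():
            self.chunks[key] = value


disk = zarr.DirectoryStore("target.zarr")          # the store readers are using
store = SplitStore(disk)

arr = zarr.open_array(store, mode="r+", path="temperature")
arr.resize(arr.shape[0] + 144, *arr.shape[1:])     # virtual: readers still see the old shape
arr[-144:] = 0.0                                   # stand-in for the real new data
# ...retry the chunk writes as often as needed...
store.commit()                                     # "commit": readers now see the extended array
```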