Extremely slow rechunking of Zarr store with xarray

@rabernat Would it be helpful to re-chunk the whole dataset after I open the Zarr store, to make access to the time series at each (x, y) pair of coordinates more efficient? Or do I have to write the newly re-chunked data to disk before accessing it to take advantage of the re-chunking? (A sketch of what I mean follows below.)
This is my first encounter with Zarr chunking, so I apologize if these are trivial questions.
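For concreteness, here is roughly what I mean by re-chunking after opening (a minimal sketch; the store path and the variable name `precip` are just placeholders):

```python
import xarray as xr

# Open the store lazily; "store.zarr" is a placeholder path
ds = xr.open_zarr("store.zarr")

# Re-chunk the Dask graph in memory (-1 means one chunk
# spanning the whole time dimension)
ds_ts = ds.chunk({"t": -1, "x": 10, "y": 10})

# Pull out one time series; Dask still has to read the original
# on-disk chunks, so does this alone actually make it faster?
series = ds_ts["precip"].isel(x=0, y=0).compute()
```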

A number of other questions that I have:

  • I am not clear on how to determine an optimal chunking for data processing when the original dataset is chunked along the t, x, and y dimensions. Should we just try different chunkings (that is what we have been doing) and see how each performs for the kind of processing we do?
  • Would it help at all to re-chunk the time dimension to its full size? In other words, if our dataset has dimensions t: 11000, x: 800, y: 800, should I re-chunk it with t: 11000, x: 10, y: 10? Or what if I re-chunk x and y to their full dimension sizes, since those are fixed for the whole dataset? (See the sketch after this list.)
  • It would also seem that increasing the chunk size in x and y would improve access times for all x and y values that belong to the same chunk.
  • When re-chunking, why would I want to keep the dataset's original chunk size? If the previous chunks were 128 MB, should the re-chunked chunks also be 128 MB? This relates to the note you made in the epic post, and I don't really understand the reason (to guarantee proximity of the new chunks, perhaps?).
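To make that second question concrete, this is the kind of re-chunk-and-write I have in mind (a minimal sketch; the store names are placeholders, and I am clearing each variable's `chunks` encoding because xarray otherwise tries to reuse the original on-disk chunking when writing):

```python
import xarray as xr

ds = xr.open_zarr("store.zarr")  # placeholder path

# One chunk across all of time, small spatial tiles.
# For float64 data a t: 11000, x: 10, y: 10 chunk is only
# 11000 * 10 * 10 * 8 bytes ≈ 8.8 MB; roughly x: 38, y: 38
# would be needed to get back near 128 MB chunks.
rechunked = ds.chunk({"t": -1, "x": 10, "y": 10})

# Drop the stale per-variable chunk encoding carried over from
# the source store, so to_zarr writes with the new chunking
for name in rechunked.variables:
    rechunked[name].encoding.pop("chunks", None)

rechunked.to_zarr("store_rechunked.zarr", mode="w")
```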

Thank you so much for any clarifications and help!
