High resolution time series; open_zarr question

Hey all,

I have a question about xarray.open_zarr. My understanding is that, for time series data saved in zarr format, xarray reads the .zmetadata and the contents of the time variable when the store is opened and loads those into memory.

My question is: I have very high-density data (second resolution) for 5-6 years. For some of these, that results in a time variable of 3GB+. Is there a way to distribute this initial read to dask workers? Or is there a known way to grab only a smaller subset of the data at open_zarr time?

Thanks in advance. Any help/comments are much appreciated.

Best,
Don

You can disable dask chunking as follows:

import xarray as xr

ds = xr.open_zarr(path_to_store, chunks=False)

Then you can subset and apply chunking manually. However, it sounds like your problem is that your coordinate variable time is too large. Xarray will always read coordinate variables eagerly (i.e. directly into memory) in order to create an index. This is a known issue with xarray.
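For example, the subset-then-chunk step could look something like this (the time range and chunk size are just placeholders):

ds = xr.open_zarr(path_to_store, chunks=False)
# select a smaller window using the in-memory time index, then chunk for dask
subset = ds.sel(time=slice("2019-01-01", "2019-07-01")).chunk({"time": 1_000_000})

Note that this still reads the whole time coordinate into memory, which is exactly the limitation above for a 3GB+ time variable.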

Until that issue is fixed, you have two workaround options.

  • Don’t use xarray. Open the zarr arrays directly using dask. You will lose indexing capabilities and other fancy xarray stuff, but you may be able to do what you need (sketched below).
  • Drop the time coordinate from the dataset before opening in xarray, so that you don’t need to generate an index for time. (You will lose time-indexing functions; also sketched below.)
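A minimal sketch of the first workaround, assuming a store at path_to_store containing a data array named my_var (a placeholder name):

import dask.array as da

# open the zarr array lazily as a dask array; no pandas index is ever built
data = da.from_zarr(path_to_store, component="my_var")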
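And a sketch of the second workaround, again with path_to_store as a placeholder:

import xarray as xr

# skip the time coordinate entirely so xarray never builds an in-memory index
ds = xr.open_zarr(path_to_store, drop_variables=["time"])

Label-based selection like ds.sel(time=...) won’t work on the result, but positional slicing with ds.isel(time=slice(...)) still does.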

Thank you @rabernat. That really helps. I can’t quite get away with a smaller time variable; the sensor that I’m dealing with outputs at a high sampling rate. But I will try your workaround suggestions.
