High resolution time series; open_zarr question

Hey all,

I have a question about xarray.open_zarr. My understanding is that for time series data saved in zarr format, when the store is opened xarray reads the .zmetadata and the contents of the time variable and loads those into memory.

My question is: I have very high density data (one-second resolution) spanning 5-6 years. For some datasets, this can result in a time variable of 3 GB+. Is there a way to distribute this initial read to dask workers? Or is there a known way to grab only a smaller subset of the data at open_zarr time?

Thanks in advance. Any help/comments are much appreciated.

Best,
Don


You can disable dask chunking as follows

ds = xr.open_zarr(path_to_store, chunks=False)

Then you can subset and apply chunking manually. However, it sounds like your problem is that your coordinate variable time is too large. Xarray will always read the coordinate variable eagerly (i.e. directly into memory) in order to create an index. This is a known issue with xarray.
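For example, a minimal sketch of the subset-then-rechunk pattern (the store path, time range, and chunk size below are just placeholders):

import xarray as xr

# open without dask chunking, as above
ds = xr.open_zarr("path/to/store.zarr", chunks=False)

# subset first (e.g. one year of the record), then chunk manually
subset = ds.sel(time=slice("2018-01-01", "2018-12-31"))
subset = subset.chunk({"time": 86_400})  # e.g. one day of 1 Hz samples per chunk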

Until that issue is fixed, you have two workaround options.

  • Don’t use xarray. Open the zarr arrays directly using dask. You will lose indexing capabilities and other fancy xarray stuff, but you may be able to do what you need.
  • Drop the time coordinate from the dataset before opening in xarray, so that you don’t need to generate an index for time. (You will lose time-indexing functions.) Both options are sketched just below.
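Rough sketches of both workarounds, assuming a store at path/to/store.zarr with a data variable called signal (all names and chunk sizes are placeholders):

import dask.array as da
import xarray as xr

# Option 1: skip xarray and open the zarr array directly as a dask array
signal = da.from_zarr("path/to/store.zarr", component="signal")
signal = signal.rechunk(86_400)  # pick a chunk size that suits the analysis

# Option 2: open with xarray but drop the huge time coordinate, so no index is built
ds = xr.open_zarr("path/to/store.zarr", drop_variables=["time"])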

Thank you @rabernat. That really helps. I can’t quite get away with a smaller time variable; the sensor that I’m dealing with outputs at a high sampling rate. But I will try your workaround suggestions.


I don’t know if this would work, but could you create a multi-level index? E.g. index by day, with a fixed index tile for the second of the day, and grid the data onto a second-of-the-day grid. And if you had multiple sensors you could stack them in this tile.

I do something similar when extracting data from drone images into zarr: all the tiles have the same dimensions in meters, and then each tile has a time.

You could do hourly tiles of the data? That would make the master index 3600 times smaller, and depending on how you want to use the data (e.g. "get me an hourly mean!") it might make things work well.
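If it helps, here is a rough sketch of that reshaping idea using a toy 1 Hz series and daily tiles (the variable name and sizes are made up; for a multi-year record you would presumably build this layout chunk-by-chunk and write it back to zarr rather than doing it in memory like this):

import numpy as np
import pandas as pd
import xarray as xr

# toy 1 Hz record standing in for the real sensor data
time = pd.date_range("2020-01-01", periods=3 * 86_400, freq="s")
ds = xr.Dataset({"signal": ("time", np.random.rand(time.size))}, coords={"time": time})

# label each sample with its day and its second of the day ...
day = ds["time"].dt.floor("D")
second = ((ds["time"] - day) / np.timedelta64(1, "s")).astype("int64")

# ... then reshape the 1-D record onto a (day, second) grid
tiled = (
    ds.assign_coords(day=("time", day.data), second=("time", second.data))
    .set_index(time=["day", "second"])
    .unstack("time")
)
# tiled.signal now has dims (day, second); the day index is ~86,400x smaller than time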