High resolution time series; open_zarr question

Hey all,

I have a question about xarray.open_zarr. My understanding is that for time series data saved in zarr format, when the store is opened xarray reads the .zmetadata and the contents of the time variable and loads those into memory.

My question is: I have very high density data (one-second resolution) spanning 5-6 years. For some datasets, this can result in a time variable of 3 GB+. Is there a way to distribute this initial read to dask workers? Or is there a known way to grab only a smaller subset of the data at open_zarr time?

Thanks in advance. Any help/comments are much appreciated.

Best,
Don


You can disable dask chunking as follows

ds = xr.open_zarr(path_to_store, chunks=False)

Then you can subset and apply chunking manually. However, it sounds like your problem is that your coordinate variable time is too large. Xarray will always read the coordinate variable eagerly (i.e. directly into memory) in order to create an index. This is a known issue with xarray.
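For example, a minimal sketch of the subset-then-rechunk pattern (the store path, time range, and chunk size below are just placeholders):

import xarray as xr

# open without dask chunking, as above
ds = xr.open_zarr("path/to/store.zarr", chunks=False)

# subset first (e.g. one year of the record), then chunk manually
subset = ds.sel(time=slice("2018-01-01", "2018-12-31"))
subset = subset.chunk({"time": 86_400})  # e.g. one day of 1 Hz samples per chunk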

Until that issue is fixed, you have two workaround options.

  • Don’t use xarray. Open the zarr arrays directly using dask. You will lose indexing capabilities and other fancy xarray stuff, but you may be able to do what you need.
  • Drop the time coordinate from the dataset before opening in xarray, so that you don’t need to generate an index for time. (You will lose time-indexing functions.) Both options are sketched just below.
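Rough sketches of both workarounds, assuming a store at path/to/store.zarr with a data variable called signal (all names and chunk sizes are placeholders):

import dask.array as da
import xarray as xr

# Option 1: skip xarray and open the zarr array directly as a dask array
signal = da.from_zarr("path/to/store.zarr", component="signal")
signal = signal.rechunk(86_400)  # pick a chunk size that suits the analysis

# Option 2: open with xarray but drop the huge time coordinate, so no index is built
ds = xr.open_zarr("path/to/store.zarr", drop_variables=["time"])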

Thank you @rabernat. That really helps. I can’t quite get away with a smaller time variable; the sensor that I’m dealing with outputs at a high sampling rate. But I will try your workaround suggestions.


I don’t know if this would work, but could you create a multi-level index? E.g. index by day, with a fixed index tile for the second of the day, and grid the data onto a second-of-the-day grid. And if you had multiple sensors you could stack them in this tile.

I do something similar when extracting data from drone images into zarr: all the tiles have the same dimensions in meters, and then each tile has a time.

You could do hourly tiles of the data? That would make the master index 3600 times smaller, and depending on how you want to use the data (e.g. "get me an hourly mean!") it might make things work well.
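If it helps, here is a rough sketch of that reshaping idea using a toy 1 Hz series and daily tiles (the variable name and sizes are made up; for a multi-year record you would presumably build this layout chunk-by-chunk and write it back to zarr rather than doing it in memory like this):

import numpy as np
import pandas as pd
import xarray as xr

# toy 1 Hz record standing in for the real sensor data
time = pd.date_range("2020-01-01", periods=3 * 86_400, freq="s")
ds = xr.Dataset({"signal": ("time", np.random.rand(time.size))}, coords={"time": time})

# label each sample with its day and its second of the day ...
day = ds["time"].dt.floor("D")
second = ((ds["time"] - day) / np.timedelta64(1, "s")).astype("int64")

# ... then reshape the 1-D record onto a (day, second) grid
tiled = (
    ds.assign_coords(day=("time", day.data), second=("time", second.data))
    .set_index(time=["day", "second"])
    .unstack("time")
)
# tiled.signal now has dims (day, second); the day index is ~86,400x smaller than time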