Read multiple TIFF images using Zarr

Hi friends,

I have multiple TIFF images, one for every 5 minutes. Each has dimensions 4400 x 4400 and holds GHI values stored in a variable.
I need to create a Python API that returns historical GHI values for a given latitude and longitude. The data I need covers 1 day, 1 week, or 1 month, whichever the user selects.
What would be the best approach for this?
Currently I am using Zarr, but reading the data takes a long time: getting 1 month of data takes around 2 minutes. The data is stored in S3 and the script runs on an EC2 server.

I'm guessing the poor performance when reading a time series is because the data was stored in Zarr using the same chunking scheme as the original data (time=1, x=4400, y=4400)?

To allow time series extraction in a reasonable length of time, you would want something like (time=144, x=400, y=400), which for 32-bit floats or integers gives chunks of roughly 100 MB (about 92 MB uncompressed).
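As a quick check on that size estimate, here is a minimal sketch in plain Python. The helper name `chunk_nbytes` is made up for illustration, and the calculation assumes 4-byte values and ignores compression, so actual object sizes in S3 would likely be smaller.

```python
from math import prod

def chunk_nbytes(chunk_shape, itemsize=4):
    """Uncompressed size in bytes of one chunk (itemsize=4 assumes 32-bit values)."""
    return prod(chunk_shape) * itemsize

original = chunk_nbytes((1, 4400, 4400))   # one full image per chunk
proposed = chunk_nbytes((144, 400, 400))   # chunk shape suggested above

print(f"original chunk: {original / 1e6:.1f} MB")   # 77.4 MB
print(f"proposed chunk: {proposed / 1e6:.1f} MB")   # 92.2 MB
```

Both layouts have chunks of similar size; what changes is how many of them a single-pixel time series has to touch.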

If you used this chunking scheme for 2 years of hourly data, users who want to read a time series at a specified x,y location would read about the same number of chunks as a user who wants to read the entire x,y field at a specified time:

(4400*4400)/(400*400) = 121   
2*(365.25*24)/144 = 121.75

With a cluster of 30 workers, the read times would be a few seconds for each. Does this make sense?
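The chunk counts above can be reproduced with a short sketch. The function `chunks_touched` is a hypothetical helper, not part of Zarr. The last part applies the same counting to the 5-minute data from the question, assuming a 30-day month (288 images per day) and a query that happens to align with chunk boundaries.

```python
from math import ceil

def chunks_touched(array_shape, chunk_shape, selection):
    """Count the chunks a read touches.

    selection has one entry per axis: None means "the whole axis",
    an int means "a single index", which always falls in one chunk.
    """
    count = 1
    for size, chunk, sel in zip(array_shape, chunk_shape, selection):
        if sel is None:
            count *= ceil(size / chunk)
    return count

chunks = (144, 400, 400)

# Two years of hourly data, as in the estimate above.
hourly = (round(2 * 365.25 * 24), 4400, 4400)
series_reads = chunks_touched(hourly, chunks, (None, 0, 0))   # time series at one pixel
field_reads = chunks_touched(hourly, chunks, (0, None, None)) # full field at one time
print(series_reads, field_reads)   # 122 121

# Assumption: one 30-day month of 5-minute images (288 per day).
month = (30 * 288, 4400, 4400)
month_rechunked = chunks_touched(month, chunks, (None, 0, 0))
month_original = chunks_touched(month, (1, 4400, 4400), (None, 0, 0))
print(month_rechunked, month_original)   # 60 8640
```

Under those assumptions, a one-month query at a single pixel goes from 8640 chunk reads with the original layout to 60 with the rechunked one, which is likely where most of the two minutes is going.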
