Hi friends,
I have multiple TIFF images, one every 5 minutes, each with dimensions 4400 x 4400, with GHI values stored in a variable.
I need to create a Python API that returns historical GHI data for a given latitude and longitude, covering 1 day, 1 week, or 1 month, whichever the user selects.
What would be the best approach for this?
Currently I am using Zarr, but reading the data takes a long time: fetching 1 month of data takes around 2 minutes. The data is stored in S3 and the script runs on an EC2 instance.
I'm guessing the poor time-series read performance is because the data was stored in Zarr using the same chunking scheme as the original data (time=1, x=4400, y=4400)?
To allow time-series extraction in a reasonable length of time, you would want something like (time=144, x=400, y=400), which for 32-bit floats or integers works out to roughly 100 MB chunks.
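For reference, the chunk-size arithmetic works out like this (a quick sketch; the 144/400/400 figures are just the suggested chunk shape from above, not anything fixed by the data):

```python
# Suggested chunk shape for time-series-friendly access.
time_chunk, x_chunk, y_chunk = 144, 400, 400
bytes_per_value = 4  # float32 or int32

# Total bytes in one chunk.
chunk_bytes = time_chunk * x_chunk * y_chunk * bytes_per_value
chunk_mb = chunk_bytes / 1e6
print(f"{chunk_mb:.1f} MB per chunk")  # ~92 MB, i.e. roughly the 100 MB target
```

Chunks in the 50–200 MB range are a common rule of thumb for Zarr on object storage: large enough to amortize S3 request latency, small enough to parallelize across workers.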
If you used this chunking scheme for 2 years of hourly data, users who want to read a time series at a specified x,y location would read about the same number of chunks as a user who wants to read the entire x,y field at a specified time:
(4400*4400)/(400*400) = 121
2*(365.25*24)/144 = 121.75
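Spelled out in code (same numbers as above, assuming the 4400 x 4400 grid and the suggested 144 x 400 x 400 chunks):

```python
# Chunks touched by reading the full x,y field at one time step:
nx, ny = 4400, 4400        # full grid
cx, cy = 400, 400          # spatial chunk shape
spatial_chunks = (nx * ny) / (cx * cy)

# Chunks touched by reading a 2-year hourly time series at one (x, y):
n_times = 2 * 365.25 * 24  # hourly steps in two years
time_chunk = 144           # temporal chunk length
time_chunks = n_times / time_chunk

print(spatial_chunks, time_chunks)  # 121.0 121.75
```

So both access patterns hit roughly the same number of chunks, which is the point of the balanced chunking scheme.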
With a cluster of 30 workers, the read times would be a few seconds for each. Does this make sense?