I’m guessing the poor time-series read performance is because the data was stored in Zarr using the same chunking scheme as the original data (time=1, x=4400, y=4400)?
To allow time-series extraction in a reasonable length of time, you would want something like (time=144, x=400, y=400), which for 32-bit floats or integers works out to about 92 MB per chunk (144 × 400 × 400 × 4 bytes).
If you used this chunking scheme for 2 years of hourly data, users who want to read a time series at a specified x,y location would read about the same number of chunks as a user who wants to read the entire x,y field at a specified time:
(4400*4400)/(400*400) = 121
2*(365.25*24)/144 = 121.75
With a cluster of 30 workers, the read times should be a few seconds in either case. Does this make sense?
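The arithmetic above can be checked with a few lines of Python (4-byte values and the chunk shape suggested earlier are assumed):

```python
# Verify the chunk arithmetic: chunk size in MB, and the number of
# chunks touched by each of the two access patterns.
bytes_per_value = 4  # float32 or int32

# Size of one (time=144, x=400, y=400) chunk.
chunk_mb = 144 * 400 * 400 * bytes_per_value / 1e6

# Chunks read for one full x,y field at a single time.
spatial_chunks = (4400 * 4400) // (400 * 400)

# Chunks read for a full time series (2 years hourly) at one x,y point.
time_chunks = 2 * (365.25 * 24) / 144

print(f"{chunk_mb:.2f} MB, {spatial_chunks} chunks, {time_chunks:.2f} chunks")
# 92.16 MB, 121 chunks, 121.75 chunks
```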