Hi everyone,
I'm working on a project with satellite data that spans multiple timestamps, and I'd love some advice on structuring it efficiently.
The data
For each variable (satellite channel), I have several timestamps.
For each timestamp, I have multiple 256×256 samples, each from a different region.
The number of samples varies across timestamps, and I don’t need georeferencing information.
A naive representation of the data could look like this:
[variables, timestamps, n_samples, 256, 256]
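For concreteness, here is a minimal sketch of that naive layout as a single zarr array (the array name and sizes are placeholders, and the ragged sample dimension would have to be padded up to some max_samples, since zarr arrays are rectangular):

```python
import zarr

# Placeholder sizes -- the real values would come from the dataset.
n_vars, n_times, max_samples = 4, 100, 500

# Dense 5-D layout; timestamps with fewer samples would need padding
# (or a fill value) up to max_samples.
arr = zarr.open(
    "naive_layout.zarr",
    mode="w",
    shape=(n_vars, n_times, max_samples, 256, 256),
    dtype="float32",
)
```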
The challenge
After selecting the variables I want, I need to support two main use cases:
- Index by time - straightforward with a time axis as partitioned above.
- Iterate through samples over a time range - to create a DataLoader with a fixed batch size (e.g., 32).
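To make the two use cases concrete, here is a rough sketch against that layout (`arr` is the array from the sketch above, `var_idx` a pre-selected channel; both helpers are hypothetical):

```python
# 1) Index by time: one index on the time axis gives all samples for that step.
def samples_at(arr, var_idx, t_idx):
    return arr[var_idx, t_idx]               # -> (n_samples, 256, 256)

# 2) Iterate over a time range in fixed-size batches of 32.
def iter_batches(arr, var_idx, t_start, t_stop, batch_size=32):
    for t in range(t_start, t_stop):
        block = arr[var_idx, t]               # (n_samples_t, 256, 256)
        for i in range(0, block.shape[0], batch_size):
            yield block[i:i + batch_size]     # last batch per timestamp may be short
```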
I'm using .zarr files with chunking for efficient retrieval. My initial idea was to use chunking like:
[1, 1, 32, 256, 256]
so each chunk matches a batch of 32 samples. But because samples are spread across different timestamps, batching becomes messy:
- Some batches won’t have 32 samples.
- I'd need to "spill over" into the next timestamp and then adjust the following batch accordingly.
- This complicates __getitem__ and slows things down, especially during model training.
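To illustrate what that spill-over bookkeeping looks like, here is a sketch of the index math a __getitem__ would have to do, assuming a per-timestamp sample-count array is available (all names and counts below are made up):

```python
import numpy as np

counts = np.array([37, 12, 51, 32])                  # samples per timestamp
offsets = np.concatenate([[0], np.cumsum(counts)])   # global sample boundaries

def batch_pieces(batch_idx, batch_size=32):
    """Map a flat batch index to the (timestamp, local slice) reads it needs."""
    start = batch_idx * batch_size
    stop = min(start + batch_size, int(offsets[-1]))
    pieces = []
    t = int(np.searchsorted(offsets, start, side="right")) - 1
    while start < stop:
        t_stop = min(stop, int(offsets[t + 1]))
        pieces.append((t, slice(start - int(offsets[t]), t_stop - int(offsets[t]))))
        start = t_stop
        t += 1
    return pieces

# Batch 1 already needs pieces of three different timestamps:
# [(0, slice(32, 37, None)), (1, slice(0, 12, None)), (2, slice(0, 15, None))]
print(batch_pieces(1))
```

Each of those pieces is a separate read against the chunked array, which is where the extra complexity and slowdown come from.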
Possible solutions
One idea would be to flatten the [timestamps, samples] axes into a single dimension:
[variables, timestamps*n_samples, 256, 256]
This makes batching easy (I would just step through the 2nd axis in chunks of 32).
But the downsides are:
- I lose the explicit timestamp axis.
- I'd need a separate array to store timestamp info per sample.
- Indexing by a time range would require searching this array first (O(log n) if the array is kept sorted, e.g., via binary search).
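If it helps, here is a sketch of the flattened variant with a sorted per-sample time index stored alongside the data (array names, sizes, and timestamp values are purely illustrative):

```python
import numpy as np
import zarr

# Flattened layout: [variables, total_samples, 256, 256], chunked per batch of 32.
data = zarr.open(
    "flat_layout.zarr",
    mode="w",
    shape=(4, 132, 256, 256),
    chunks=(1, 32, 256, 256),
    dtype="float32",
)

# One timestamp per sample; sorted because samples are appended timestamp by
# timestamp (values are arbitrary here, e.g. seconds since some epoch).
sample_times = np.repeat(np.array([1000, 2000, 3000, 4000]), [37, 12, 51, 32])

def time_range_slice(t_min, t_max):
    """O(log n) lookup of the contiguous sample range with t_min <= t < t_max."""
    lo = np.searchsorted(sample_times, t_min, side="left")
    hi = np.searchsorted(sample_times, t_max, side="left")
    return slice(lo, hi)

sel = time_range_slice(2000, 4000)    # -> slice(37, 100): 2nd and 3rd timestamps
subset = data[0, sel]                 # then step through it in batches of 32
```

Since the time index is tiny compared to the image data, it could live in memory as a plain NumPy array (or a second small zarr array), so losing the explicit timestamp axis mostly costs one searchsorted per query.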
My question
Is there a better way to handle this problem?
If not, which of the two approaches would you recommend, given the need to:
- Index efficiently by timestamp or time range.
- Iterate smoothly with an arbitrary batch size.
Thank you in advance for any suggestions!