Best way to structure satellite data in Zarr for time-based indexing and batching with DataLoader

Hi everyone,

I’m working on a project with satellite data that spans multiple timestamps, and I’d love some advice on structuring it efficiently.

The data

For each variable (satellite channel), I have several timestamps.

For each timestamp, I have multiple 256×256 samples, each from a different region.

The number of samples varies across timestamps, and I don’t need georeferencing information.

A naive representation of the data could look like this:

[variables, timestamps, n_samples, 256, 256]

The challenge

After selecting the variables I want, I need to support two main use cases:

  1. Index by time - straightforward with a time axis as partitioned above.
  2. Iterate through samples over a time range - to create a DataLoader with a fixed batch size (e.g., 32).

I’m using .zarr files with chunking for efficient retrieval. My initial idea was to use chunking like:

[1, 1, 32, 256, 256]

so each chunk matches a batch of 32 samples (a rough sketch of this layout is below, after the list). But because samples are spread across different timestamps, batching becomes messy:

  • Some batches won’t have 32 samples.
  • I’d need to “spill over” into the next timestamp and then adjust the next batch accordingly.
  • This complicates the __getitem__ and slows things down, especially during model training.
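
For concreteness, here is a minimal sketch of that naive chunked layout (the store path, dtype, and a fixed maximum number of samples per timestamp are just placeholders; in reality the sample count varies):

```python
import numpy as np
import zarr

# Hypothetical sizes: 4 channels, 100 timestamps, at most 50 samples per timestamp.
n_vars, n_times, max_samples = 4, 100, 50

data = zarr.open(
    "satellite_naive.zarr",
    mode="w",
    shape=(n_vars, n_times, max_samples, 256, 256),
    chunks=(1, 1, 32, 256, 256),    # one chunk ~= one batch of 32 samples
    dtype="float32",
    fill_value=np.nan,              # pad timestamps that have fewer samples
)

# Use case 1 (index by time) is straightforward:
first_batch = data[0, 10, :32]      # variable 0, timestamp 10, first 32 samples
# ...but the last batch of each timestamp may hold fewer than 32 real samples,
# which is where the spill-over logic in __getitem__ gets messy.
```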

Possible solutions

One idea would be to flatten the [timestamps, samples] axes into a single dimension:

[variables, timestamps*n_samples, 256, 256]

This makes batching easy (I would just step through the 2nd axis in chunks of 32).
But the downsides are:

  • I lose the explicit timestamp axis.
  • I’d need a separate array to store timestamp info per sample.
  • Indexing by a time range would require searching this array first (O(log n) via binary search if the array is kept sorted; see the sketch after this list).
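
To make the flattened idea concrete, here is a rough sketch of what I have in mind, with a companion sorted timestamp array and a binary search for time-range selection (store path, array names, and sizes are made up):

```python
import numpy as np
import zarr

root = zarr.open("satellite_flat.zarr", mode="w")

# Flattened layout: [variables, total_samples, 256, 256], samples sorted by time.
n_vars, total_samples = 4, 5000
data = root.create_dataset(
    "samples",
    shape=(n_vars, total_samples, 256, 256),
    chunks=(1, 32, 256, 256),           # one chunk = one batch of 32 samples
    dtype="float32",
)

# Companion 1-D array: the timestamp of each sample (Unix seconds, ascending).
sample_time = root.create_dataset("sample_time", shape=(total_samples,), dtype="int64")

def time_range_slice(t_start, t_end):
    """Slice of the sample axis covering [t_start, t_end); O(log n) via binary search."""
    times = sample_time[:]              # small 1-D array, cheap to load once
    lo = np.searchsorted(times, t_start, side="left")
    hi = np.searchsorted(times, t_end, side="left")
    return slice(lo, hi)

# Batching: step through the sample axis in fixed steps of 32.
sel = time_range_slice(1_700_000_000, 1_700_086_400)
for start in range(sel.start, sel.stop, 32):
    batch = data[0, start:min(start + 32, sel.stop)]   # shape (<=32, 256, 256)
```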

My question

Is there a better way to handle this problem?
If not, which of the two approaches would you recommend, given the need to:

  • Index efficiently by timestamp or time range.
  • Iterate smoothly with an arbitrary batch size.

Thank you in advance for any suggestions!


Assuming the data is stored in some cloud storage such as S3, I’m not sure how good a fit Zarr is for what you need. I’m not saying it’s a bad fit, I’m just not sure.

It feels like your dataset is already chunked in very small pieces (1/4 MB uncompressed) and you’re also clearly stating that you’re interested in batching, not chunking, which makes perfect sense.

You mentioned that the samples are different regions. Are they lacking georeferencing, or are you simply not interested in it? Is it possible to combine those samples geographically into one array? If so, how big would that array be?

Assuming it’s not possible to do batching along the x and y axes, here are some options to consider:

  1. Store the pre-batched samples as S3 objects under a structured path that contains the data variable and timestamp, and wrap querying and retrieval in Python (maybe using a Unix timestamp in the key for easy time-range querying; a rough sketch follows this list)
  2. Use a catalog (DynamoDB, or a relational database on RDS) that’s properly partitioned and indexed
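
Roughly what I had in mind for option 1, if it helps (the bucket name, key scheme, and serialization are all just assumptions):

```python
import io
import boto3
import numpy as np

s3 = boto3.client("s3")
BUCKET = "my-satellite-batches"              # hypothetical bucket name

def put_batch(var, unix_ts, batch_idx, batch):
    """Store one pre-batched (32, 256, 256) array as a single S3 object."""
    buf = io.BytesIO()
    np.save(buf, batch)
    # Zero-padded Unix timestamp so keys sort lexicographically by time.
    key = f"{var}/{unix_ts:010d}/{batch_idx:04d}.npy"
    s3.put_object(Bucket=BUCKET, Key=key, Body=buf.getvalue())

def list_batches(var, t_start, t_end):
    """Return the keys of all batches for `var` with t_start <= timestamp < t_end."""
    keys = []
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=BUCKET, Prefix=f"{var}/"):
        for obj in page.get("Contents", []):
            ts = int(obj["Key"].split("/")[1])
            if t_start <= ts < t_end:
                keys.append(obj["Key"])
    return keys
```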

If geographical merging is possible, build one Zarr (that you append to) using Xarray: one DataArray per variable, with [time, x, y] dims in each DataArray. Start with [1, 500, 500] chunking, which will result in ~1 MB chunks uncompressed (assuming 4-byte values), and work your way up to a good chunk size (the general rule of thumb is tens of MB, but I think 1 MB is not bad; it really depends on the use case).
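
As a rough illustration of that last approach (channel names, store path, sizes, and random data are assumptions, and the chunking step needs dask installed), appending one timestamp at a time might look like:

```python
import numpy as np
import pandas as pd
import xarray as xr

CHANNELS = ["vis006", "ir108"]              # hypothetical channel names

def make_timestep(timestamp, nx=2000, ny=2000):
    """Build a single-timestep Dataset with one DataArray per variable."""
    data_vars = {
        ch: (("time", "x", "y"), np.random.rand(1, nx, ny).astype("float32"))
        for ch in CHANNELS
    }
    ds = xr.Dataset(data_vars, coords={"time": [pd.Timestamp(timestamp)]})
    return ds.chunk({"time": 1, "x": 500, "y": 500})   # ~1 MB chunks at float32

store = "merged.zarr"

# Write the first timestep, then append the rest along the time dimension.
make_timestep("2024-01-01T00:00").to_zarr(store, mode="w")
for ts in ["2024-01-01T00:15", "2024-01-01T00:30"]:
    make_timestep(ts).to_zarr(store, mode="a", append_dim="time")

# Time-range selection (and batching over it) then comes almost for free:
ds = xr.open_zarr(store)
subset = ds.sel(time=slice("2024-01-01T00:00", "2024-01-01T00:30"))
```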