Hi everyone,
I'm working on a project with satellite data that spans multiple timestamps, and I'd love some advice on structuring it efficiently.
The data
For each variable (satellite channel), I have several timestamps.
For each timestamp, I have multiple 256×256 samples, each from a different region.
The number of samples varies across timestamps, and I don’t need georeferencing information.
A naive representation of the data could look like this:
[variables, timestamps, n_samples, 256, 256]
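For concreteness, here is a minimal sketch of that naive layout as a single zarr array (the array name and sizes are placeholders, and the ragged sample dimension would have to be padded up to some max_samples, since zarr arrays are rectangular):

```python
import zarr

# Placeholder sizes -- the real values would come from the dataset.
n_vars, n_times, max_samples = 4, 100, 500

# Dense 5-D layout; timestamps with fewer samples would need padding
# (or a fill value) up to max_samples.
arr = zarr.open(
    "naive_layout.zarr",
    mode="w",
    shape=(n_vars, n_times, max_samples, 256, 256),
    dtype="float32",
)
```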
The challenge
After selecting the variables I want, I need to support two main use cases:
- Index by time - straightforward with a time axis as partitioned above.
- Iterate through samples over a time range - to create a DataLoader with a fixed batch size (e.g., 32).
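To make the two use cases concrete, here is a rough sketch against that layout (`arr` is the array from the sketch above, `var_idx` a pre-selected channel; both helpers are hypothetical):

```python
# 1) Index by time: one index on the time axis gives all samples for that step.
def samples_at(arr, var_idx, t_idx):
    return arr[var_idx, t_idx]               # -> (n_samples, 256, 256)

# 2) Iterate over a time range in fixed-size batches of 32.
def iter_batches(arr, var_idx, t_start, t_stop, batch_size=32):
    for t in range(t_start, t_stop):
        block = arr[var_idx, t]               # (n_samples_t, 256, 256)
        for i in range(0, block.shape[0], batch_size):
            yield block[i:i + batch_size]     # last batch per timestamp may be short
```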
I'm using .zarr files with chunking for efficient retrieval. My initial idea was to use chunking like:
[1, 1, 32, 256, 256]
so each chunk matches a batch of 32 samples. But because samples are spread across different timestamps, batching becomes messy:
- Some batches won’t have 32 samples.
- I'd need to "spill over" into the next timestamp and then adjust the following batch accordingly.
- This complicates __getitem__ and slows things down, especially during model training.
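To illustrate what that spill-over bookkeeping looks like, here is a sketch of the index math a __getitem__ would have to do, assuming a per-timestamp sample-count array is available (all names and counts below are made up):

```python
import numpy as np

counts = np.array([37, 12, 51, 32])                  # samples per timestamp
offsets = np.concatenate([[0], np.cumsum(counts)])   # global sample boundaries

def batch_pieces(batch_idx, batch_size=32):
    """Map a flat batch index to the (timestamp, local slice) reads it needs."""
    start = batch_idx * batch_size
    stop = min(start + batch_size, int(offsets[-1]))
    pieces = []
    t = int(np.searchsorted(offsets, start, side="right")) - 1
    while start < stop:
        t_stop = min(stop, int(offsets[t + 1]))
        pieces.append((t, slice(start - int(offsets[t]), t_stop - int(offsets[t]))))
        start = t_stop
        t += 1
    return pieces

# Batch 1 already needs pieces of three different timestamps:
# [(0, slice(32, 37, None)), (1, slice(0, 12, None)), (2, slice(0, 15, None))]
print(batch_pieces(1))
```

Each of those pieces is a separate read against the chunked array, which is where the extra complexity and slowdown come from.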
Possible solutions
One idea would be to flatten the [timestamps, samples] axes into a single dimension:
[variables, timestamps*n_samples, 256, 256]
This makes batching easy (I would just step through the 2nd axis in chunks of 32).
But the downsides are:
- I lose the explicit timestamp axis.
- I'd need a separate array to store timestamp info per sample.
- Indexing by a time range would require searching this array first (O(log n) if the array is kept sorted, e.g., via binary search).
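If it helps, here is a sketch of the flattened variant with a sorted per-sample time index stored alongside the data (array names, sizes, and timestamp values are purely illustrative):

```python
import numpy as np
import zarr

# Flattened layout: [variables, total_samples, 256, 256], chunked per batch of 32.
data = zarr.open(
    "flat_layout.zarr",
    mode="w",
    shape=(4, 132, 256, 256),
    chunks=(1, 32, 256, 256),
    dtype="float32",
)

# One timestamp per sample; sorted because samples are appended timestamp by
# timestamp (values are arbitrary here, e.g. seconds since some epoch).
sample_times = np.repeat(np.array([1000, 2000, 3000, 4000]), [37, 12, 51, 32])

def time_range_slice(t_min, t_max):
    """O(log n) lookup of the contiguous sample range with t_min <= t < t_max."""
    lo = np.searchsorted(sample_times, t_min, side="left")
    hi = np.searchsorted(sample_times, t_max, side="left")
    return slice(lo, hi)

sel = time_range_slice(2000, 4000)    # -> slice(37, 100): 2nd and 3rd timestamps
subset = data[0, sel]                 # then step through it in batches of 32
```

Since the time index is tiny compared to the image data, it could live in memory as a plain NumPy array (or a second small zarr array), so losing the explicit timestamp axis mostly costs one searchsorted per query.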
My question
Is there a better way to handle this problem?
If not, which of the two approaches would you recommend, given the need to:
- Index efficiently by timestamp or time range.
- Iterate smoothly with an arbitrary batch size.
Thank you in advance for any suggestions!