DL Training Dataset - to Zarr or not to Zarr?

Hello, first time contributor here. My hope is that someone with more experience can sanity check the approach I’ve cooked up so far to distribute training data.

Context

I have a bunch of manually labelled data. The labels segment full Sentinel 2 pixel trajectories into periods of forest state (think: healthy, clear cut, revegetation, bark beetle, etc.).

The training data I am now collecting also takes context into account, so for each sample I have a stack of Sentinel 2 chips for the full labelled time-series.

Data structure

The data structure I have right now is separate zarrs per sample with these parameters:

  • dimensions: 128x128xtime
  • dtype: uint16
  • groups: Sentinel 2 bands
  • chunks: 128x128x32 (should yield chunks of ~1MB size)
  • Total number of samples ~4000, size on disk of around 150GB
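As a quick check on the "~1MB" figure, the chunk size follows directly from the shape and dtype (plain arithmetic, no Zarr needed):

```python
# Verify the "~1 MB per chunk" estimate for uint16 chunks of 128x128x32.
ITEMSIZE = 2  # bytes per uint16 element

def chunk_nbytes(shape, itemsize=ITEMSIZE):
    """Size in bytes of one chunk with the given shape."""
    n = 1
    for dim in shape:
        n *= dim
    return n * itemsize

per_chunk = chunk_nbytes((128, 128, 32))
print(per_chunk)          # 1048576 bytes
print(per_chunk / 2**20)  # 1.0 -- exactly 1 MiB per band
```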

I also have a parquet file with labels and other sampling metadata, which I will also use to construct train/test splits.

Question

In the end I’m looking to construct a PyTorch dataloader like torchgeo’s BigEarthNet dataset (torchgeo/datasets/bigearthnet.py in microsoft/torchgeo on GitHub). That dataset just uses TIFF files and does not provide time series, so it is not exactly like my case.
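For concreteness, here is a minimal sketch of the map-style dataset I have in mind. To keep the snippet dependency-free, the actual Zarr read is injected as a callable (in practice something like `lambda p: xr.open_zarr(p)[bands].to_array().values`); `sample_paths`, `labels`, and `load_sample` are illustrative names, not an existing API. Subclassing `torch.utils.data.Dataset` only requires `__len__` and `__getitem__`, so the structure carries over directly:

```python
class ChipTimeSeriesDataset:
    """Map-style dataset: one Zarr store per sample.

    In a real pipeline this would subclass torch.utils.data.Dataset;
    only __len__ and __getitem__ are needed for that.
    """

    def __init__(self, sample_paths, labels, load_sample):
        assert len(sample_paths) == len(labels)
        self.sample_paths = sample_paths
        self.labels = labels
        self.load_sample = load_sample  # path -> array (bands, time, y, x)

    def __len__(self):
        return len(self.sample_paths)

    def __getitem__(self, idx):
        x = self.load_sample(self.sample_paths[idx])
        return x, self.labels[idx]

# Usage with a stand-in loader (no Zarr involved):
fake_loader = lambda path: [[0.0]]  # pretend array
ds = ChipTimeSeriesDataset(["a.zarr", "b.zarr"], [1, 0], fake_loader)
print(len(ds))  # 2
x, y = ds[0]
print(y)        # 1
```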

I’ve also seen the Earthmover blog post “Cloud native data loaders for machine learning using Zarr and Xarray”, which uses xbatcher to efficiently query a Zarr store for a deep learning pipeline. However, that approach uses a single Zarr store, while I have 1000s of small ones.

My main issue now is how to handle the chunking. Smaller chunks would make the I/O much slower. But as it stands, once a chunk is loaded it should be passed through the training pipeline in full to make the load worthwhile. A single chunk is highly autocorrelated, so to get good variance within a batch, batches would need to be quite large if full chunks are consumed.
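One common way to square that circle is a shuffle buffer between chunk loading and batching: each chunk is still fully consumed, but its samples get interleaved with samples from many other chunks before batches are formed. A minimal sketch (the chunk-id tuples are just for illustration):

```python
import random

def shuffle_buffer(iterable, buffer_size, seed=0):
    """Yield items from `iterable` in approximately random order.

    Fills a buffer of `buffer_size` items, then repeatedly yields a
    random buffer slot and refills it with the next incoming item.
    Every item is yielded exactly once, so each loaded chunk is still
    fully used -- but its samples end up spread across many batches.
    """
    rng = random.Random(seed)
    buf = []
    for item in iterable:
        if len(buf) < buffer_size:
            buf.append(item)
        else:
            j = rng.randrange(buffer_size)
            yield buf[j]
            buf[j] = item
    rng.shuffle(buf)
    yield from buf

# Samples arrive chunk-by-chunk; tuples encode (chunk, sample-in-chunk):
stream = [(c, i) for c in range(4) for i in range(8)]  # 4 chunks x 8 samples
mixed = list(shuffle_buffer(stream, buffer_size=16))
print(sorted(mixed) == sorted(stream))  # True: nothing lost or duplicated
```

The larger the buffer, the better the decorrelation, at the cost of holding more decoded samples in memory.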

Has someone done something like this before? Is using zarr in this case overkill?

Hello and welcome @jonasViehweger! :waving_hand:

Could you clarify what constitutes an individual sample for your data? How many variables / bands per sample? Does the sample itself contain multiple time steps? How are time and sample related?

This is important because you want to optimize for loading individual samples as quickly as possible but also for random access to samples.

Hey @rabernat,

good question. The thing is that I am planning to provide a few different datasets with different features. I am providing Sentinel 2 bands B2 to B8A as well as B11, B12 and SCL. One thing I would like to keep is that bands should be selectable.

Here’s some datasets I thought of, going from simple to more complex. To put dimensions to it, I’ll give time, lat, lon, bands in parentheses:

  • Single pixel, all S2 Bands (1,1,1,bands)
  • S2 band time-series (time,1,1,bands)
  • Single timestamp S2 chip (1,lat,lon,bands)
  • Chip time-series (time,lat,lon,bands)

My thought was that for the first two use-cases I will just derive a tabular dataset. That should be straightforward and make a lot more sense than trying to build a Zarr architecture that efficiently supports these use-cases.
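The flattening I have in mind is roughly this (the band values and `sample_id` here are made up, and the nested lists stand in for values read from the per-sample Zarr stores; the real output would go to parquet):

```python
def to_rows(sample_id, times, band_values):
    """Flatten one sample's band time-series into tabular rows.

    band_values: dict mapping band name -> list of values, one per
    time step. Returns one row (dict) per time step, so the table
    covers both the single-pixel and pixel-time-series use-cases.
    """
    rows = []
    for t_idx, t in enumerate(times):
        row = {"sample_id": sample_id, "time": t}
        for band, values in band_values.items():
            row[band] = values[t_idx]
        rows.append(row)
    return rows

rows = to_rows(
    "sample_0001",
    ["2020-01-05", "2020-01-15"],
    {"B2": [1204, 1198], "B8A": [2890, 3012]},
)
print(len(rows))      # 2
print(rows[0]["B2"])  # 1204
```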

For the other two use-cases, random access would be best served by chunking time to 1, but I am unsure how well that would work. I haven’t been able to get sharding working with xarray, and I noticed that with time chunks of 1, file operations on the Windows machine I’m preparing the data on already take forever, so I am wary of the I/O performance with tiny chunks like 1x128x128.
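Some back-of-envelope numbers on why time-chunks of 1 hurt. Everything except the 128x128 uint16 chip size and the ~4000 samples is an assumption (the band count and time-series length below are guesses, not values from my data):

```python
# Back-of-envelope for time-chunks of 1.
ITEMSIZE = 2            # uint16
CHIP = 128 * 128        # pixels per chip
N_SAMPLES = 4000
N_BANDS = 10            # ASSUMPTION: roughly B2-B8A + B11, B12, SCL
N_TIME = 100            # ASSUMPTION: time steps per sample, not stated

chunk_bytes = CHIP * 1 * ITEMSIZE
n_chunks = N_SAMPLES * N_BANDS * N_TIME

print(chunk_bytes)  # 32768 -> 32 KiB per chunk
print(n_chunks)     # 4000000 chunk files across the dataset
```

Millions of 32 KiB objects is exactly the regime where per-file overhead dominates, which is presumably why the Windows file operations crawl; Zarr v3 sharding (many chunks packed into one object) is meant to address this.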

If you have any experience with this, I’d be more than happy to get any insights.