DL Training Dataset - to Zarr or not to Zarr?

Hello, first time contributor here. My hope is that someone with more experience can sanity check the approach I’ve cooked up so far to distribute training data.

Context

I have a bunch of manually labelled data. The labels segment full Sentinel 2 pixel trajectories into periods of forest state (think: healthy, clear cut, revegetation, bark beetle, etc.).

The training data I am now collecting also takes context into account, so for each sample I have a stack of Sentinel 2 chips for the full labelled time-series.

Data structure

The data structure I have right now is separate zarrs per sample with these parameters:

  • dimensions: 128x128xtime
  • dtype: uint16
  • groups: Sentinel 2 bands
  • chunks: 128x128x32 (should yield chunks of ~1MB size)
  • Total number of samples ~4000, size on disk of around 150GB
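As a quick check on the "~1MB" figure, the chunk size follows directly from the shape and dtype (plain arithmetic, no Zarr needed):

```python
# Verify the "~1 MB per chunk" estimate for uint16 chunks of 128x128x32.
ITEMSIZE = 2  # bytes per uint16 element

def chunk_nbytes(shape, itemsize=ITEMSIZE):
    """Size in bytes of one chunk with the given shape."""
    n = 1
    for dim in shape:
        n *= dim
    return n * itemsize

per_chunk = chunk_nbytes((128, 128, 32))
print(per_chunk)          # 1048576 bytes
print(per_chunk / 2**20)  # 1.0 -- exactly 1 MiB per band
```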

I also have a parquet file with labels and other sampling metadata, which I will also use to construct train/test splits.

Question

In the end I’m looking to construct a PyTorch dataloader like torchgeo’s BigEarthNet dataset (torchgeo/datasets/bigearthnet.py in microsoft/torchgeo on GitHub). That dataset just uses TIFF files and does not provide time series, so it is not exactly like my case.
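For concreteness, here is a minimal sketch of the map-style dataset I have in mind. To keep the snippet dependency-free, the actual Zarr read is injected as a callable (in practice something like `lambda p: xr.open_zarr(p)[bands].to_array().values`); `sample_paths`, `labels`, and `load_sample` are illustrative names, not an existing API. Subclassing `torch.utils.data.Dataset` only requires `__len__` and `__getitem__`, so the structure carries over directly:

```python
class ChipTimeSeriesDataset:
    """Map-style dataset: one Zarr store per sample.

    In a real pipeline this would subclass torch.utils.data.Dataset;
    only __len__ and __getitem__ are needed for that.
    """

    def __init__(self, sample_paths, labels, load_sample):
        assert len(sample_paths) == len(labels)
        self.sample_paths = sample_paths
        self.labels = labels
        self.load_sample = load_sample  # path -> array (bands, time, y, x)

    def __len__(self):
        return len(self.sample_paths)

    def __getitem__(self, idx):
        x = self.load_sample(self.sample_paths[idx])
        return x, self.labels[idx]

# Usage with a stand-in loader (no Zarr involved):
fake_loader = lambda path: [[0.0]]  # pretend array
ds = ChipTimeSeriesDataset(["a.zarr", "b.zarr"], [1, 0], fake_loader)
print(len(ds))  # 2
x, y = ds[0]
print(y)        # 1
```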

I’ve also seen the Earthmover blog post “Cloud native data loaders for machine learning using Zarr and Xarray”, which uses xbatcher to efficiently query a Zarr store for a deep learning pipeline. However, that approach uses a single Zarr store, while I have 1000s of small ones.

My main issue now is how to handle the chunking. Smaller chunks would make the I/O much slower. But as it stands, once a chunk is loaded it should be passed through the training pipeline in full to make the load worthwhile. A single chunk is highly autocorrelated, so to get good variance within a batch, batches would need to be quite large if full chunks are consumed.
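One common way to square that circle is a shuffle buffer between chunk loading and batching: each chunk is still fully consumed, but its samples get interleaved with samples from many other chunks before batches are formed. A minimal sketch (the chunk-id tuples are just for illustration):

```python
import random

def shuffle_buffer(iterable, buffer_size, seed=0):
    """Yield items from `iterable` in approximately random order.

    Fills a buffer of `buffer_size` items, then repeatedly yields a
    random buffer slot and refills it with the next incoming item.
    Every item is yielded exactly once, so each loaded chunk is still
    fully used -- but its samples end up spread across many batches.
    """
    rng = random.Random(seed)
    buf = []
    for item in iterable:
        if len(buf) < buffer_size:
            buf.append(item)
        else:
            j = rng.randrange(buffer_size)
            yield buf[j]
            buf[j] = item
    rng.shuffle(buf)
    yield from buf

# Samples arrive chunk-by-chunk; tuples encode (chunk, sample-in-chunk):
stream = [(c, i) for c in range(4) for i in range(8)]  # 4 chunks x 8 samples
mixed = list(shuffle_buffer(stream, buffer_size=16))
print(sorted(mixed) == sorted(stream))  # True: nothing lost or duplicated
```

The larger the buffer, the better the decorrelation, at the cost of holding more decoded samples in memory.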

Has someone done something like this before? Is using zarr in this case overkill?

Hello and welcome @jonasViehweger! :waving_hand:

Could you clarify what constitutes an individual sample for your data? How many variables / bands per sample? Does the sample itself contain multiple time steps? How are time and sample related?

This is important because you want to optimize for loading individual samples as quickly as possible but also for random access to samples.

Hey @rabernat,

good question. The thing is that I am planning to provide a few different datasets with different features. I am providing Sentinel 2 bands B2 to B8A as well as B11, B12 and SCL. One thing I would like to keep is that bands should be selectable.

Here’s some datasets I thought of, going from simple to more complex. To put dimensions to it, I’ll give time, lat, lon, bands in parentheses:

  • Single pixel, all S2 Bands (1,1,1,bands)
  • S2 band time-series (time,1,1,bands)
  • Single timestamp S2 chip (1,lat,lon,bands)
  • Chip time-series (time,lat,lon,bands)

My thought was that for the first two use-cases I will just derive a tabular dataset. That should be straightforward and make a lot more sense than trying to build a Zarr architecture that efficiently supports these use-cases.
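The flattening I have in mind is roughly this (the band values and `sample_id` here are made up, and the nested lists stand in for values read from the per-sample Zarr stores; the real output would go to parquet):

```python
def to_rows(sample_id, times, band_values):
    """Flatten one sample's band time-series into tabular rows.

    band_values: dict mapping band name -> list of values, one per
    time step. Returns one row (dict) per time step, so the table
    covers both the single-pixel and pixel-time-series use-cases.
    """
    rows = []
    for t_idx, t in enumerate(times):
        row = {"sample_id": sample_id, "time": t}
        for band, values in band_values.items():
            row[band] = values[t_idx]
        rows.append(row)
    return rows

rows = to_rows(
    "sample_0001",
    ["2020-01-05", "2020-01-15"],
    {"B2": [1204, 1198], "B8A": [2890, 3012]},
)
print(len(rows))      # 2
print(rows[0]["B2"])  # 1204
```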

For the other two use-cases, random access would be best served by chunking time to 1, but I am unsure how well that would work. I haven’t been able to get sharding working with xarray, and I noticed that with time chunks of 1, file operations on the Windows machine I’m preparing the data on already take forever, so I am wary of the I/O performance with tiny chunks like 1x128x128.
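Some back-of-envelope numbers on why time-chunks of 1 hurt. Everything except the 128x128 uint16 chip size and the ~4000 samples is an assumption (the band count and time-series length below are guesses, not values from my data):

```python
# Back-of-envelope for time-chunks of 1.
ITEMSIZE = 2            # uint16
CHIP = 128 * 128        # pixels per chip
N_SAMPLES = 4000
N_BANDS = 10            # ASSUMPTION: roughly B2-B8A + B11, B12, SCL
N_TIME = 100            # ASSUMPTION: time steps per sample, not stated

chunk_bytes = CHIP * 1 * ITEMSIZE
n_chunks = N_SAMPLES * N_BANDS * N_TIME

print(chunk_bytes)  # 32768 -> 32 KiB per chunk
print(n_chunks)     # 4000000 chunk files across the dataset
```

Millions of 32 KiB objects is exactly the regime where per-file overhead dominates, which is presumably why the Windows file operations crawl; Zarr v3 sharding (many chunks packed into one object) is meant to address this.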

If you have any experience with this, I’d be more than happy to get any insights.