Hello, first-time contributor here. I'm hoping someone with more experience can sanity-check the approach I've cooked up so far for distributing training data.
Context
I have a bunch of manually labelled data. The labels segment full Sentinel-2 pixel trajectories into periods of forest state (think: healthy, clear-cut, revegetation, bark beetle, etc.).
The training data I am now collecting also takes spatial context into account, so for each sample I have a stack of Sentinel-2 chips covering the full labelled time series.
Data structure
The data structure I have right now is one Zarr store per sample with these parameters:
- dimensions: 128x128xtime
- dtype: uint16
- groups: one per Sentinel-2 band
- chunks: 128x128x32 (128 × 128 × 32 × 2 bytes = 1 MiB per chunk)
- total number of samples ~4000, size on disk around 150 GB
I also have a parquet file with labels and other sampling metadata, which I will also use to construct train/test splits.
Question
In the end I'm looking to construct a PyTorch dataloader along the lines of torchgeo's BigEarthNet dataset (torchgeo/datasets/bigearthnet.py in microsoft/torchgeo on GitHub). That dataset reads TIFF files and does not handle time series, so it is not exactly my case.
I've also seen the Earthmover blog post "Cloud native data loaders for machine learning using Zarr and Xarray", which uses xbatcher to efficiently query a Zarr store in a deep learning pipeline. However, that approach uses a single Zarr store, while I have thousands of small ones.
My main issue now is how to handle the chunking. Smaller chunks would make the I/O much slower. But as it stands, once a chunk is loaded it should be passed through the training pipeline in full to make the read worthwhile, and the pixels within a single chunk are highly autocorrelated, so to get good variance within a batch, batches would need to be quite large if whole chunks are passed through.
Has someone done something like this before? Is using zarr in this case overkill?