Hi @ThomasMGeo, the answer to ‘how’ to read 10-100s of GBs of NetCDF files partly depends on whether you want to go for A) pure speed, or B) readability/metadata preservation.
If speed is the main goal, then you’ll probably want to convert those NetCDFs into a more tensor-friendly format like tfrecords, .npy files, webdataset, and so on. Zarr might be speedy as well if you can optimize the chunk sizes (which is another topic in itself).
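For the Zarr route, here’s a minimal sketch of the conversion step, assuming your files share coordinates and have dimensions named `time`/`lat`/`lon` (swap in your own dimension names, paths, and chunk sizes):

```python
import xarray as xr

# Lazily open many NetCDF files as one dataset (uses dask under the hood)
ds = xr.open_mfdataset("data/*.nc", combine="by_coords", parallel=True)

# Rechunk to match how you'll read the data later, e.g. one timestep at a
# time in 512x512 spatial tiles -- these names/sizes are just placeholders
ds = ds.chunk({"time": 1, "lat": 512, "lon": 512})

# Write out a single consolidated Zarr store
ds.to_zarr("data.zarr", mode="w", consolidated=True)
```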
If you like metadata and are looking at keeping things more in the xarray world (at least until the last minute when you throw things into the GPU), then xbatcher is definitely recommended for doing multi-dimensional, name-based slicing or fancy indexing. Cc @maxrjones and @jhamman. See also Efficiently slicing random windows for reduced xarray dataset - #25 by rabernat for another xarray-orientated example.
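Roughly, xbatcher looks like this in practice; the dimension names, window sizes, and overlaps below are assumptions you’d adapt to your dataset:

```python
import xarray as xr
import xbatcher

ds = xr.open_dataset("example.nc")  # or open_mfdataset / open_zarr

# Generate fixed-size, name-based windows over the spatial dimensions
bgen = xbatcher.BatchGenerator(
    ds,
    input_dims={"lat": 512, "lon": 512},   # window size along each dim
    input_overlap={"lat": 64, "lon": 64},  # optional overlap between windows
)

for batch in bgen:
    # each batch is still an xarray Dataset, with coords and attrs intact
    print(batch.sizes)
    break
```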
Personally, I’m more in the Pytorch ecosystem, and torchdata (see Tutorial — TorchData 0.4.1 (beta) documentation) with its composable Iterable-style DataPipes is the fancy new way of creating ML data pipelines. As a shameless plug, I’ve got an example tutorial at Chipping and batching data — zen3geo that walks through using DataPipes to load multiple GeoTIFFs with rioxarray, chip them into 512x512 tiles with xbatcher, and load them into a Pytorch DataLoader. Not exactly NetCDF, but the general workflow post rioxarray.open should be reusable.
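Not the exact code from that tutorial, but here’s roughly the shape of the DataPipe workflow, with hypothetical file names, dimension names, and chip sizes:

```python
import rioxarray
import torch
import xbatcher
from torch.utils.data import DataLoader
from torchdata.datapipes.iter import IterableWrapper

def chips_from_file(filepath: str):
    """Yield 512x512 chips from one GeoTIFF as torch tensors."""
    da = rioxarray.open_rasterio(filepath)  # DataArray with band/y/x dims
    bgen = xbatcher.BatchGenerator(da, input_dims={"y": 512, "x": 512})
    for chip in bgen:
        yield torch.as_tensor(chip.values)

# Compose the pipeline: filepaths -> chips -> batched tensors
dp = IterableWrapper(["scene1.tif", "scene2.tif"])  # hypothetical files
dp = dp.flatmap(chips_from_file)  # one DataPipe element per chip
dataloader = DataLoader(dp, batch_size=8)

for batch in dataloader:
    print(batch.shape)  # e.g. (8, bands, 512, 512)
    break
```

Swapping `rioxarray.open_rasterio` for `xr.open_dataset`/`open_zarr` should get you most of the way to the NetCDF version of this.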
Yes, avoid reinventing the wheel if possible! Asking on this forum definitely puts you on the right track. At the end of the day, you’ll still need to customize things for your particular dataset, but there are some core/fundamental tools that should hopefully be fairly standard for people doing things the Pangeo/xarray way.