Favorite way to go from netCDF (&xarray) to torch/TF/Jax et al

weiji14 · August 17, 2022, 3:24am

Hi @ThomasMGeo, the answer on ‘how’ to read 10-100s of GBs of NetCDF files partly depends on whether you want to go for A) pure speed, or B) readability/metadata preservation.

If speed is the main goal, then you’ll probably want to convert those NetCDFs into a more tensor-friendly format like TFRecords, .npy files, WebDataset, and so on. Zarr can be speedy as well if you optimize the chunk sizes for your access pattern (which is another topic in itself).
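As a rough illustration of the speed-first route, here is a minimal sketch that dumps one variable to a .npy file and memory-maps it back. The in-memory dataset stands in for something opened with `xr.open_dataset`, and the variable name `t2m`, the dimension sizes and the output path are all made up for the example.

```python
import tempfile
from pathlib import Path

import numpy as np
import xarray as xr

# Synthetic stand-in for a dataset opened from a NetCDF file;
# the variable name "t2m" and the dimension sizes are hypothetical.
ds = xr.Dataset(
    {"t2m": (("time", "lat", "lon"), np.random.rand(6, 4, 5).astype("float32"))},
    coords={
        "time": np.arange(6),
        "lat": np.linspace(-10, 10, 4),
        "lon": np.linspace(0, 40, 5),
    },
)

out_dir = Path(tempfile.mkdtemp())
npy_path = out_dir / "t2m.npy"

# Dump the raw values; coordinates and attributes are lost at this point.
np.save(npy_path, ds["t2m"].values)

# Memory-map the file so that samples are read lazily from disk.
arr = np.load(npy_path, mmap_mode="r")
sample = np.asarray(arr[0])  # first time step, shape (lat, lon)
```

The trade-off is visible in the comments: the array loads quickly, but the coordinates and attributes have to be tracked separately.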

If you like metadata and are looking at keeping things more in the xarray world (at least until the last minute when you throw things into the GPU), then xbatcher is definitely recommended for doing multi-dimensional name-based slicing or fancy indexing. Cc @maxrjones and @jhamman. See also Efficiently slicing random windows for reduced xarray dataset - #25 by rabernat for another xarray-orientated example.

Personally, I’m more in the PyTorch ecosystem, and torchdata (see the TorchData 0.4.1 (beta) tutorial) with its composable Iterable-style DataPipes is the fancy new way of creating ML data pipelines. As a shameless plug, I’ve got an example tutorial in zen3geo, "Chipping and batching data", that walks through using DataPipes to load multiple GeoTIFFs with rioxarray, chipping them into 512x512 tiles with xbatcher, and loading them into a PyTorch DataLoader. Not exactly NetCDF, but the general workflow after rioxarray.open_rasterio should be reusable.
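The overall shape of that workflow (open, chip, hand to a DataLoader) can be sketched without depending on a particular torchdata release. Note that this is not the zen3geo tutorial code: it swaps DataPipes for a plain map-style `torch.utils.data.Dataset` and swaps xbatcher for manual `isel` slicing. `ChipDataset`, the random 3-band array and the chip size of 4 are all hypothetical.

```python
import numpy as np
import torch
import xarray as xr
from torch.utils.data import DataLoader, Dataset


class ChipDataset(Dataset):
    """Map-style dataset over fixed-size (y, x) chips (hypothetical helper)."""

    def __init__(self, da: xr.DataArray, chip_size: int):
        self.da = da
        self.chip_size = chip_size
        # Upper-left corner of every full chip; partial edge chips are skipped.
        self.corners = [
            (y0, x0)
            for y0 in range(0, da.sizes["y"] - chip_size + 1, chip_size)
            for x0 in range(0, da.sizes["x"] - chip_size + 1, chip_size)
        ]

    def __len__(self) -> int:
        return len(self.corners)

    def __getitem__(self, idx: int) -> torch.Tensor:
        y0, x0 = self.corners[idx]
        chip = self.da.isel(
            y=slice(y0, y0 + self.chip_size),
            x=slice(x0, x0 + self.chip_size),
        )
        # Metadata is dropped only here, when the chip becomes a tensor.
        return torch.from_numpy(np.ascontiguousarray(chip.values))


# Random 3-band 8x8 raster standing in for a file opened with rioxarray.
da = xr.DataArray(
    np.random.rand(3, 8, 8).astype("float32"), dims=("band", "y", "x")
)
loader = DataLoader(ChipDataset(da, chip_size=4), batch_size=2, shuffle=False)
first_batch = next(iter(loader))  # tensor of shape (batch, band, y, x)
```

Keeping the data as an xarray object until `__getitem__` means the name-based indexing stays available right up to the point where the tensor is created.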

Yes, avoid reinventing the wheel if possible. Asking on this forum definitely puts you on the right track. At the end of the day, you’ll need to customize things for your particular dataset, but there are some core tools that should hopefully be fairly standard for people doing things the Pangeo/xarray way.
