Favorite way to go from netCDF (&xarray) to torch/TF/Jax et al

ThomasMGeo · August 17, 2022, 12:39am

Hello!

I am researching best practices for loading spatial geoscience data (generally netCDF/GRIB via xarray) into various ML packages. Two things kept coming up:

xbatcher project: GitHub - pangeo-data/xbatcher: Batch generation from xarray datasets
Noah Brenowitz’s blog post on netCDF to TensorFlow. I copied Noah’s work here, but my graph is not quite the same: netCDF2ML/noah_demo.ipynb at main · ThomasMGeo/netCDF2ML · GitHub.

Main questions are:

Are there other blogs, how-tos, or general documentation I should be aware of?
For researchers with tens to hundreds of GB of netCDF files per project, are there rules of thumb or ‘bad ideas’ that I should avoid?

weiji14 · August 17, 2022, 3:24am

Hi @ThomasMGeo, the answer on ‘how’ to read 10-100s of GBs of NetCDF files partly depends on whether you want to go for A) pure speed, or B) readability/metadata preservation.

If speed is the main goal, then you’ll probably want to convert those NetCDFs into a more tensor-friendly format like tfrecords, .npy files, webdataset, or so on. Zarr might be speedy as well if you can optimize the chunk sizes (which is another topic in itself).
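As a rough illustration of the "tensor-friendly format" route, here is a minimal sketch using plain .npy files with NumPy memory-mapping. The array is synthetic and stands in for values pulled out of a NetCDF file; the file name and shapes are made up for the example.

```python
import os
import tempfile

import numpy as np

# Synthetic stand-in for a variable read from NetCDF (e.g. the values of an
# xarray variable); the dims are assumed to be (time, lat, lon).
rng = np.random.default_rng(seed=0)
field = rng.standard_normal((100, 32, 64)).astype(np.float32)

# One-off conversion: dump the raw array to a .npy file.
tmpdir = tempfile.mkdtemp()
npy_path = os.path.join(tmpdir, "field.npy")  # hypothetical file name
np.save(npy_path, field)

# Training time: memory-map the file so only the requested samples are read.
mm = np.load(npy_path, mmap_mode="r")
batch_idx = np.sort(rng.choice(mm.shape[0], size=8, replace=False))
batch = np.asarray(mm[batch_idx])  # copies just these 8 time steps into RAM

print(batch.shape, batch.dtype)
```

Note that the coordinates and attributes are gone once the data is in .npy form, which is the metadata cost of option A.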

If you like metadata and are looking at keeping things more in the xarray world (at least until the last minute when you throw things into the GPU), then xbatcher is definitely recommended for doing multi-dimensional name-based slicing or fancy indexing. Cc @maxrjones and @jhamman. See also Efficiently slicing random windows for reduced xarray dataset - #25 by rabernat for another xarray-orientated example.

Personally, I’m more in the PyTorch ecosystem, and torchdata (see Tutorial — TorchData 0.4.1 (beta) documentation) with its composable Iterable-style DataPipes is the fancy new way of creating ML data pipelines. As a shameless plug, I’ve got an example tutorial at Chipping and batching data — zen3geo that walks through using DataPipes to load multiple GeoTIFFs with rioxarray, chipping them into 512x512 tiles with xbatcher, and loading them into a PyTorch DataLoader. Not exactly NetCDF, but the general workflow after opening the files with rioxarray should be reusable.
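The last step of that workflow, handing chips to a PyTorch DataLoader, might look roughly like the sketch below. This uses a plain map-style `torch.utils.data.Dataset` rather than torchdata DataPipes or the zen3geo code, and the `ChipDataset` class, chip size and random data are all hypothetical stand-ins.

```python
import numpy as np
import torch
from torch.utils.data import DataLoader, Dataset


class ChipDataset(Dataset):
    """Hypothetical map-style dataset over pre-cut image chips."""

    def __init__(self, chips: np.ndarray):
        # chips: (n_chips, height, width), e.g. stacked values of xarray chips
        self.chips = chips

    def __len__(self) -> int:
        return self.chips.shape[0]

    def __getitem__(self, idx: int) -> torch.Tensor:
        # torch.from_numpy wraps the NumPy buffer without copying it
        return torch.from_numpy(self.chips[idx])


# Synthetic stand-in: 10 chips of 64x64 (the tutorial uses 512x512 tiles)
chips = np.random.default_rng(0).random((10, 64, 64), dtype=np.float32)

loader = DataLoader(ChipDataset(chips), batch_size=4, shuffle=False)
batch_shapes = [tuple(batch.shape) for batch in loader]
print(batch_shapes)
```

The default collate function stacks the per-chip tensors along a new leading batch dimension, so the final batch is smaller when the number of chips is not a multiple of `batch_size`.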

Yes, avoid reinventing the wheel if possible. Asking on this forum definitely puts you on the right track. At the end of the day, you’ll need to customize things for your particular dataset, but there are some core tools that should hopefully be fairly standard for people doing things the Pangeo/xarray way.

ThomasMGeo · August 17, 2022, 3:59am

Really appreciate all the links/thoughts! Yes, the main goal is to not reinvent the wheel, for many reasons. Thank you for putting this all together.

Today, I am more on the metadata/xarray side than pure speed, but it is good to know what’s out there if a project’s objectives or size change that.

Best,
Thomas

RichardScottOZ · August 17, 2022, 7:56am

That’s basically my experience: the more efficiency and speed you want, the rawer the data you want. So for big projects, plain arrays will be faster.

We were actually toying with going even closer to the metal, with raw binary data, for a recent 200TB project. That would have meant rewriting some things, so there’s a tradeoff between speed and development effort there too.

Are you looking at doing it a lot, repeatably, or a one-off?

rabernat · August 17, 2022, 8:05am

While this is the conventional wisdom, in the blog post below, @nbren12 showed that, in fact, netCDF can be just as fast as those other formats in ML training loops.

weiji14 · August 17, 2022, 1:14pm

I love how this discussion is steering from metadata to speed. Just to clarify, NetCDF can indeed be fast enough if you’re going from File → CPU-RAM → GPU-RAM (assuming you’ve got enough I/O, RAM, etc.), as @rabernat pointed out.

Now, if you want end-to-end pure speed (and have enough GPU RAM) then the CPU-RAM to GPU-RAM data transfer will be the main bottleneck. You’ll then need to look at things like GPU direct storage:

This would involve using libraries that handle loading/pre-processing directly on the GPU like:

RAPIDS, in particular, CUCIM - GitHub - rapidsai/cucim for computer vision/image processing
NVIDIA DALI, which does data loading and augmentation on the GPU, see NVIDIA DALI Documentation — NVIDIA DALI 1.16.0 documentation

Caveat with this is that you can’t read NetCDFs or most ‘geo’ formats directly into GPU yet (as far as I’m aware). Relevant issues include:

That said, there is a way to map CPU/NumPy tensors to GPU/CuPy tensors in xarray with cupy-xarray, and then use GPU zero-copy methods to convert CuPy tensors to PyTorch/TensorFlow tensors. See:

But again, you will still need to load the NetCDF from File → CPU-RAM → GPU-RAM until someone figures out a more direct NetCDF file → GPU-RAM path. This has been on my wishlist for quite a while, and most of the interoperability standards are in place; we just need to get some smart people to do it.
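One of the interoperability standards behind zero-copy hand-offs between array libraries is DLPack. The sketch below only demonstrates the protocol on the CPU with NumPy (it assumes a NumPy version that provides `np.from_dlpack`, roughly 1.23 or newer); the GPU variant in the trailing comment is the analogous call and has not been run here.

```python
import numpy as np

# CPU stand-in for a CuPy array living in GPU memory.
src = np.arange(12, dtype=np.float32).reshape(3, 4)

# Import through the DLPack protocol: the consumer calls src.__dlpack__()
# and wraps the same buffer instead of copying it.
dst = np.from_dlpack(src)

# Both arrays refer to the same underlying memory.
print(np.shares_memory(src, dst))

# On a GPU machine the analogous (untested here) hand-off would be:
#   gpu_tensor = torch.from_dlpack(cupy_array)
```

Because no data is copied, the two objects are views of one buffer, so an in-place change through one of them is visible through the other.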

dcherian · August 17, 2022, 3:24pm

Since GPU DirectStorage came up, check this out: Add Kvikio backend entrypoint by dcherian · Pull Request #10 · xarray-contrib/cupy-xarray · GitHub

@weiji14 Do you have a machine you could test this out on?

weiji14 · August 17, 2022, 4:21pm

Oo, shiny! Yes, I’ve got a GPU, let me test that out.

Edit: I’ve documented the installation/setup commands to try out @dcherian’s xr.open_dataset(store, engine="kvikio") proof of concept here if anyone is interested. Very hacky stuff but it works!

Edit 2: New blog post out on going from Zarr stores to GPU-backed xarray objects! Read it at Enabling GPU-native analytics with Xarray and kvikIO
