Create batches of random subsets of data scattered across different files

Hi all,
I am currently dealing with several ML projects with a similar problem:
My training data is scattered across hundreds of NetCDF files, and I need batches of random subsets from all those files. E.g., say I have 100 files, each containing several variables on a 128x128 grid, but my training samples should be batches of 16 samples of 32x32 each, randomly picked from the data across all files.
Until now I was preprocessing all the data into pickled torch tensors of 32x32 that I could then sample randomly with a torch DataLoader. I am now aiming to put that whole process into one pipeline that picks a subset from one of the files, does some pre-processing on it, and assembles it with others into a batch that can be used in PyTorch. This blog post, which was discussed a lot on this forum, seems to cover many of the crucial steps needed (efficient parallel loading, batching using xbatcher, etc.). What I am still missing is:

  1. How to batch randomly (AFAIK xbatcher still lists shuffling in its roadmap, meaning it is not supported yet).
  2. How to do this across a large number of files. So far, the only way I see is to use xr.open_mfdataset before passing the result to xbatcher. Still, many posts report problems with xr.open_mfdataset for large numbers of files. Apart from that, I am not sure whether xbatcher could, e.g., batch in lat and lon if the dataset contains discontinuous coordinates (some parts of the mfdataset coming from one region/file, others coming from somewhere else).

Before I start coding something myself, I would like to know if I am missing some package or repo that already provides what I need, or if my assumptions about the existing packages are out of date or wrong.

Thanks for your help!


Hello,

To your first question: you can get random batches with xbatcher by passing shuffle=True to the DataLoader it connects with. My understanding is:

  • torch Datasets map from an integer index → a training example.
  • xbatcher helps you go from the training example index → a slice of your xarray. Then it’s on you to turn that slice into tensors your model can train on.

So if the training example indices are randomly chosen, you get random batches, and setting shuffle=True in the DataLoader constructor should accomplish that. I believe the Earthmover post has this setup.
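A minimal sketch of that wiring (the wrapper class, the y/x dimension names, and the variable name "field" are my own; I believe xbatcher also ships torch adapters in xbatcher.loaders.torch that play a similar role):

```python
import numpy as np
import torch
import xarray as xr
import xbatcher

# Toy stand-in for one file: a single variable on a 128x128 grid.
ds = xr.Dataset(
    {"field": (("y", "x"), np.random.rand(128, 128).astype("float32"))}
)

# xbatcher maps an integer index -> one 32x32 window of the Dataset.
bgen = xbatcher.BatchGenerator(ds, input_dims={"y": 32, "x": 32})

class WindowDataset(torch.utils.data.Dataset):
    """Adapt the BatchGenerator so the DataLoader can index windows."""

    def __init__(self, bgen, variable):
        self.bgen = bgen
        self.variable = variable

    def __len__(self):
        return len(self.bgen)

    def __getitem__(self, idx):
        window = self.bgen[idx]  # an xr.Dataset slice
        return torch.as_tensor(window[self.variable].values)

# shuffle=True randomizes which windows land in each batch of 16.
loader = torch.utils.data.DataLoader(
    WindowDataset(bgen, "field"), batch_size=16, shuffle=True
)
```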

To your second question, I’m dealing with a similar problem (very sparsely populated xarray). The solution I ended up using was to precompute all indices in my array that had valid windows. Will any of your 32x32 tensors span multiple files? If not, you could iterate over each file, store the positions of valid windows, and then randomly pull windows during training. If the example index is randomized, then you would be pulling data from multiple files in each batch.
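For a single array, the precomputation step looks roughly like this (a sketch, assuming "valid" means the window contains no NaNs; the function name and stride are illustrative):

```python
import numpy as np
import xarray as xr

def valid_window_origins(da: xr.DataArray, size: int = 32, stride: int = 32):
    """Return (y, x) origins of size-by-size windows with no missing values.

    Brute force for clarity; a summed-area table over the NaN mask would
    avoid re-checking overlapping windows on large arrays.
    """
    finite = np.isfinite(da.values)
    origins = []
    for y0 in range(0, finite.shape[0] - size + 1, stride):
        for x0 in range(0, finite.shape[1] - size + 1, stride):
            if finite[y0:y0 + size, x0:x0 + size].all():
                origins.append((y0, x0))
    return origins
```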

The disadvantage of that approach is you take a full pass through your data before you can start training. The “warmup” time can be a few minutes, but once done the training loop is very performant.

I am still working through the best way to do this; a version that works for a single array is here. Maybe that's helpful?

Thanks, I must have missed that shuffle flag.
Still, it seems that extracting shuffled windows across different files would be difficult with xbatcher.
In my case, windows won't span multiple files, but the number of windows per file can differ. This means I also need to scan my files before training. So far I have started writing a file indexer class which does something similar to your code example, i.e. it checks the size of all files in the training directory and determines the number and positions of valid windows per file (plus some more sanity checks).
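Roughly what I have in mind (a sketch only; the y/x dimension names, the variable name "field", and the glob pattern are placeholders for my actual data, and the validity/sanity checks are omitted):

```python
import glob
import torch
import xarray as xr

class WindowIndex(torch.utils.data.Dataset):
    """Scan every file once, record (path, y0, x0) per window, and open
    files lazily per example. With shuffle=True in the DataLoader, each
    batch then mixes windows from many files."""

    def __init__(self, pattern, variable, size=32):
        self.variable = variable
        self.size = size
        self.entries = []
        for path in sorted(glob.glob(pattern)):
            with xr.open_dataset(path) as ds:
                ny, nx = ds.sizes["y"], ds.sizes["x"]
            # Non-overlapping tiling; a real indexer would also run the
            # validity checks here and skip bad windows.
            for y0 in range(0, ny - size + 1, size):
                for x0 in range(0, nx - size + 1, size):
                    self.entries.append((path, y0, x0))

    def __len__(self):
        return len(self.entries)

    def __getitem__(self, idx):
        path, y0, x0 = self.entries[idx]
        with xr.open_dataset(path) as ds:
            win = ds[self.variable].isel(
                y=slice(y0, y0 + self.size), x=slice(x0, x0 + self.size)
            ).load()
        return torch.as_tensor(win.values)

loader = torch.utils.data.DataLoader(
    WindowIndex("train/*.nc", "field"), batch_size=16, shuffle=True
)
```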
I haven’t gotten to using this during training, but good to hear that it performs well in your case. I’ll report once I get there.