In my work I’m struggling with providing data from xarray to a machine learning model. I’m aware of tools like xbatcher, discussed in this blog post and this other thread. I run into two main sticking points:
Randomly shuffling examples is very important. We get vastly different model performance depending on the order in which data is provided during training (see notebook).
Constructing windowed data with xarray is very memory intensive. If I want to slice out all non-NA windows of a certain size, I have to iterate through small chunks of the data (this approach was the solution suggested in the thread linked above).
My process now is to write an intermediate dataset for a given window size, drop NAs, shuffle, then train a model. This works, but then every time I want to modify the input data (e.g. try a 5x5 window instead of 3x3) I have to write a new intermediate dataset.
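For concreteness, here is a minimal sketch of that intermediate-dataset workflow. File names, variable names, and dimension names are placeholders, not my actual code:

```python
import numpy as np
import xarray as xr

ds = xr.open_dataset("surveys.nc")  # placeholder for the raw raster dataset

# Build a 3x3 window around every pixel, one sample per (time, y, x) center.
windowed = (
    ds.rolling(x=3, y=3, center=True)
    .construct(x="x_win", y="y_win")
    .stack(sample=("time", "y", "x"))
)

# Drop any sample whose window contains an NA, then shuffle.
valid = windowed.dropna(dim="sample", how="any")
order = np.random.default_rng(0).permutation(valid.sizes["sample"])
shuffled = valid.isel(sample=order)

# Write the intermediate dataset the training loop reads from
# (netCDF can't store a MultiIndex, so flatten it first).
shuffled.reset_index("sample").to_netcdf("windowed_3x3_shuffled.nc")
```

The pain is that the window size is baked into this file, so changing it means regenerating the whole intermediate dataset.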
My data is only in the 10s of GB range, so if I’m struggling at this scale I can’t help but think there is a better way to provide data to a model. Has anyone on this forum had more success in the time since that Earthmover blog post was published?
Happy to provide more details (and a toy dataset) if that is useful.
Both geographic and time coordinates. Generally I find that the initial dataset and the windowed dataset without NAs can fit in memory, but the immediate result of calling Dataset.rolling(...).construct(...) cannot.
If I’m constructing a 5x5 window, the resulting array takes 25x the space of the original (even though the vast majority of those windows are NA and get thrown out). So if my dataset is 10 GB, the windowed dataset quickly gets too large to fit in memory. My workaround was to iterate over chunks in x/y/time and do the windowing on each chunk. In my case I suppose I can just hold the model-ready data in an in-memory array instead of writing it to disk.
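A stripped-down version of that chunked workaround might look like the sketch below. To keep it simple I chunk along time only, so the (purely spatial) windows never straddle a chunk edge; the chunk size and file name are made up:

```python
import xarray as xr

ds = xr.open_dataset("surveys.nc")
win = 5  # 5x5 spatial window

pieces = []
for t0 in range(0, ds.sizes["time"], 10):  # process 10 time steps at a time
    chunk = ds.isel(time=slice(t0, t0 + 10))
    w = (
        chunk.rolling(x=win, y=win, center=True)
        .construct(x="x_win", y="y_win")
        .stack(sample=("time", "y", "x"))
        .dropna(dim="sample", how="any")  # most windows vanish here
    )
    if w.sizes["sample"] > 0:
        # reset the MultiIndex so the pieces concatenate cleanly
        pieces.append(w.reset_index("sample").load())

# The surviving windows are small enough to keep as one in-memory array.
model_ready = xr.concat(pieces, dim="sample")
```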
But what if the result of slicing/windowing/etc. is too big to fit in memory? In that case, wouldn’t we have to write an intermediate dataset? I’m not really “stuck” on anything here, just wondering if there is a better way than what I am already doing. And, if someone has found a way to do this workflow without an intermediate, I might get better performance doing something similar.
I would want to dig into your windowed dataset to see how much of it really is NaNs, and how you’re constructing it. It would be good to have a notebook with a test dataset to share here.
I put a test dataset up on Zenodo, and a notebook showing what I’m doing right now is available here. Around 80% of the raw data is NA; once windowed, that proportion goes up to around 90%. These are polygons from aerial surveys that I am rasterizing myself, and the missingness results from the geometry of where the surveys happen, not from underlying data quality.
Thanks for the mention of Zen3Geo; I had not seen that library before. I see that it wraps xbatcher, so I’ll have to check it out.
Thanks for that notebook, especially the memory-monitoring strategy. That notebook implies that memory usage peaks at nearly 200x the size of the original array! I was able to get things working much better by omitting the stack step (see script). If I can get the indices of valid windows, then I can just pull data out of the array as needed during training instead of doing a reshape. I suspect that construct creates a strided view into the array without copying data, while stack triggers a copy.
On the small dataset, finding all the valid indices has a peak memory usage of about 2x the original array; on the full dataset it’s around 5x. Much more workable than 200x.
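In case it helps anyone else, here is roughly what that indices-only approach looks like. This is a sketch with placeholder names ("surveys.nc", "var") and an assumed (time, y, x) dimension order, not the actual script:

```python
import numpy as np
import xarray as xr

ds = xr.open_dataset("surveys.nc")
win, r = 5, 2  # 5x5 window, radius 2

# A center is valid when every cell in its window is non-NA.
# rolling(...).count() gives the number of non-NA cells per window
# without ever materialising a stacked copy.
full = ds["var"].rolling(x=win, y=win, center=True).count() == win * win

# Indices of valid centers, in the array's (time, y, x) dimension order,
# shuffled once up front.
t_idx, y_idx, x_idx = np.nonzero(full.values)
order = np.random.default_rng(0).permutation(t_idx.size)
t_idx, y_idx, x_idx = t_idx[order], y_idx[order], x_idx[order]

def get_example(i):
    """Slice the i-th training window out of the raw array on demand."""
    t, y, x = t_idx[i], y_idx[i], x_idx[i]
    return ds["var"].isel(
        time=t, y=slice(y - r, y + r + 1), x=slice(x - r, x + r + 1)
    ).values
```

The training loop then only ever touches small slices of the original array, so nothing close to the full windowed dataset has to exist at once.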