In my work I’m struggling with providing data from xarray to a machine learning model. I’m aware of tools like xbatcher, discussed in this blog post and this other thread. I run into two main sticking points:
Randomly shuffling examples is very important. We get vastly different model performance depending on the order in which data is provided during training (see notebook).
Constructing windowed data with xarray is very memory intensive. If I want to slice out all non-NA windows of a certain size, I have to iterate through small chunks of the data (this approach was the solution suggested in the thread linked above).
My process now is to write an intermediate dataset for a given window size, drop NAs, shuffle, then train a model. This works, but then every time I want to modify the input data (e.g. try a 5x5 window instead of 3x3) I have to write a new intermediate dataset.
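For concreteness, here is a minimal sketch of that intermediate-dataset workflow. File names, variable names, and dimension names are placeholders, not my actual code:

```python
import numpy as np
import xarray as xr

ds = xr.open_dataset("surveys.nc")  # placeholder for the raw raster dataset

# Build a 3x3 window around every pixel, one sample per (time, y, x) center.
windowed = (
    ds.rolling(x=3, y=3, center=True)
    .construct(x="x_win", y="y_win")
    .stack(sample=("time", "y", "x"))
)

# Drop any sample whose window contains an NA, then shuffle.
valid = windowed.dropna(dim="sample", how="any")
order = np.random.default_rng(0).permutation(valid.sizes["sample"])
shuffled = valid.isel(sample=order)

# Write the intermediate dataset the training loop reads from
# (netCDF can't store a MultiIndex, so flatten it first).
shuffled.reset_index("sample").to_netcdf("windowed_3x3_shuffled.nc")
```

The pain is that the window size is baked into this file, so changing it means regenerating the whole intermediate dataset.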
My data is only in the 10s of GB range, so if I’m struggling at this scale I can’t help but think there is a better way to provide data to a model. Has anyone on this forum had more success in the time since that Earthmover blog post was published?
Happy to provide more details (and a toy dataset) if that is useful.
Both geographic and time coordinates. Generally I find that the initial dataset and the windowed dataset without NAs can fit in memory, but the immediate result of calling Dataset.rolling(...).construct(...) cannot.
If I’m constructing a 5x5 window, the resulting array takes 25x the space of the original (even though the vast majority of those windows are NA and get thrown out). So if my dataset is 10 GB, the windowed dataset quickly gets too large to fit in memory. My workaround was to iterate over chunks in x/y/time and do the windowing on each chunk. In my case I suppose I can just hold the model-ready data in an in-memory array instead of writing it to disk.
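A stripped-down version of that chunked workaround might look like the sketch below. To keep it simple I chunk along time only, so the (purely spatial) windows never straddle a chunk edge; the chunk size and file name are made up:

```python
import xarray as xr

ds = xr.open_dataset("surveys.nc")
win = 5  # 5x5 spatial window

pieces = []
for t0 in range(0, ds.sizes["time"], 10):  # process 10 time steps at a time
    chunk = ds.isel(time=slice(t0, t0 + 10))
    w = (
        chunk.rolling(x=win, y=win, center=True)
        .construct(x="x_win", y="y_win")
        .stack(sample=("time", "y", "x"))
        .dropna(dim="sample", how="any")  # most windows vanish here
    )
    if w.sizes["sample"] > 0:
        # reset the MultiIndex so the pieces concatenate cleanly
        pieces.append(w.reset_index("sample").load())

# The surviving windows are small enough to keep as one in-memory array.
model_ready = xr.concat(pieces, dim="sample")
```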
But what if the result of slicing/windowing/etc. is too big to fit in memory? In that case, wouldn’t we have to write an intermediate dataset? I’m not really “stuck” on anything here, just wondering if there is a better way than what I am already doing. And, if someone has found a way to do this workflow without an intermediate, I might get better performance doing something similar.
I would want to dig into your windowed dataset to see how much of it really is NaNs, and how you’re constructing it. It would be good to have a notebook with a test dataset to share here.
I put a test dataset up on Zenodo, and a notebook showing what I’m doing right now is available here. Around 80% of the raw data is NA; once windowed, that proportion goes up to around 90%. These are polygons from aerial surveys that I am rasterizing myself, and the missingness results from the geometry of where the surveys happen, not from underlying data quality.
Thanks for the mention of Zen3Geo; I had not seen that library before. I see that it wraps xbatcher, so I’ll have to check it out.
Thanks for that notebook, especially the memory-monitoring strategy. That notebook implies that memory usage peaks at nearly 200x the size of the original array! I was able to get things working much better by omitting the stack step (see script). If I can get the indices of valid windows, then I can just pull data out of the array as needed during training instead of doing a reshape. I suspect that construct creates a strided view into the array without copying data, while stack triggers a copy.
On the small dataset, finding all the valid indices has a peak memory usage of about 2x the original array; on the full dataset it’s around 5x. Much more workable than 200x.
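In case it helps anyone else, here is roughly what that indices-only approach looks like. This is a sketch with placeholder names ("surveys.nc", "var") and an assumed (time, y, x) dimension order, not the actual script:

```python
import numpy as np
import xarray as xr

ds = xr.open_dataset("surveys.nc")
win, r = 5, 2  # 5x5 window, radius 2

# A center is valid when every cell in its window is non-NA.
# rolling(...).count() gives the number of non-NA cells per window
# without ever materialising a stacked copy.
full = ds["var"].rolling(x=win, y=win, center=True).count() == win * win

# Indices of valid centers, in the array's (time, y, x) dimension order,
# shuffled once up front.
t_idx, y_idx, x_idx = np.nonzero(full.values)
order = np.random.default_rng(0).permutation(t_idx.size)
t_idx, y_idx, x_idx = t_idx[order], y_idx[order], x_idx[order]

def get_example(i):
    """Slice the i-th training window out of the raw array on demand."""
    t, y, x = t_idx[i], y_idx[i], x_idx[i]
    return ds["var"].isel(
        time=t, y=slice(y - r, y + r + 1), x=slice(x - r, x + r + 1)
    ).values
```

The training loop then only ever touches small slices of the original array, so nothing close to the full windowed dataset has to exist at once.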