I am training a ConvLSTM model in PyTorch to predict vegetation condition (NDVI) from a standardized precipitation series (e.g. SPI). I would like to ask for opinions on best practices for pre-processing the data before training my model.
I have cropped my area of interest (the Horn of Africa), and after this operation a large fraction of each image consists of missing values (NaN). A convolutional network needs rectangular inputs of a fixed pixel size (h*w), but my cropped region is irregular, so I need to decide how to handle the missing values generated by the crop. Potential solutions might be:
- imputing missing input (precipitation) values with the median (e.g. 0) or the nearest neighbor
- interpolating nulls as in the rioxarray example "Interpolate Missing Data" (rioxarray 0.14.0 documentation)
- filling with a value that has no physical meaning and only signals the null (e.g. -99)
- leaving the image as it is, without cropping to the area of interest
The drawback of 1) is that 0 is a meaningful value in the NDVI series, whereas 2) artificially assigns the ocean pixels the NDVI values of the coastal area (or the median NDVI value). 3) makes the RMSE explode, whereas 4) leaves the images with their original values, but this might compromise the convolution operation for the pixels next to the coast.
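To make 1) and 3) concrete, here is a minimal sketch on a toy array. Everything in it (the 4x4 array, the -99 sentinel, the fake prediction, the `rmse` helper) is made up for illustration and is not my real data. It only shows how the fill value alone can dominate an RMSE computed over all pixels, compared with an RMSE restricted to the pixels that actually have data.

```python
import numpy as np

# Toy 4x4 "image": the right half stands in for pixels outside the crop (NaN).
target = np.array([
    [0.2, 0.4, np.nan, np.nan],
    [0.3, 0.5, np.nan, np.nan],
    [0.1, 0.6, np.nan, np.nan],
    [0.2, 0.3, np.nan, np.nan],
])
valid = ~np.isnan(target)  # True where real data exists

# Option 1: fill with a constant such as 0 (itself a plausible data value).
filled_zero = np.where(valid, target, 0.0)

# Option 3: fill with a sentinel far outside the data range.
SENTINEL = -99.0  # arbitrary illustrative choice
filled_sentinel = np.where(valid, target, SENTINEL)

# Pretend prediction: slightly off on valid pixels, 0 everywhere else.
prediction = np.where(valid, target + 0.05, 0.0)


def rmse(pred, true, mask=None):
    """RMSE over all pixels, or only over pixels where mask is True."""
    err = (pred - true) ** 2
    if mask is not None:
        err = err[mask]
    return float(np.sqrt(err.mean()))


rmse_all_sentinel = rmse(prediction, filled_sentinel)       # dominated by the -99 pixels
rmse_valid_only = rmse(prediction, filled_sentinel, valid)  # only real pixels count
print(rmse_all_sentinel, rmse_valid_only)
```

In this toy case the error over all pixels is driven almost entirely by the sentinel, while the error over valid pixels reflects the actual prediction quality.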
My data is currently stored in NetCDF4 format. After loading it with xarray, I convert it to NumPy arrays and feed those to the PyTorch DataLoader class.
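Roughly, the pipeline looks like the sketch below. It is a simplified stand-in rather than my actual code: the dataset is synthetic (in practice it would come from something like `xr.open_dataset` on my files), the variable names `spi` and `ndvi`, the grid size, and the zero fill are assumptions for illustration, and each time step is treated as one sample instead of building proper sequences for the ConvLSTM.

```python
import numpy as np
import torch
import xarray as xr
from torch.utils.data import DataLoader, TensorDataset

# Synthetic stand-in for the real NetCDF data (hypothetical names and sizes).
n_time, n_lat, n_lon = 12, 8, 8
rng = np.random.default_rng(0)
spi_np = rng.normal(size=(n_time, n_lat, n_lon))
ndvi_np = rng.uniform(-1, 1, size=(n_time, n_lat, n_lon))

# Mimic the cropped region: columns 5 and beyond are missing.
spi_np[:, :, 5:] = np.nan
ndvi_np[:, :, 5:] = np.nan

dims = ("time", "lat", "lon")
ds = xr.Dataset({"spi": (dims, spi_np), "ndvi": (dims, ndvi_np)})

# xarray -> NumPy
spi = ds["spi"].values.astype(np.float32)    # (time, lat, lon)
ndvi = ds["ndvi"].values.astype(np.float32)

# Keep track of which pixels hold real data before filling anything.
valid = ~np.isnan(ndvi)

# NumPy -> torch, with a placeholder fill of 0 and an explicit channel axis.
x = torch.from_numpy(np.nan_to_num(spi, nan=0.0)).unsqueeze(1)   # (time, 1, lat, lon)
y = torch.from_numpy(np.nan_to_num(ndvi, nan=0.0)).unsqueeze(1)
m = torch.from_numpy(valid).unsqueeze(1)                         # boolean mask

loader = DataLoader(TensorDataset(x, y, m), batch_size=4, shuffle=False)
xb, yb, mb = next(iter(loader))
print(xb.shape, yb.shape, mb.shape)
```

The mask tensor is carried alongside the inputs only so that the filled pixels can still be told apart from real ones later on; whether and how to use it is exactly the open question.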
I would really appreciate hearing from anyone who has encountered this problem before. I have attached an example of my area of interest and how it looks after cropping the image.