How to deal with a cropped image when trying to predict NDVI using convolutional networks?

I am training a ConvLSTM model in PyTorch to predict vegetation condition (NDVI) from a precipitation series (standardized, e.g. SPI). I would like to ask for opinions on best practices for pre-processing the data before training the model.

Currently I have cropped my area of interest (the Horn of Africa), and after this operation a large fraction of the image contains missing values (NaN). Since a convolutional network expects inputs of a fixed pixel size (h × w), my images are not regular, and I need to deal with all the missing values generated by cropping. Potential solutions might be:

  1. imputing the missing input (precipitation) values with the median (e.g. 0) or nearest neighbour, or
  2. interpolating the nulls, as in the rioxarray example (Interpolate Missing Data — rioxarray 0.14.0 documentation),
  3. filling with a value that has no physical meaning and signals the null (e.g. -99), or
  4. leaving the image as it is, without cropping an area of interest.

The drawback of 1) is that the value 0 has a meaning in the NDVI series, whereas 2) artificially turns the values over the ocean into the NDVI values of the coastal area or the median NDVI. 3) makes the RMSE explode, whereas 4) leaves the images with their original values, but this might compromise the convolution operation for pixels next to the coast.
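One variant of option 1 that avoids its main drawback: fill the NaNs with 0 (or any constant) so the convolution can run, but compute the loss only over valid pixels, so the ocean never contributes to the gradient. A minimal sketch; `masked_mse` is a hypothetical helper written here for illustration, not part of PyTorch:

```python
import torch

def masked_mse(pred, target, mask):
    """MSE computed only over valid (non-NaN) pixels.

    pred, target: (batch, h, w) tensors; mask: (h, w) bool tensor,
    True where the pixel lies inside the area of interest.
    """
    diff = (pred - target)[..., mask]  # select valid pixels only
    return (diff ** 2).mean()

# toy example: a 4x4 grid where only the left half is valid ("land")
mask = torch.zeros(4, 4, dtype=torch.bool)
mask[:, :2] = True

pred = torch.rand(1, 4, 4)
target = torch.rand(1, 4, 4)
loss = masked_mse(pred, target, mask)  # ocean pixels are ignored
```

With this setup the constant used to fill the ocean only matters near the coast, where it leaks into the receptive field of valid pixels, which is the same issue your option 4 raises.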

I currently store my data in netCDF4 format; after loading it with xarray, I convert it to a NumPy array and feed it to the PyTorch DataLoader class.
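For reference, the NumPy-to-DataLoader step might look roughly like this; the array shapes and the `NdviDataset` name are assumptions for illustration, not your actual pipeline:

```python
import numpy as np
import torch
from torch.utils.data import Dataset, DataLoader

class NdviDataset(Dataset):
    """Wraps precomputed (input, target) NumPy arrays.

    x: (n_samples, seq_len, h, w) precipitation sequences
    y: (n_samples, h, w) NDVI targets
    Shapes are illustrative assumptions.
    """
    def __init__(self, x, y):
        self.x = torch.from_numpy(x).float()
        self.y = torch.from_numpy(y).float()

    def __len__(self):
        return len(self.x)

    def __getitem__(self, i):
        return self.x[i], self.y[i]

# synthetic stand-in for the arrays produced from xarray
x = np.random.rand(8, 6, 16, 16).astype(np.float32)
y = np.random.rand(8, 16, 16).astype(np.float32)

loader = DataLoader(NdviDataset(x, y), batch_size=4, shuffle=True)
xb, yb = next(iter(loader))  # one batch of sequences and targets
```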

I would really appreciate hearing from those who have encountered this problem before. I attach an example of my area of interest and how it looks after cropping the image.


This is actually the kind of problem I like working on. I usually work with time series with irregular gaps in space and time, and more recently I’ve been working on ocean data while trying to avoid the pitfalls of excluding near-coast points.

My intuition is that you generally want to avoid interpolation/"imputation" methods if you can. They can over- or under-represent the underlying information in ways that are hard to predict, among other potential issues. As far as pre-processing goes, I would do nothing special :)

However, we need to characterize exactly what kind of gappiness your data has. It sounds to me like your NaN mask is constant in time, is that correct?
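A quick way to verify that the NaN mask really is constant in time, sketched here with a synthetic `(time, h, w)` array; with xarray you would apply `np.isnan` to `da.values` in the same way:

```python
import numpy as np

# synthetic (time, h, w) stack with a fixed "ocean" block of NaNs
data = np.random.rand(5, 8, 8)
data[:, :3, :3] = np.nan

nan_mask = np.isnan(data)                     # (time, h, w) boolean
constant = bool((nan_mask == nan_mask[0]).all())  # True if the mask never changes
```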

If so, this problem is actually quite similar to the one I’m working on now, where I am working with a graph convolutional neural network. In your case, I think you would want to pose the problem like this: given a time series of graphs of a constant shape, can we learn some variable of interest?
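Under that framing, the grid-to-graph conversion might look roughly like this (a sketch using 4-neighbour connectivity and networkx; the actual code mentioned below may differ):

```python
import numpy as np
import networkx as nx

# valid-pixel mask: True where the pixel is "land" (illustrative)
mask = np.array([[True,  True,  False],
                 [True,  False, False],
                 [True,  True,  True]])

G = nx.Graph()
for (i, j), valid in np.ndenumerate(mask):
    if valid:
        G.add_node((i, j))
        # connect each valid pixel to its valid upper and left neighbours
        for ni, nj in ((i - 1, j), (i, j - 1)):
            if ni >= 0 and nj >= 0 and mask[ni, nj]:
                G.add_edge((i, j), (ni, nj))
```

NaN pixels simply never become nodes, so the graph has exactly one node per valid pixel and no ocean values to impute.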

If that sounds like an interesting path to you, I have some code to automatically convert xarray data to networkx graphs, which are easy to load into PyTorch AFAIK.


@Chris_Dupuis thanks for your suggestion. I hadn't thought about GNNs before, but it seems like a clever way of dealing with the problem. It would be awesome if you could share your code with me.