I’ve been working for a while on a project that analyses time series derived from satellite data to compute a set of indices. Unfortunately, because the algorithm is multi-spatial and multi-temporal, there is no way around processing every single pixel on its own and writing out a netCDF with multiple arrays.
In a few words, the dataset (read as an xarray DataArray) is processed row by row (to avoid memory problems): the client builds a list of valid pixels that need to be processed, then scatters the data and submits futures to the workers.
Each worker reads its assigned pixels from the scattered data, analyses the time series, and sends the results back as an object containing a set of indices (each organised as a pd.DataFrame).
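To make the pattern concrete, here is a minimal sketch of the scatter/submit loop. It uses the stdlib `concurrent.futures` as a stand-in for `dask.distributed` (where the row would be scattered once with `client.scatter` and each pixel submitted with `client.submit`); `process_pixel` and the result dict are hypothetical placeholders for my actual per-pixel analysis:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

# Hypothetical per-pixel analysis: the real code runs the
# multi-temporal algorithm and returns indices as pd.DataFrames.
def process_pixel(row_data, col):
    series = row_data[col]                  # this pixel's time series
    return {"col": col, "mean": sum(series) / len(series)}

def process_row(row_data, valid_cols):
    """Submit one future per valid pixel and collect results as they finish."""
    results = []
    # ThreadPoolExecutor stands in for the distributed cluster here.
    with ThreadPoolExecutor() as pool:
        futures = [pool.submit(process_pixel, row_data, c) for c in valid_cols]
        for fut in as_completed(futures):
            results.append(fut.result())
    return results
```

The point is only the shape of the pattern: one scattered row, one future per valid pixel, results gathered as they complete.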
Whenever results are ready, the master extracts the indices from the object and reassigns them to the proper cache DataFrame; finally, once the row has been completely analysed, the cache is flushed to a netCDF file.
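The master-side bookkeeping looks roughly like the sketch below: collect each pixel's indices into a row cache, and flush once every valid pixel has reported back. `RowCache` and the `flush` callback are hypothetical names; in the real code the flush appends the completed row to the netCDF file:

```python
# Hypothetical master-side cache for one row of the output grid.
class RowCache:
    def __init__(self, valid_cols, flush):
        self.pending = set(valid_cols)  # pixels still being processed
        self.values = {}                # col -> indices for this row
        self.flush = flush              # e.g. appends the row to the netCDF

    def add(self, col, indices):
        """Reassign one pixel's result; flush when the row is complete."""
        self.values[col] = indices
        self.pending.discard(col)
        if not self.pending:            # row fully analysed -> write it out
            self.flush(self.values)
            self.values = {}
```

Appending row by row is what keeps memory flat: the output file is never preallocated in memory, only one row's worth of results is held at a time.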
Just for clarification: everything is organised and processed using pandas DataFrames and Series, as I need to keep strict control over the time dimension. Faster approaches based on plain numpy were rejected because alignment over time can get quite messy without pandas’ help.
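A toy example (not from the project) of why I lean on pandas here: two acquisitions with gaps at different dates align automatically on their DatetimeIndex, which raw numpy arrays would force me to handle by hand:

```python
import pandas as pd

# Two time series with missing acquisitions at different dates.
a = pd.Series([1.0, 2.0, 3.0],
              index=pd.to_datetime(["2020-01-01", "2020-01-11", "2020-01-21"]))
b = pd.Series([10.0, 30.0],
              index=pd.to_datetime(["2020-01-01", "2020-01-21"]))

# pandas aligns on the union of the indices; unmatched dates become NaN.
ratio = b / a
```

With numpy I would have to build and maintain the date alignment myself for every pair of arrays; pandas does it implicitly in every arithmetic operation.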
On local machines the overall idea works pretty well: memory doesn’t grow indefinitely, output objects are quickly reassigned into the cache, which lets me append each row to the output (instead of creating a preallocated output file that doesn’t fit in memory), and the time dimension is handled properly.
Problems appear whenever I try to scale this approach up on clusters. The time between the end of the computation and the finalisation of the row in the netCDF is unacceptably long. All my attempts to locate the bottleneck have failed, and profiling the process locally doesn’t highlight anything that helps me on the cluster.
I’ve made many attempts on the infrastructure side, from an HPC cluster to Google Cloud to an internal deployment very similar to Pangeo, to rule out the infrastructure as the cause.
The bottleneck seems to be how I manage the results and how I assign them to the output.
What I would like to know is whether anyone out there has had the same needs, pixel-based analysis with multiple outputs, and is using futures to process them. I would like to compare experiences and perhaps be inspired to try other approaches, as mine doesn’t seem to be a winning one.