Compute time series for 70,000 locations (Speed up the processing)

Hello everyone,

We are currently computing time series for 20 different indices (e.g., NDVI, NDMI) for 70,000 locations worldwide using Sentinel-2 data (available in COG format on Microsoft) from 2015 to the present. Currently, it takes around 3 minutes to retrieve and compute the data for a single location. At this rate, processing all 70,000 locations would take approximately 145 days.

Has anyone encountered a similar issue when trying to retrieve data from many different points using Sentinel-2 data?

Some contexte :
Our workflow involves using STAC for data discovery, Stackstac for building xarray datasets, and xarray combined with Dask for computation. We then convert the final time series into a pandas DataFrame for each point.

If anyone has suggestions to speed up this process, they would be greatly appreciated! :star_struck:

1 Like

Do you have a rough sense for where time is being spent?

  1. Querying the STAC API (maybe the bulk access would be faster?)
  2. Loading the data from Blob Storage (is your compute in the same region as the data? Are you saturating the network on your workers?)
  3. Computing the indices
  4. Writing the output

compute the data for a single location

By ā€œsingle locationā€ do you mean essentially a single point? Whatā€™s the rough size of the data youā€™re loading per point, and are you sure youā€™re loading just the data necessary (ideally just a single window out of each COG)?

1 Like

Thank you for your question and feedback. I have thoroughly investigated each of your remarks.

When I refer to a ā€œsingle location,ā€ it could represent either a point or a very small polygon (approximately 5 meters in size). The approximate data size for a single point is 229.02 kiB (this includes time: 2255, band: 13, y: 1, and x: 1).

  1. Querying the STAC API

    Querying the STAC API is indeed time-consuming. The time it takes to retrieve items from the STAC API for one location, including all available data from 2015 to the present, is approximately 9.21 s Ā± 613 ms per loop.

Using Stackstac to create the xarray takes around 3.58 s, but the chunking appears to be suboptimal:

  1. Loading the Data from Blob Storage (Is your compute in the same region as the data? Are you saturating the network on your workers?)

I noticed that the S2 data is stored in the westeurope region on Azure. However, we are not using Azure for our VM, so we chose eu-west-1 on AWS. We can easily switch to a different region or even use GCP if needed.

If I perform a compute operation immediately after running Stackstac (to measure the time required to retrieve the data without any processing), it takes 2 min 51 s. This step consumes the majority of the time.

  1. Computing the Indices

To give a rough estimate, I perform the computation after retrieving the data with a .compute() call. If we call .compute() beforehand, the process is almost instantaneous. Otherwise, it takes nearly the same amount of time as the previous step: 2 min 52 s.

  1. Writing the Output

Writing the output is relatively fast, taking only 3 seconds. Although there is room for optimization, this is not the primary concern at the moment.

Hi @Basile_Goussard, I have some additional remarks:

  • Do you consider all the time steps or do you filter out cloudy scenes beforehand using the eo:cloud_cover property? If not applied already, this could save you some time, reducing the number of Items you need to load.
  • Are you running the computation in a serial manner? If you have more than 1 CPU you could easily parallelize the whole computation using libraries like joblib, multiprocessing or dask
  • You could also think about grouping the geometries based on their location and to which Sentinel-2 tile(s) they correspond. This would allow to reduce the number of STAC API calls and reuse the same xarray object to get the info about multiple points at once.
  • Finally, really important, are you considering the baseline processing change that happened in 2022? Otherwise you might see strange differences in the indices you compute comparing years before 2022 and afterwards. References:
1 Like

Perhaps @TomAugspurger point on Bulk download may be worth, depending on your points density.

I wonder how much searching (supposing that thereā€™s a search involved in the STAC backend, and how such search plan is defined) and parsing individually through the API would cost; vs having it parsed locally, perhaps with a mask (without search, using a list of these points), and then applying a TS function through xarray apply_ufunc in vectorized format.

Perhaps a test with a ā€œtoyā€ experiment, including some networking timing, would be a way to decide.

I completed a similar process earlier this summer, although with less points (around 3300). There are lots of good tips here, I will just add a couple little things I learned as a relative newcomer to this type of workflow when trying to make the process more efficient. Although it still took a minute or two per point, essentially doing separate STAC calls and loads in a for loop (did not know about any bulk option!). In some later processing I found it was not much more processing time to grab full tiles rather than single pixels to gather spatial information - so have been doing that with 60 x 60 km cubes instead of single pixel time-series.

  1. Setting up a Dask LocalCluster and specifying additional workers/threads sped up loading into memory by a lot.
  2. Increasing the limit parameter in catalog.search() sped up finding the HLS (in my case) imagery throug STAC a lot. I think this controls how many items can go into a single json ā€œpageā€ in the item output. I ended up on 100, since I found going higher led to intermittent errors.
  3. I ended up using a chunksize of (1, -1, -1, -1) (i.e., 1 time-step, all bands, full extent) since some of my processing ran on individual time-steps. Although this was more for my spatial outputs, for single pixel this may lead to a lot of small chunks.
  4. If doing any groupby operations (say merging same-day observations), I found that using stackstacs internal mosaic function (e.g., cube.groupby('time').apply(ss.mosaic...) was fastest. You may also consider using flox on regular groupbys (e.g., cube.groupby('time').mean(engine = 'flox').

Hopefully some of this is useful for you!

Hi, thank you very much for your detailed response.

As a first step, weā€™re indeed focusing on removing unused dataā€”specifically, data with more than 90% cloud cover based on the eo:cloud_cover property. After that, we apply a cloud mask (using the one from ESA SLC) and harmonize the Sentinel-2 data.

All our computations are based on Xarray and Dask (weā€™re huge fans of Coiled, which allows us to scale the process, though itā€™s still taking too longā€¦).

Based on some of your feedback, weā€™ve grouped the geometries by H3, then by tile :sweat_smile:. We then make a single call for the entire group, creating one (massive) xarray of over 300TB with stacstack. This approach helps reduce the number of calls, and afterward, weā€™re using xvec to process multiple points within the same xarray. Although we were quite confident in this new process, it seems to work only for fewer than 20 points on a single tile, even when using a cluster of around 100 VMs, each with 32GB of RAM (not exactly a small setup).

Hi,

Thanks for your feedback! This is extremely interesting.
Indeed, weā€™ve set up a Dask cluster (not a local one, but using Coiled to create it).
Weā€™ve also rechunked the dataā€”keeping small chunks was definitely a bad idea.
Great tip about the groupby, thanks! :smile:

I also ended up using dask.delayed on top of our process. It crashes, but it seems to be a nice approach instead of doing a for loop for each location.

We also tried creating a large xarray for each tile in France. Then, we applied a simple datacube.sel(x=list_x, y=list_y, method='nearest'), but this required an extremely large amount of RAM (more than our 100-VM, 32GB Dask cluster could handle).

Iā€™ve had good experiences with Daskā€™s P2P shuffling to reduce RAM usage. Might be worth trying by adjusting the config:

import dask

dask.config.set({"array.rechunk.method": "p2p"})
dask.config.set({"optimization.fuse.active": False})

Not sure if optimization.fuse.active needs to be False. I remember that it solved an issue some while ago and it was recommended in a GitHub issue. Might not be necessary anymore.

Here is a blog post and here a thread in the Pangeo Discourse about it.

Thanks for the note about dask.delayed! I have never used it but will add to my ToDos to try out and compare with my current approachā€¦