We are currently computing time series of 20 different indices (e.g., NDVI, NDMI) for 70,000 locations worldwide using Sentinel-2 data (available as COGs on Microsoft Azure) from 2015 to the present. At the moment it takes around 3 minutes to retrieve and compute the data for a single location; at this rate, processing all 70,000 locations would take approximately 145 days.
Has anyone encountered a similar issue when trying to retrieve data from many different points using Sentinel-2 data?
Some context:
Our workflow involves using STAC for data discovery, Stackstac for building xarray datasets, and xarray combined with Dask for computation. We then convert the final time series into a pandas DataFrame for each point.
If anyone has suggestions to speed up this process, they would be greatly appreciated!
Do you have a rough sense for where time is being spent?
Querying the STAC API (maybe the bulk access would be faster?)
Loading the data from Blob Storage (is your compute in the same region as the data? Are you saturating the network on your workers?)
Computing the indices
Writing the output
compute the data for a single location
By "single location" do you mean essentially a single point? What's the rough size of the data you're loading per point, and are you sure you're loading just the data necessary (ideally just a single window out of each COG)?
Thank you for your question and feedback. I have thoroughly investigated each of your remarks.
When I refer to a "single location," it could represent either a point or a very small polygon (approximately 5 meters in size). The approximate data size for a single point is 229.02 kiB (this includes time: 2255, band: 13, y: 1, and x: 1).
Querying the STAC API
Querying the STAC API is indeed time-consuming. The time it takes to retrieve items from the STAC API for one location, including all available data from 2015 to the present, is approximately 9.21 s ± 613 ms per loop.
Using Stackstac to create the xarray takes around 3.58 s, but the chunking appears to be suboptimal.
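For reference, the call looks roughly like this (a simplified sketch; `items`, `lon`, and `lat` are placeholders and the asset list is illustrative):

```python
import stackstac

# simplified sketch: restrict the stack to a tiny window around the point so
# that only the needed window of each COG is read (lon/lat are placeholders,
# and the asset list is illustrative)
cube = stackstac.stack(
    items,
    assets=["B04", "B08"],
    bounds_latlon=(lon - 1e-4, lat - 1e-4, lon + 1e-4, lat + 1e-4),
    resolution=10,
)
```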
Loading the Data from Blob Storage (Is your compute in the same region as the data? Are you saturating the network on your workers?)
I noticed that the S2 data is stored in the westeurope region on Azure. However, we are not using Azure for our VM, so we chose eu-west-1 on AWS. We can easily switch to a different region or even use GCP if needed.
If I perform a compute operation immediately after running Stackstac (to measure the time required to retrieve the data without any processing), it takes 2 min 51 s. This step consumes the majority of the time.
Computing the Indices
To give a rough estimate, I measured the index computation after retrieving the data with a .compute() call. If the data have already been loaded via .compute() beforehand, the index computation itself is almost instantaneous; otherwise it takes nearly the same amount of time as the previous step: 2 min 52 s.
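As an illustration of the index step itself, here is a sketch assuming the cube carries Sentinel-2 band names such as "B04" and "B08" (the actual set of indices we compute is broader):

```python
# sketch: NDVI from a stackstac cube with a "band" dimension
red = cube.sel(band="B04")
nir = cube.sel(band="B08")
ndvi = (nir - red) / (nir + red)

# .compute() triggers the actual data loading, which is where most time goes
ndvi_series = ndvi.compute()
```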
Writing the Output
Writing the output is relatively fast, taking only 3 seconds. Although there is room for optimization, this is not the primary concern at the moment.
Do you consider all the time steps or do you filter out cloudy scenes beforehand using the eo:cloud_cover property? If not applied already, this could save you some time, reducing the number of Items you need to load.
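For example, with pystac-client the cloud-cover filter can be pushed into the search itself (a sketch; the endpoint, threshold, and geometry variable are illustrative):

```python
from pystac_client import Client

catalog = Client.open("https://planetarycomputer.microsoft.com/api/stac/v1")

# sketch: only return items below a given cloud-cover threshold
search = catalog.search(
    collections=["sentinel-2-l2a"],
    intersects=point_geojson,            # placeholder geometry
    datetime="2015-01-01/..",            # open-ended range up to the present
    query={"eo:cloud_cover": {"lt": 90}},
)
items = search.item_collection()
```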
Are you running the computation serially? If you have more than one CPU, you could easily parallelize the whole computation using libraries like joblib, multiprocessing, or Dask.
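A sketch of the per-point parallelization (assuming a hypothetical `process_location` function that does the search/load/compute for one point):

```python
from joblib import Parallel, delayed

# hypothetical: process_location(point) runs the search/load/compute for one point
results = Parallel(n_jobs=-1)(
    delayed(process_location)(point) for point in locations
)
```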
You could also think about grouping the geometries based on their location and on which Sentinel-2 tile(s) they correspond to. This would allow you to reduce the number of STAC API calls and to reuse the same xarray object to get the information about multiple points at once.
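Something along these lines, for instance (a sketch that uses a coarse grid as a stand-in for the actual Sentinel-2 tile footprints; a proper tile lookup would be more precise):

```python
import numpy as np
import geopandas as gpd

# hypothetical GeoDataFrame of point geometries in EPSG:4326
points = gpd.read_file("locations.geojson")

# assign each point to a coarse ~1 degree cell as a rough proxy for its tile
points["cell"] = (
    np.floor(points.geometry.x).astype(int).astype(str)
    + "_"
    + np.floor(points.geometry.y).astype(int).astype(str)
)

for cell, group in points.groupby("cell"):
    bbox = group.total_bounds  # one STAC search / one stackstac cube per group
    ...
```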
Finally, and really importantly, are you accounting for the processing baseline change that happened in 2022? Otherwise you might see strange differences in the computed indices when comparing years before and after 2022.
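One common way to handle it is to shift the post-baseline scenes back onto the old DN scale (a sketch only; the cutoff date and offset correspond to processing baseline 04.00, but please double-check them for your data, and note this applies to the reflectance bands, not e.g. the SCL band):

```python
import xarray as xr

OFFSET = 1000  # BOA_ADD_OFFSET introduced with processing baseline 04.00

def harmonize_to_old(data: xr.DataArray) -> xr.DataArray:
    """Shift scenes acquired after the 2022 baseline change back to the old scale."""
    old = data.sel(time=slice(None, "2022-01-24"))
    new = data.sel(time=slice("2022-01-25", None)).clip(min=OFFSET) - OFFSET
    return xr.concat([old, new], dim="time")
```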
Perhaps @TomAugspurger's point about bulk download may be worth exploring, depending on your point density.
I wonder how much searching (assuming there's a search involved in the STAC backend, and depending on how the search plan is defined) and parsing individually through the API would cost, versus parsing the catalog locally, perhaps with a mask (no search, just using the list of these points), and then applying a time-series function through xarray's apply_ufunc in vectorized form.
Perhaps a test with a "toy" experiment, including some network timing, would be a way to decide.
I completed a similar process earlier this summer, although with fewer points (around 3,300). There are lots of good tips here; I will just add a couple of small things I learned as a relative newcomer to this type of workflow while trying to make the process more efficient. It still took a minute or two per point, essentially doing separate STAC calls and loads in a for loop (I did not know about any bulk option!). In some later processing I found it was not much more expensive to grab full tiles rather than single pixels to gather spatial information, so I have been doing that with 60 x 60 km cubes instead of single-pixel time series.
Setting up a Dask LocalCluster and specifying additional workers/threads sped up loading into memory by a lot.
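For anyone else new to this, a minimal setup is something like the following (worker and thread counts depend on your machine):

```python
from dask.distributed import Client, LocalCluster

# tune n_workers / threads_per_worker to the machine running the workflow
cluster = LocalCluster(n_workers=8, threads_per_worker=2)
client = Client(cluster)
```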
Increasing the limit parameter in catalog.search() sped up finding the HLS (in my case) imagery through STAC a lot. I think this controls how many items can go into a single JSON "page" of the item output. I ended up at 100, since I found going higher led to intermittent errors.
I ended up using a chunksize of (1, -1, -1, -1) (i.e., 1 time step, all bands, full spatial extent), since some of my processing ran on individual time steps. Although this suited my spatial outputs, for single-pixel work it may lead to a lot of small chunks.
If doing any groupby operations (say, merging same-day observations), I found that using stackstac's internal mosaic function (e.g., cube.groupby('time').apply(ss.mosaic, ...)) was fastest. You may also consider using flox on regular groupbys (e.g., cube.groupby('time').mean(engine='flox')).
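Roughly, the two patterns I mean look like this (a sketch; `cube` is the stackstac DataArray, and flox needs to be installed for the engine argument to take effect):

```python
import stackstac

# merge observations that share a timestamp (e.g. adjacent granules from the
# same overpass) into a single mosaicked scene
mosaicked = cube.groupby("time").map(stackstac.mosaic, dim="time")

# a regular groupby reduction, dispatched to flox when it is installed
mean_per_time = cube.groupby("time").mean(engine="flox")
```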
Hi, thank you very much for your detailed response.
As a first step, we're indeed focusing on removing unused data, specifically scenes with more than 90% cloud cover based on the eo:cloud_cover property. After that, we apply a cloud mask (using the ESA SCL band) and harmonize the Sentinel-2 data.
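For completeness, the masking step looks roughly like this (a sketch; the set of SCL classes treated as invalid is our choice and may differ for other use cases):

```python
# sketch: mask using the Scene Classification Layer (SCL) band
scl = cube.sel(band="SCL")

# classes treated as invalid here: 0 no data, 1 saturated/defective,
# 3 cloud shadow, 8/9 cloud medium/high probability, 10 thin cirrus
invalid = scl.isin([0, 1, 3, 8, 9, 10])
masked = cube.where(~invalid)
```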
All our computations are based on Xarray and Dask (we're huge fans of Coiled, which allows us to scale the process, though it's still taking too long…).
Based on some of your feedback, we've grouped the geometries by H3, then by tile. We then make a single call for the entire group, creating one (massive) xarray of over 300 TB with stackstac. This approach helps reduce the number of calls, and afterward we're using xvec to process multiple points within the same xarray. Although we were quite confident in this new process, it seems to work only for fewer than 20 points on a single tile, even when using a cluster of around 100 VMs, each with 32 GB of RAM (not exactly a small setup).
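The extraction step itself is along these lines (a sketch; `cube` is the tile-level stackstac array and the point list is a placeholder):

```python
import shapely
import xvec  # registers the .xvec accessor on xarray objects

# hypothetical list of shapely Points in the cube's CRS
points = [shapely.Point(x, y) for x, y in point_coords]

# extract every point from the tile-level cube in one vectorized call
series = cube.xvec.extract_points(points, x_coords="x", y_coords="y")
```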
Thanks for your feedback! This is extremely interesting.
Indeed, we've set up a Dask cluster (not a local one, but using Coiled to create it).
We've also rechunked the data; keeping small chunks was definitely a bad idea.
Great tip about the groupby, thanks!
I also ended up using dask.delayed on top of our process. It currently crashes, but it seems like a nicer approach than running a for loop over each location.
We also tried creating a large xarray for each tile in France. Then we applied a simple datacube.sel(x=list_x, y=list_y, method='nearest'), but this required an extremely large amount of RAM, more than our 100-VM, 32 GB Dask cluster could handle (plain list indexers select the full outer product of x and y rather than paired points, which likely contributes to the blow-up).
I am not sure whether optimization.fuse.active still needs to be False. I remember that it solved an issue a while ago and was recommended in a GitHub issue; it might not be necessary anymore.
Here is a blog post and here a thread in the Pangeo Discourse about it.
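For reference, the setting in question is a one-liner (whether it is still needed with recent Dask versions is unclear to me):

```python
import dask

# disable low-level task fusion, as suggested in the linked discussion
dask.config.set({"optimization.fuse.active": False})
```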
Hi,
Thanks a lot for your answer. Unfortunately, we decided to go with Earth Engine due to the timeline. We pushed the results into BigQuery for each pixel. Afterwards, we have a script that can rebuild an xarray from the BigQuery output. For now, it was the quickest solution (we retrieved all the data in one night).
@Patrick_Hoefler from Coiled helped us a lot during the process. It seems that using an Azure VM in the same region as the Sentinel-2 blob storage could reduce the time required by a factor of 4. However, we were unable to reactivate our Azure account and retry the process within the timeline.
In any case, we learned a lot! Some of our scripts were not fully Dask-compliant (not a big issue at first because we were processing around 500 GB, but definitely a problem when we started working with more than 100 TB). Moreover, I was really impressed by the capabilities of xvec.
Hi, just to add a bit more to this particular point, and following the xarray docs here: Indexing and selecting data, I had some success when selecting a large number of points by turning them into DataArrays along some other coordinate:
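Something like this (a sketch; `datacube`, `list_x`, and `list_y` are placeholders):

```python
import xarray as xr

# wrap the point coordinates in DataArrays sharing a "points" dimension
xs = xr.DataArray(list_x, dims="points")
ys = xr.DataArray(list_y, dims="points")

# pointwise (vectorized) selection: one pixel per (x, y) pair, instead of the
# full outer product that plain Python lists would produce
ts = datacube.sel(x=xs, y=ys, method="nearest")
```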