Hey all, I’m quite new to the geospatial world and currently working on a pipeline to store and process 10 m Sentinel-2 imagery over large geographic areas (e.g., the entire USA) into both daily mosaics and 4D patches for ML inference. This seems like a common task, but I haven’t found a clear overview of a modern Pangeo stack and best practices for such a workflow.
I sadly missed the recent showcase on “building scalable mosaic workflows” and was not able to find a recording.
Goals:
Generate daily mosaics over large areas and multi-year time periods → tens of TB of data
Allow fast reading of smaller 4D patches for ML inference; the shape is (bands, time, height, width). The patches have a small spatial extent (e.g., 2 x 2 km) but a large temporal dimension (often more than a year).
Fast parallel downloading, with fast read access to A) daily mosaics and B) inference patches
My current approach:
Querying STAC (Element84 on AWS) for Sentinel-2 tile metadata (bounding box of ROI + timespan)
Using a Python multiprocessing.Pool with xarray and rasterio to download/reproject/resample each band per item in parallel from the COGs stored on AWS, saving each as a .tif on local disk (see the sketch after this list)
Creating VRT mosaics per band/day by grouping items with gdalbuildvrt
Not yet creating any inference patches
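For reference, here is a rough sketch of what that pipeline currently looks like in code. Asset keys, the target CRS, the ROI, and all paths are placeholders I chose for illustration, not a definitive implementation:

```python
import subprocess
from multiprocessing import Pool
from pathlib import Path

import rioxarray
from pystac_client import Client

STAC_URL = "https://earth-search.aws.element84.com/v1"  # Element84 earth-search
BANDS = ["red", "green", "blue", "nir"]                  # asset keys are an assumption
OUT_DIR = Path("s2_tiles")


def download_band(args):
    """Download one band of one STAC item, reproject/resample, save as GeoTIFF."""
    item_id, band, href, date = args
    out_path = OUT_DIR / date / f"{item_id}_{band}.tif"
    out_path.parent.mkdir(parents=True, exist_ok=True)
    da = rioxarray.open_rasterio(href)
    da = da.rio.reproject("EPSG:5070", resolution=10)    # target CRS is a placeholder
    da.rio.to_raster(out_path)
    return out_path


if __name__ == "__main__":
    client = Client.open(STAC_URL)
    search = client.search(
        collections=["sentinel-2-l2a"],
        bbox=[-109.0, 37.0, -102.0, 41.0],               # ROI bounding box (placeholder)
        datetime="2023-06-01/2023-06-30",
    )
    items = list(search.items())

    tasks = [
        (item.id, band, item.assets[band].href, item.datetime.date().isoformat())
        for item in items
        for band in BANDS
    ]
    with Pool(processes=16) as pool:
        pool.map(download_band, tasks)

    # One VRT mosaic per band per day via gdalbuildvrt
    for day_dir in OUT_DIR.iterdir():
        for band in BANDS:
            tifs = sorted(str(p) for p in day_dir.glob(f"*_{band}.tif"))
            if tifs:
                subprocess.run(
                    ["gdalbuildvrt", str(day_dir / f"mosaic_{band}.vrt"), *tifs],
                    check=True,
                )
```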
Questions
Are there any well-established tools/pipelines for such a use case?
I’ve looked into stackstac and odc-stac but haven’t managed to set them up efficiently (possibly due to incorrect Dask configuration). I also wasn’t sure whether they are designed for such a use case with wide (spatial) and deep (temporal) data.
Should I precompute and store patches (e.g., one file per patch), or is there a storage format that enables efficient slicing of 4D patches directly from daily mosaics? (e.g., Zarr…)
Also, I have access to a large HPC cluster (100+ cores, >1TB RAM), so compute/memory aren’t that limiting.
Any advice or examples would be super helpful! Thanks!
I sadly missed the recent showcase on “building scalable mosaic workflows” and was not able to find a recording.
Check the other thread again. The recording is available.
Should I precompute and store patches (e.g., one file per patch), or is there a storage format that enables efficient slicing of 4D patches directly from daily mosaics? (e.g., Zarr…)
Maybe xbatcher could be useful for your goal of creating 4D patches for inference. And yes, I’d say Zarr is the storage format to go for.
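To make that concrete, here is a minimal sketch of how xbatcher could slice fixed-size patches out of a Zarr-backed datacube. The store path, the variable name, and the dimension names/sizes are assumptions based on your description (2 x 2 km at 10 m = 200 x 200 pixels):

```python
import xarray as xr
import xbatcher

# Open the (hypothetical) daily-mosaic datacube lazily from Zarr
ds = xr.open_zarr("s2_daily_mosaics.zarr")  # dims assumed: (time, band, y, x)

# 200 x 200 pixel patches, keeping the full time and band depth in every patch
bgen = xbatcher.BatchGenerator(
    ds,
    input_dims={"y": 200, "x": 200},
    preload_batch=False,  # keep batches lazy until the model actually needs them
)

for patch in bgen:
    # patch is an xarray.Dataset with dims (time, band, 200, 200)
    arr = patch["reflectance"].transpose("band", "time", "y", "x").values
    # ... feed arr into the ML model ...
    break
```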
I’ve looked into stackstac and odc-stac but haven’t managed to set them up efficiently (possibly due to incorrect Dask configuration). I also wasn’t sure whether they are designed for such a use case with wide (spatial) and deep (temporal) data.
Here is a discussion about the differences between the two libraries. If you share a code example, I’m sure people here would be happy to point out issues related to the inefficient setup you’re describing.
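For example, a minimal odc-stac setup looks roughly like this; the collection name and band aliases follow the Element84 earth-search catalog and the ROI/CRS are placeholders, so double-check them against your own query:

```python
import odc.stac
from pystac_client import Client

client = Client.open("https://earth-search.aws.element84.com/v1")
items = client.search(
    collections=["sentinel-2-l2a"],
    bbox=[-105.0, 39.5, -104.5, 40.0],          # a small test ROI
    datetime="2023-01-01/2023-12-31",
).item_collection()

# Lazily assemble a (time, band, y, x) datacube; `chunks` controls the Dask layout
ds = odc.stac.load(
    items,
    bands=["red", "green", "blue", "nir"],      # asset aliases are an assumption
    crs="EPSG:32613",
    resolution=10,
    chunks={"time": 1, "x": 2048, "y": 2048},
    groupby="solar_day",                        # one time slice per acquisition day
)
print(ds)
```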
I believe there may be some useful info in this other Pangeo post, HLS time series using xarray best practices (mostly about common issues with processing long time series over large extents of RS data).
In short, you may need to exploit some structure of the data to process it more efficiently and avoid handing Dask a nightmare graph. For instance, you could process it per Sentinel-2 MGRS (Military Grid Reference System) tile. Also, Sentinel-2 has an average revisit time of about 5 days, so daily mosaics will contain a lot of NaNs; depending on your end requirements, you can make your processing better match the underlying structure of the data.
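One way to exploit that structure is to group the STAC items per MGRS tile before building any Dask graph, and process each tile independently. The property holding the tile ID differs between catalogs, so treat `grid:code` below as an assumption for earth-search:

```python
from collections import defaultdict

from pystac_client import Client

client = Client.open("https://earth-search.aws.element84.com/v1")
items = client.search(
    collections=["sentinel-2-l2a"],
    bbox=[-110.0, 35.0, -100.0, 45.0],
    datetime="2023-01-01/2023-12-31",
).item_collection()

# Bucket items by MGRS tile so each tile becomes a small, well-shaped workload
tiles = defaultdict(list)
for item in items:
    tile_id = item.properties.get("grid:code")   # assumed property name
    tiles[tile_id].append(item)

for tile_id, tile_items in tiles.items():
    # Each tile shares a single UTM CRS and footprint, which keeps the Dask graph
    # small instead of reprojecting and mosaicking everything at once
    print(tile_id, len(tile_items))
    # ... odc.stac.load(tile_items, ...) and write this tile's region to Zarr ...
```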
This blog post, video, and repo from the Earthmover team outline a scalable pipeline for building a Zarr datacube from S2 data, very similar to what you are describing. Once you have this datacube, xbatcher would be well suited for serving the ML-ready patches (here is another Earthmover blog that demos this).
The only major change is that the Earthmover demo used serverless runners (Lithops, Modal, Coiled) to scale the workload, not Dask (which would be better suited to your HPC compute). I have a fork of this repo that uses Dask for scaling. It is not production-ready, but it would get you most of the way. This is the relevant file for setting up the Dask runner.
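If your HPC cluster runs a batch scheduler, dask-jobqueue is the usual way to stand up that Dask runner. A minimal SLURM sketch, where the partition name, per-worker resources, and worker count are all placeholders to adapt to your cluster:

```python
from dask.distributed import Client
from dask_jobqueue import SLURMCluster

# Each SLURM job becomes one Dask worker; scale() submits as many jobs as requested
cluster = SLURMCluster(
    queue="normal",            # placeholder partition name
    cores=16,
    memory="64GB",
    walltime="04:00:00",
)
cluster.scale(jobs=8)          # 8 workers x 16 cores = 128 cores total

client = Client(cluster)
print(client.dashboard_link)   # monitor the graph while the cube is being built
```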
Those are some great resources, thanks a lot!
I will explore @glennwithtwons’s Dask-adapted Earthmover demo in combination with xbatcher, which sounds like a promising way to go, and post an update on my final implementation for reference.
However, I am still a bit unsure about how to best organize and store the data:
I expect to frequently load data in two shapes and need efficient reads:
A: single-day national-scale mosaics (no additional time dimension); the data is ingested into the Zarr store as single-day mosaics.
B: ML-ready inference patches of e.g. 5 km x 5 km x 365 days
I have never worked with Zarr before, so I am unsure whether I can optimize for these two different access patterns in one store. My intuition would be to always take the smallest required size in each dimension for chunking, so that no unnecessary data is loaded and data loading does not become a bottleneck for inference; chunks would then be roughly (1 day x 5 km x 5 km). Do you have any recommendations on how to define chunks (and possibly shards) when two such different access patterns are required? Is Zarr the way to go, or might there be a more suitable option? I read through quite a few Zarr-related threads but was unable to find a similar example. Thanks!
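Not an authoritative answer, but as a starting point, here is roughly how one might write the cube with chunks that compromise between the two access patterns. The chunk sizes, dimension names, and store paths are assumptions to tune against your own benchmarks:

```python
import xarray as xr

# `ds` is the assembled (time, band, y, x) datacube from odc-stac / the Earthmover pipeline
ds = xr.open_zarr("s2_staging.zarr")          # placeholder source store

# One day per chunk keeps pattern A (single-day mosaics) a purely spatial read,
# while a few-hundred-pixel spatial chunk keeps pattern B (small patches, long
# time series) from dragging in whole national mosaics for every time step.
chunked = ds.chunk({"time": 1, "band": -1, "y": 512, "x": 512})

chunked.to_zarr(
    "s2_daily_mosaics.zarr",
    mode="w",
    consolidated=True,
)

# Reading pattern B later: a 5 km x 5 km x 365 day patch (500 x 500 pixels at 10 m)
# only touches a small neighbourhood of 512 x 512 spatial chunks per day.
patch = (
    xr.open_zarr("s2_daily_mosaics.zarr")
    .isel(x=slice(0, 500), y=slice(0, 500))
    .sel(time=slice("2023-01-01", "2023-12-31"))
)
```

One caveat: at tens of TB, chunks this small mean a very large number of objects; if your zarr-python/xarray versions support Zarr v3 sharding, grouping many small chunks into larger shard files is the usual way to keep the object count manageable, but I would benchmark both access patterns rather than take these numbers as given.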