Best practices for automating large-scale Sentinel dataset building and machine learning?

Last year, one of our projects was this:- [Copa de Cobre | Unearthed]

Basically: take a Sentinel satellite mosaic and make a geological map out of it, generally with machine learning.

“So the scale of the problem: our Sentinel-2 mosaic of the geologically interesting part of Peru is 132572 columns by 157588 rows by 10 spectral bands, each band a 2-byte spectral reflectance from the deep blue to the shortwave infrared. That is around half a terabyte. We have similar-scale problems with states of Australia, which are of similar size, or multiples of it.”
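A quick back-of-the-envelope check of that figure (dimensions are from the quote above; the 2 bytes per value assumes uint16 reflectance):

```python
# Rough size of the Peru Sentinel-2 mosaic described above.
cols, rows, bands = 132572, 157588, 10
bytes_per_value = 2  # uint16 reflectance

total_bytes = cols * rows * bands * bytes_per_value
total_tb = total_bytes / 1e12
print(f"{total_tb:.2f} TB")  # 0.42 TB, i.e. "around half a terabyte"
```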

I would be interested in people's thoughts on developing a less laborious workflow, so that analysts/consultants have more time to spend on investigation, modelling, and science, as opposed to data wrangling.

Starting with a bounding box, for example for South Australia (or a polygon outline), and having the data automatically retrieved, e.g. to S3, to some specification: most cloud-free, most vegetation-free, or whatever else might be feasible - those being two obvious criteria for geology. "Tell it to start Friday and see where you are at Monday" type of thing.
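A first step along those lines might be building the search parameters you would hand to a STAC client such as pystac-client. A minimal sketch - the bounding box below is my rough approximation of South Australia, and the cloud-cover threshold and date range are arbitrary assumptions, not from the post:

```python
def build_sentinel2_search(bbox, max_cloud_pct=10, date_range="2020-01-01/2020-12-31"):
    """Build kwargs for a STAC API search for Sentinel-2 L2A scenes.

    These match the pystac-client Client.search() parameters, e.g.:
        catalog = pystac_client.Client.open("https://earth-search.aws.element84.com/v1")
        items = catalog.search(**build_sentinel2_search(bbox)).items()
    """
    return {
        "collections": ["sentinel-2-l2a"],
        "bbox": bbox,                      # [min_lon, min_lat, max_lon, max_lat]
        "datetime": date_range,
        "query": {"eo:cloud_cover": {"lt": max_cloud_pct}},
    }

# Approximate bounding box for South Australia (an assumption for illustration).
sa_bbox = [129.0, -38.1, 141.0, -26.0]
params = build_sentinel2_search(sa_bbox, max_cloud_pct=5)
print(params["query"])
```

From there, the "run it over the weekend" part is a loop over the returned items, fetching the assets you want to S3.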

Then you end up with raster data that becomes machine learning data. Which is maybe OK at this scale - but not necessarily if we wanted to work on all of Brazil. Mentioning this sort of thing to AWS data scientists/engineers has caused their eyes to widen too, even when referring back to Pangeo.

If you make all this Zarr, for example - Zarr in what form, a store? Given the numbers above, a standard Zarr dataset can mean many millions of objects, which is not ideal when object-store listings return at most 1000 keys per request. What is the best chunk sizing?
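To make that trade-off concrete, here is a small sketch counting how many chunk objects a Zarr store would hold and how big each would be (array dimensions are the Peru mosaic from above; the two chunk shapes are arbitrary examples of my own):

```python
import math

def zarr_chunk_stats(shape, chunks, bytes_per_value=2):
    """Number of chunk objects and uncompressed bytes per chunk for a Zarr array."""
    n_chunks = 1
    for dim, chunk in zip(shape, chunks):
        n_chunks *= math.ceil(dim / chunk)
    chunk_bytes = bytes_per_value
    for c in chunks:
        chunk_bytes *= c
    return n_chunks, chunk_bytes

shape = (10, 157588, 132572)  # (band, y, x) for the Peru mosaic

# Small chunks: manageable per-task memory, but a lot of objects in the store.
n, size = zarr_chunk_stats(shape, (1, 1000, 1000))
print(n, size / 1e6)   # ~210,000 objects of 2 MB each

# Larger chunks spanning all bands: ~5,000 objects of ~84 MB each.
n, size = zarr_chunk_stats(shape, (10, 2048, 2048))
print(n, size / 1e6)
```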

Therefore, suggestions for training and inference, please? The latter matters if geologists want high-resolution maps - for example, at the 10m scale of some Sentinel bands it becomes a multibillion prediction problem, heavy in both time and compute.
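A rough sense of that inference scale, using the Peru mosaic dimensions from above (the throughput figure is an arbitrary assumption of mine, purely for illustration):

```python
pixels = 132572 * 157588          # one prediction per 10 m pixel
print(f"{pixels / 1e9:.1f} billion predictions")  # ~20.9 billion

# At an assumed 100,000 predictions/second on one worker:
rate = 100_000
hours_single = pixels / rate / 3600
print(f"{hours_single:.0f} hours on one worker, "
      f"{hours_single / 300:.1f} hours on 300 workers")
```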

Can you do all that in object storage without too big an overhead penalty? Minor delays multiplied by several hundred billion is not good.
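That "minor delays" worry can also be quantified. The per-request latency below is a typical ballpark for an S3 GET's time to first byte, not a measured figure:

```python
# Cumulative cost of per-object latency when reading many chunks from object storage.
latency_s = 0.02          # assumed ~20 ms per GET (ballpark, not measured)

for n_objects in (5_000, 200_000, 2_000_000):
    serial_hours = n_objects * latency_s / 3600
    print(f"{n_objects:>9} objects: {serial_hours:6.2f} h of pure latency if serial")
# Which is why cloud-optimised readers issue many concurrent requests,
# and why the total object count matters so much.
```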

Anyone been working on multibillion scale clustering problems?

Thanks very much,


  1. Find a nice STAC catalogue, hopefully
  2. Extract metadata - work out what to do with multiple CRSs

add a dimension equal to the projection? zone = 53 etc.


- a Gawler Craton example - one band, and I haven't bothered doing a UTM zone split, as it's just an illustration


:star_struck: @bluetyson that is a big dataset!


Yes - and the whole of South Australia would be a bunch bigger. So I need to work out a strategy, which is presumably going to require some sort of cluster with "need more memory, get more" elasticity.

Yah that’s a ton of data.

We have a similar source of data: a high-resolution climate model where each snapshot of the model state is a few TB, but we never touch the full-resolution data and always work with spatially downsampled data. The full-res data is not even saved to disk, since we do the coarse-graining online.
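That "coarse-graining online" step is essentially block-averaging before anything hits disk. A minimal NumPy sketch of the idea - the factor of 4 and the toy field are my own illustration; with xarray the equivalent is `da.coarsen(x=4, y=4).mean()`:

```python
import numpy as np

def coarsen_mean(field, factor):
    """Spatially downsample a 2-D field by averaging factor x factor blocks."""
    ny, nx = field.shape
    assert ny % factor == 0 and nx % factor == 0, "trim edges first"
    return field.reshape(ny // factor, factor, nx // factor, factor).mean(axis=(1, 3))

full = np.arange(64, dtype=float).reshape(8, 8)   # stand-in for one model snapshot
coarse = coarsen_mean(full, 4)
print(coarse.shape)   # (2, 2): 16x less data to store
```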

Yes, the Peru median data source we put together is around 0.5 TB.

What solution do you use for working with the high-resolution data?


  • 10000*10000 (40-time median) happily works on my laptop (my work one, which is admittedly pretty good).

Sentinel blue band for a local area.

Here’s the example:- dask-era5/PANGEO-Sentinel-Intake-SA-OZLocation.ipynb at main · RichardScottOZ/dask-era5 · GitHub

Came across this while browsing today - a different use case, I think:

A COG In The Machine - Using Cloud Optimised GeoTiffs to Query 24 Billion Pixels In Real-Time - FOSS4G Bucharest 2019 (TIB AV-Portal)


Tomas kindly sent me a message and pointed me to this blog:- PostGIS Raster and Crunchy Bridge

This one from @zflamig: A 5TB Dask example from AWS rainfall data - Jupyter Notebook Viewer

I ran a couple of tests thanks to @zflamig having a great CloudFormation template that spins up a Jupyter/Fargate cluster combination at zflamig/dask-era5,

and here for a Dockerfile to tweak the workers: zflamig/dask-worker - e.g. I added intake, intake-stac, sat-search, geopandas, and rioxarray - which I also installed in the base environment the template created, once I realised that it was conda_daskpy3 in the list from SageMaker in JupyterLab.

Still using basically @scottyhq's COG best-practices algorithm: go through a STAC catalogue, add appropriate scenes to a list, then compute the list and concatenate.

I trialled up to 300 workers (8GB RAM / 1 CPU each). Will see what it costs before trying orders of magnitude higher.
Tested on SA UTM zone 52 - 116 scenes, approximately 100000*50000 pixels for one 10m resolution band.

Ballpark Lambda calculations by some locals suggest it would be very expensive in comparison so far, I think, plus you have to write the code - although there is a geolambda repository that may help there.

Probably still a case of needing expert technology advice to get there.

e.g. here :- dask-era5/pangeoFargateTrialSA.ipynb at main · RichardScottOZ/dask-era5 · GitHub

Vincent Sarago from DevelopmentSeed sent me this on twitter this morning :- COG Talk 4ter — Distributed processes | by Vincent Sarago | Development Seed | Medium


Hi Richard,

I don’t think I can really advance your question, but I wanted to at least say hello. :wink:

I’m fairly new to ML on geospatial data so have been getting to grips with xarray and dask in that context over the past few months. I’ve cobbled together one workflow that takes Sentinel tiles as xarray datasets, stacks them, and then applies the sklearn model via a call to xarray’s apply_ufunc. This outputs class probabilities and labels which are saved to a netCDF file (when you call the compute method on the delayed writer). I won’t claim it’s pretty and I’ve had to juggle chunk sizes to manage constraints on the scheduler and workers (I have a single machine - I know, I know), but it’s enabled me to process complete Sentinel tiles. Not really what you’re asking about, I think, but I look forward to more discussions with you and others on here.
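The core of that workflow is the function handed to `apply_ufunc`: reshape a (band, y, x) block into (pixel, feature) rows, call the model's `predict`, and reshape back. A minimal NumPy sketch of that kernel - `predict_block` and the dummy threshold model are my own names, standing in for the fitted sklearn classifier:

```python
import numpy as np

def predict_block(block, model):
    """Per-block kernel of the kind xarray.apply_ufunc maps over chunks.

    block: (band, y, x) array of reflectances -> (y, x) array of class labels.
    """
    n_band, ny, nx = block.shape
    features = block.reshape(n_band, -1).T          # (pixels, bands), sklearn layout
    labels = model.predict(features)                # sklearn-style predict()
    return labels.reshape(ny, nx)

class ThresholdModel:
    """Dummy stand-in for a fitted sklearn classifier."""
    def predict(self, X):
        return (X.mean(axis=1) > 0.5).astype(int)

block = np.random.default_rng(0).random((4, 32, 32))  # 4 bands, 32x32 pixels
labels = predict_block(block, ThresholdModel())
print(labels.shape)  # (32, 32) - one label per pixel
```

With dask-backed DataArrays, it is this per-chunk function that runs lazily until the delayed writer's compute is called.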



Hi Guy, that sounds good! Be interesting to hear your thoughts on chunk sizes. What are you modelling? This sounds pretty useful for doing prediction, definitely.

Ah, chunk-size rightsizing has been “fun”. My workflow's divided into two main parts: the training-data intersection, and then the model building and prediction. The training-data intersection isn't too bad (he says, glossing over what it took to get to that point) and I just run a “default” 1000x1000 chunk size (not chunking in the variables direction). The prediction stage is a bit more fiddly, and I drop that down to 500x500 or else I run out of memory.

But this is a balancing act: too-small chunks give rise to too many tasks and give the scheduler a migraine. On that front, I'm assuming there is a limit to the area that can be processed (in general, using xarray and dask), because you either need more memory on the workers for larger chunks, or more memory/work on the scheduler to manage more tasks. But on my single machine I can just about do a whole S2 tile at a time.
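The memory side of that balancing act is easy to estimate. A sketch - the variable count and float64 dtype are assumptions for illustration; the 10980-pixel tile width is the standard Sentinel-2 tile size at 10m:

```python
import math

def chunk_footprint(y, x, n_vars=10, bytes_per_value=8):
    """Approximate in-memory bytes for one (y, x) chunk across all variables."""
    return y * x * n_vars * bytes_per_value

# Sentinel-2 tiles are 10980x10980 pixels at 10 m resolution.
for side in (1000, 500):
    mb = chunk_footprint(side, side) / 1e6
    n_chunks = math.ceil(10980 / side) ** 2
    print(f"{side}x{side}: ~{mb:.0f} MB per chunk, {n_chunks} chunks per tile")
```

So halving the chunk side quarters the per-chunk memory but quadruples the task count - the worker-memory versus scheduler-load trade-off described above.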

As for what I’m modelling, whatever the job in hand. I’ve mostly been doing some crop ID work (so the ground truth takes the form of relatively sparse, isolated fields). But we also work in the forestry sector and infrastructure.


For one machine testing:-

On a 128GB/12-core portable workstation I could do one Sentinel area at once - 2048*2048 chunks for a 40-datetime series of 4 10m bands. So 10K * 10K.

On a 768GB machine I could do one 20m band for a quarter of South Australia, for one UTM zone - something like 30K*50K, ballpark.