Best Practices for automating large scale Sentinel dataset building and Machine Learning?

RichardScottOZ · January 15, 2021, 6:05am

Last year, one of our projects was this:- [Copa de Cobre | Unearthed]

Basically, take a Sentinel satellite Mosaic and make a geological map out of it, generally with Machine Learning.

“So the scale of the problem: At 4 bytes to a word, our Sentinel-2 mosaic of the geologically interesting part of Peru is 132572 columns by 157588 rows by 10 spectral bands, each a 2-byte spectral reflectance from the deep blue to the shortwave infrared. Or around half a terabyte. Have similar scale problems with states of Australia, being of similar size, or multiples of.”.

I would be interested in people thoughts on developing a less laborious workflow, so that the analysts/consultants have more time to spend on investigation and modelling and science as opposed to data wrangling.

Starting with a bounding box for example, for South Australia (or a polygon outline) and having the data automatically retrieved, e.g. to S3. To some specification to start with, most cloud free, most vegetation free or whatever else might be feasible, those being two obvious things for geology. Tell it to start Friday and see where you are at Monday type of things.

Then you end up with raster data that becomes machine learning data. Which is maybe ok at this level - but not necessarily if we wanted to work on all of Brazil. Mentioning this sort of thing to AWS data scientists/engineers has caused their eyes to widen, too - even referring back to Pangeo.

If you make all this Zarr, for example - Zarr in what format - a store? Given the numbers above can be many millions of files as a standard Zarr dataset, which is not ideal at max 1000 keys. Best sizing?

Therefore, suggestions for training and inference, please?. The latter, if geologists want high resolution maps - for example, the 10m scale of some Sentinel data is a multibillion prediction problem. Time and computing heavy.

Can you do all that in object storage without too big an overhead penalty? Minor delays * several hundred billion isn’t good.

Anyone been working on multibillion scale clustering problems?

Thanks very much,

Richard

bluetyson · January 18, 2021, 1:26am

Find a nice STAC catalogue hopefully

bluetyson · January 18, 2021, 2:43am

Extract metadata - work out what to do with multiple CRS

add a dimension equal to the projection? zone = 53 etc.

bluetyson · January 19, 2021, 1:34am

- a Gawler Craton example - one band and I haven’t bothered doing a utmzone split, as just an illustration

rabernat · January 19, 2021, 3:51am

@bluetyson that is a big dataset!

RichardScottOZ · January 19, 2021, 4:02am

Yes - and South Australia entire would be a bunch bigger. So need to work out a strategy which is presumably going to require some sort of cluster of ‘need more memory get more’ capability.

nbren12 · January 20, 2021, 4:24am

Yah that’s a ton of data.

We have a similar source of data…a high resolution model climate model where each snapshot of the model state is a few TBs, but we never touch the full resolution data, and always work with spatially downsampled data. The full res data is not even saved to disk, since we do the coarse-graining online.

RichardScottOZ · January 20, 2021, 4:47am

Yes, the Peru median datasource we put together is around 0.5TB.

What solution do you use for the working with the high resolution?

RichardScottOZ · January 21, 2021, 1:21am

10000*10000 (40 time median) happily works on my laptop (work one, admittedly pretty good).

Sentinel blue band for local area.

Here’s the example:- dask-era5/PANGEO-Sentinel-Intake-SA-OZLocation.ipynb at main · RichardScottOZ/dask-era5 · GitHub

RichardScottOZ · January 21, 2021, 5:56am

Came across this while browsing today - different use cases I think

RichardScottOZ · January 21, 2021, 8:22am

A COG In The Machine - Using Cloud Optimised GeoTiffs to Query 24 Billion Pixels In Real-Time - TIB AV-Portal - A COG In The Machine - Using Cloud Optimised GeoTiffs to Query 24 Billion Pixels In Real-Time - FOSS4G Bucharest 2019

bluetyson · January 21, 2021, 10:55pm

Tomas kindly sent me a message and pointed me to this blog:- PostGIS Raster and Crunchy Bridge

bluetyson · January 21, 2021, 10:56pm

This one from @zflamig A 5TB Dask example from AWS - rainfall data Jupyter Notebook Viewer

bluetyson · January 22, 2021, 6:22am

I ran a couple of tests thanks to @zflamig having a great cloud formation template that spins up a Jupyter/Fargate Cluster combination at zflamig/dask-era5 (github.com)

and here for a Dockerfile to tweak the workers zflamig/dask-worker (github.com) - e.g. added intake, intake-stac, sat-search, geopands and rioxarray - which I also installed in the base environment the template created, once I realised that one was conda_daskpy3 in the list from SageMaker in the JupyterLab.

Still using basically @scottyhq COG best practices algorithm - go through a stac catalog, add appropriate scenes to a list, compute list and concatenate.

I trialled up to 300 workers (8GB/1CPU). Will see what costs before trying orders of magnitude higher.
Tested on SA UTM zone 52 - 116 scenes 100000*50000 approx for one 10m resolution band.

Ballpark lambda calculations by some locals = very expensive so far I think in comparison, plus writing the code, although there is a geolambda repository that may help there.

Probably still a case of expert technology advice to get there.

RichardScottOZ · January 22, 2021, 6:48am

e.g. here :- dask-era5/pangeoFargateTrialSA.ipynb at main · RichardScottOZ/dask-era5 · GitHub

RichardScottOZ · January 26, 2021, 1:10am

Vincent Sarago from DevelopmentSeed sent me this on twitter this morning :- COG Talk 4ter — Distributed processes | by Vincent Sarago | Development Seed | Medium

Guy_Maskall · January 27, 2021, 12:53pm

Hi Richard,

I don’t think I can really advance your question, but I wanted to at least say hello.

I’m fairly new to ML on geospatial data so have been getting to grips with xarray and dask in that context over the past few months. I’ve cobbled together one workflow that takes Sentinel tiles as xarray datasets, stacks them, and then applies the sklearn model via a call to xarray’s apply_ufunc. This outputs class probabilities and labels which are saved to a netCDF file (when you call the compute method on the delayed writer). I won’t claim it’s pretty and I’ve had to juggle chunk sizes to manage constraints on the scheduler and workers (I have a single machine - I know, I know), but it’s enabled me to process complete Sentinel tiles. Not really what you’re asking about, I think, but I look forward to more discussions with you and others on here.

Regards,
Guy

bluetyson · January 27, 2021, 7:53pm

Hi Guy, that sounds good! Be interesting to hear your thoughts on chunk sizes. What are you modelling? This sounds pretty useful for doing prediction, definitely.

Guy_Maskall · January 28, 2021, 9:48am

Ah, chunk size rightsizing has been “fun”. My workflow’s divided into two main parts. There’s the training data intersection and then there’s the model building and prediction. The training data intersection isn’t too bad (he says glossing over what it took to get to that point) and I just run a “default” 1000x1000 chunksize (not chunking in the variables direction). The prediction stage is a bit more fiddly and I drop that down to 500x500 else I run out of memory. But this is a balancing act to avoid too small chunks giving rise to too many tasks and giving the scheduler a migraine. On that front, I’m assuming that there is a limit to the area that can be processed (in general using xarray and dask) because you either need more memory on the workers for larger chunks, or else more memory/work for the scheduler to manage more tasks. But on my single machine I can just about do a whole S2 tile at a time.

As for what I’m modelling, whatever the job in hand. I’ve mostly been doing some crop ID work (so the ground truth takes the form of relatively sparse, isolated fields). But we also work in the forestry sector and infrastructure.

RichardScottOZ · January 28, 2021, 10:01pm

Right.

For one machine testing:-

On a 128/12 portable workstation I could do one Sentinel area - 2048*2048 chunks for a 40 datetime series for 4 10m bands at once. So 10K * 10K.

On a 768GB machine I could do one 20m band for 1/4 of South Australia for one UTM zone. Something like 30K*50K ballpark.

Topic		Replies	Views
Running Pangeo with Sentinel 2 datasets on google cloud Data	12	1723	November 6, 2020
Best practices for large scale (Sentinel-2) mosaics and 4D ML patches using Pangeo tools HPC	4	197	March 22, 2025
Intersecting shapefiles and raster data at scale - good design patterns? Science	12	4706	January 30, 2021
Tables, (x)arrays, and rasters¶	18	2940	November 15, 2022
Data catalogs (and a bit of data engineering such as datacube, STACs) and Google earth engine Data	7	1002	May 8, 2023

Best Practices for automating large scale Sentinel dataset building and Machine Learning?

Related topics