Best Practices for automating large-scale Sentinel dataset building and Machine Learning?

A 2TB machine couldn’t do 179 time steps of ~100K x 50K pixels at 2048,2048 chunks, or 50K x 25K over South Australia

Looking at dividing SA into 4 and taking the top left quadrant

It could do 16 time steps of ~40K x 30K at 2048,2048 chunks, 4 bands at 10m (cloudless only - which would be missing a lot)

It couldn’t do 147 time steps of ~60K x 50K at 2048,2048 chunks, 4 bands at 10m - i.e. most of what was available for that quarter of the state in the free COGs

It could do 147 time steps of ~30K x 40K, 4 bands at 10m though - ‘do’ here meaning take the median of one of the bands in the above cases -
but not 500 time steps for the previous size, or 500 for even 1 band

Similar story for an almost 4TB machine.
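
For a rough sense of why these shapes blow up, here is a back-of-envelope sketch (my assumptions: the figures above are pixel dimensions, uint16 Sentinel-2 pixels at 2 bytes each, and masking/compression are ignored). The totals land close to the 2TB/4TB RAM figures, which roughly lines up with what did and didn’t run:

# Full (time, band, y, x) array size for the cases above (uint16 = 2 bytes/pixel).
def full_tb(times, height, width, bands=1, bytes_per_pixel=2):
    return times * height * width * bands * bytes_per_pixel / 1e12

print(full_tb(179, 100_000, 50_000))           # ~1.8 TB - full SA, single band
print(full_tb(147, 60_000, 50_000, bands=4))   # ~3.5 TB - one quadrant, 4 bands
print(full_tb(147, 30_000, 40_000, bands=4))   # ~1.4 TB - the case that did fit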

Yes, the Dask docs have some info on this.

Yes, I had looked at that a while ago - hence some ballpark exploring now, to get an idea of the limits.

The ‘breaking up’ part is an interesting engineering question: what is the sweet spot of request count / machine size / machine bandwidth for this sort of batch operation?

Looking down the other end @Guy_Maskall, a c5d.4xlarge 32GB/16vCPU machine (decent networking specs) did a 25 time step median in under 3 minutes (2:47) for one 10m band of one Sentinel-2 scene.

The 8GB/4vCPU c5 cousin of the above took 3:25, as opposed to the 2:47 above, for the same blue band median processing.

About a third longer, for roughly a quarter of the hourly cost: $0.422 vs $0.106 per hour.
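
A quick per-job cost check, using the wall times and hourly prices quoted above (a sketch; pairing the $0.422 rate with the c5d.4xlarge and the $0.106 rate with the smaller 8GB/4vCPU box is my assumption):

# Cost per blue-band median, from the wall times and $/hour quoted above.
runs = {
    "c5d.4xlarge (32GB/16vCPU)": (2 * 60 + 47, 0.422),  # (seconds, $/hour)
    "c5 8GB/4vCPU":              (3 * 60 + 25, 0.106),
}
for name, (seconds, hourly) in runs.items():
    print(f"{name}: ${seconds / 3600 * hourly:.4f} per median")
# ~$0.020 vs ~$0.006 - the small instance is roughly 3x cheaper per job here.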

Nature paper about the Landsat bare earth models produced by Dale Roberts and John Wilford
https://www.nature.com/articles/s41467-019-13276-1

This thread may find this tweet interesting


Interesting indeed - I was talking to Dale Roberts yesterday and he did say he was doing a Sentinel-2 bare-earth mosaic in the not too distant future.

For ODC they have this: GitHub - opendatacube/datacube-k8s-eks: Deploy a production scale datacube cluster on AWS using EKS

Dale Roberts’ work is done on the NCI - but an AWS deployment is something anyone could do… a la the Pangeo version.

Also investigating this capability: Earth Data Analytics – CSIRO Centre for Earth Observation


Trying out stackstac

I will note that for a single-band compute of approximately 40000 x 20000 pixels, processing 308 time steps took
only slightly longer than 36 time steps did with an older method.

On a 768GB EC2 instance - it was using a bit over half the memory at the end. The bandwidth capabilities are pretty good, too.

so that is pretty great

Basically a terabyte-scale thing.

@RichardScottOZ As we share almost the same target (getting an S2 cube of data), I was looking into stackstac to see if it could work for me as well.

From the documentation it seems the area covered isn’t going to be bigger than a single MGRS (UTM) zone, so there is no word on this; one cube, and that’s it.
As far as I understand, you are dealing with bigger areas spanning multiple MGRS (UTM) zones - how have you dealt with the reprojection?

I have been looking at stackstac the last few days! Definitely recommend checking it out, some impressive performance so far.

It has built-in reprojection in that sense - e.g. I know that SA spans multiple UTM zones, and I can say ‘give me my xarray in epsg:3107’, which is very nice (see the sketch below).
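
Something like this sketch is what I mean; the catalogue-specific bits (the asset name "B02" and the item list) are assumptions - swap in whatever your catalogue uses:

import stackstac

# 'items' is a list/ItemCollection of Sentinel-2 STAC items covering SA,
# e.g. from a pystac-client search (see the paging workaround further down).
stack = stackstac.stack(
    items,
    assets=["B02"],    # blue band only; asset names depend on the catalogue
    epsg=3107,         # GDA94 / SA Lambert - one grid across all the UTM zones
    resolution=10,
    chunksize=2048,
)
print(stack)           # lazy dask-backed xarray with dims (time, band, y, x)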

I ran a 170 or so step time series over all of SA on one 768GB machine yesterday - it took about four hours. 370TB of lazy data, probably clipped down to 200-odd? It was spiking up to half memory use. That was just trying to get one band median out. Pretty good though.

One STAC catalogue problem with how the search currently works is paging: you get a maximum of 10000 results when you shouldn’t, so you have to break the query up yourself (e.g. by date, as sketched below).
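
A sketch of one workaround - split the search by month and concatenate the results. The Earth Search v0 endpoint, collection name, and pystac-client are my assumptions here; the item-listing method is named items() in newer pystac-client releases:

import calendar
import pystac_client

catalog = pystac_client.Client.open("https://earth-search.aws.element84.com/v0")
sa_bbox = [129.0, -38.1, 141.0, -26.0]   # rough South Australia bounding box

items = []
for month in range(1, 13):
    last_day = calendar.monthrange(2020, month)[1]
    search = catalog.search(
        collections=["sentinel-s2-l2a-cogs"],
        bbox=sa_bbox,
        datetime=f"2020-{month:02d}-01/2020-{month:02d}-{last_day}",
        query={"eo:cloud_cover": {"eq": 0}},   # zero-cloud scenes only
    )
    items.extend(search.get_items())   # each monthly query stays under the cap

print(len(items))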

stackstac produced - 1 band (blue) median of zero cloud Sentinel 2 for South Australia in 2020

122GB of downstream output. It got up to using around 20 CPUs and half the memory of a 768GB EC2 machine over 4 hours.
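
For reference, the reduction itself is just an xarray median along time backed by dask - a sketch assuming the 'stack' from the earlier snippet and a single big machine:

from dask.distributed import Client

client = Client()   # local cluster - on a big EC2 box this spreads over all cores

blue = stack.sel(band="B02")
blue = blue.where(blue > 0)                     # treat 0 (Sentinel-2 nodata) as missing
blue_median = blue.median("time", skipna=True)  # lazy until compute()

result = blue_median.compute()                  # this is the multi-hour part
result.to_netcdf("sa_blue_median_2020.nc")      # or write a COG with rioxarray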

Are there best practice tools/workflows for creating stackstac and stac-vrt compatible STAC catalogs?

We would like to mimic the NAIP STAC items in the stackstac/stac-vrt example notebook by @TomAugspurger, which look like:

{'type': 'Feature',
 'geometry': {'coordinates': [[[-80.625, 26.9375],
    [-80.625, 27.0],
    [-80.6875, 27.0],
    [-80.6875, 26.9375],
    [-80.625, 26.9375]]],
  'type': 'Polygon'},
 'properties': {'created': '2021-02-19T17:39:53Z',
  'updated': '2021-02-19T17:39:53Z',
  'providers': [{'name': 'USDA Farm Service Agency',
    'roles': ['producer', 'licensor'],
    'url': 'https://www.fsa.usda.gov/programs-and-services/aerial-photography/imagery-programs/naip-imagery/'}],
  'gsd': 0.6,
  'datetime': '2019-12-15T00:00:00Z',
  'naip:state': 'fl',
  'proj:transform': [0.6, 0.0, 530802.0, 0.0, -0.6, 2986692.0, 0.0, 0.0, 1.0],
  'naip:year': '2019',
  'proj:shape': [12240, 11040],
  'proj:bbox': [530802.0, 2979348.0, 537426.0, 2986692.0],
  'proj:epsg': 26917},
 'id': 'fl_m_2608003_ne_17_060_20191215_20200113',
 'bbox': [-80.689721, 26.935508, -80.622775, 27.001976],
 'stac_version': '1.0.0-beta.2',
 'assets': {'image': {'title': 'RGBIR COG tile',
   'href': 'https://naipeuwest.blob.core.windows.net/naip/v002/fl/2019/fl_60cm_2019/26080/m_2608003_ne_17_060_20191215.tif',
   'type': 'image/tiff; application=geotiff; profile=cloud-optimized',
   'roles': ['data'],
   'eo:bands': [{'name': 'Red', 'common_name': 'red'},
    {'name': 'Green', 'common_name': 'green'},
    {'name': 'Blue', 'common_name': 'blue'},
    {'name': 'NIR', 'common_name': 'nir', 'description': 'near-infrared'}]}}}

The goal is to produce STAC items for a collection of USGS 1m DEM data at:
s3://prd-tnm/StagedProducts/Elevation/1m/Projects

(Here’s a notebook displaying one of these tiles, just for example:
Jupyter Notebook Viewer)

stac-vrt — stac-vrt 1.0.1 documentation (stac-vrt.readthedocs.io) lists a few fields that I think apply to both stac-vrt and stackstac:

proj:epsg
proj:shape
proj:bbox
proj:transform

There are a couple of other bits of metadata I haven’t figured out yet (see here).
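
One way to fill those fields in is to read them straight off each COG with rasterio and attach them to a pystac Item - a sketch under the assumption that each DEM tile is a single-asset COG; the item id, href, and acquisition date here are placeholders:

import datetime
import pystac
import rasterio
from rasterio.warp import transform_bounds

def dem_item(href: str, item_id: str) -> pystac.Item:
    with rasterio.open(href) as src:
        bounds = src.bounds
        # WGS84 footprint for the item geometry/bbox
        w, s, e, n = transform_bounds(src.crs, "EPSG:4326", *bounds)
        item = pystac.Item(
            id=item_id,
            geometry={"type": "Polygon",
                      "coordinates": [[[w, s], [e, s], [e, n], [w, n], [w, s]]]},
            bbox=[w, s, e, n],
            datetime=datetime.datetime(2020, 1, 1),   # placeholder acquisition date
            properties={},
        )
        # The projection fields stac-vrt/stackstac rely on, read from the COG itself
        item.properties["proj:epsg"] = src.crs.to_epsg()
        item.properties["proj:shape"] = [src.height, src.width]
        item.properties["proj:bbox"] = list(bounds)
        item.properties["proj:transform"] = list(src.transform)   # 9-element affine
        item.add_asset(
            "image",
            pystac.Asset(href=href,
                         media_type=pystac.MediaType.COG,
                         roles=["data"]),
        )
    return item

stactools (mentioned below) is the more standard route, but something like this covers the minimum that stackstac and stac-vrt need.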


On 3DEP specifically, AI for Earth is onboarding it into Azure and our STAC endpoint. Add support for USGS 3DEP (formerly NED) items and collections by gadomski · Pull Request #81 · stac-utils/stactools (github.com) is an open PR to generate STAC items. I don’t have a timeline yet, but let me know if you’re interested in using it.

I should write a dedicated post on stactools, but that’s aiming to be a standard library for generating STAC items for scenes from many datasets. Then multiple data providers can all use the same tools to generate STAC items to ensure consistency in metadata.

The next thing is: what is the most performant way to go from this to downstream ML - transpose/reshape/stack possibilities, etc. (a sketch of the usual pattern is at the end of this post). There seem to be multiple issues to deal with in the parallel version of this activity.

Possibly another thread, as more general.

e.g. Dask reshape size error · Issue #7496 · dask/dask · GitHub
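
On the reshape question above: for per-pixel ML the usual pattern is to collapse the spatial dims into one samples axis and keep bands as the feature columns - a sketch, assuming a dask-backed (band, y, x) composite such as the median computed earlier. This stacking step is exactly where dask reshape/rechunking issues like the one linked above can bite:

import dask.array as da

# 'composite' is a dask-backed (band, y, x) DataArray, e.g. a median over time.
table = (
    composite
    .stack(sample=("y", "x"))       # flatten pixels into a single samples dimension
    .transpose("sample", "band")    # rows = pixels, columns = band features
)

X = table.data                      # dask array of shape (n_pixels, n_bands)
X = X[~da.isnan(X).any(axis=1)]     # drop all-nodata pixels before fitting a model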