Best Practices for automating large-scale Sentinel dataset building and Machine Learning?

A 2TB machine couldn’t do 179 time steps of ~100K x 50K pixels at 2048,2048 chunks, or 50K x 25K over South Australia

Looking at dividing SA into 4 and taking the top left quadrant

It could do 16 time steps of ~40K x 30K at 2048,2048 chunks, 4 bands at 10m (cloudless only - which would be missing a lot)

It couldn’t do 147 time steps of ~60K x 50K at 2048,2048 chunks, 4 bands at 10m - i.e. most of what was available for that quarter of the state in the free COGs

It could do 147 time steps of ~30K x 40K, 4 bands at 10m though - ‘do’ here meaning take the median of one of the bands in the above cases -
but not 500 time steps for the previous size, or 500 for even 1 band

Similar story for an almost 4TB machine.
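
For a rough sense of why these shapes blow up, here is a back-of-envelope sketch (my assumptions: the figures above are pixel dimensions, uint16 Sentinel-2 pixels at 2 bytes each, and masking/compression are ignored). The totals land close to the 2TB/4TB RAM figures, which roughly lines up with what did and didn’t run:

# Full (time, band, y, x) array size for the cases above (uint16 = 2 bytes/pixel).
def full_tb(times, height, width, bands=1, bytes_per_pixel=2):
    return times * height * width * bands * bytes_per_pixel / 1e12

print(full_tb(179, 100_000, 50_000))           # ~1.8 TB - full SA, single band
print(full_tb(147, 60_000, 50_000, bands=4))   # ~3.5 TB - one quadrant, 4 bands
print(full_tb(147, 30_000, 40_000, bands=4))   # ~1.4 TB - the case that did fit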

Yes, the Dask docs have some info on this.

Yes, I had looked at that a while ago - hence some ballpark exploring now, to get an idea of the limits.

The ‘breaking up’ part is an interesting engineering question: what is the sweet spot of request count / machine size / machine bandwidth for this sort of batch operation?

Looking down the other end @Guy_Maskall, a c5d.4xlarge 32GB/16vCPU machine (decent networking specs) did a 25 time step median in under 3 minutes (2:47) for one 10m band of one Sentinel-2 scene.

The 8GB/4vCPU c5 cousin of the above took 3:25, as opposed to the 2:47 above, for the same blue band median processing.

About a third longer, for roughly a quarter of the hourly cost: $0.422 vs $0.106 per hour.
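
A quick per-job cost check, using the wall times and hourly prices quoted above (a sketch; pairing the $0.422 rate with the c5d.4xlarge and the $0.106 rate with the smaller 8GB/4vCPU box is my assumption):

# Cost per blue-band median, from the wall times and $/hour quoted above.
runs = {
    "c5d.4xlarge (32GB/16vCPU)": (2 * 60 + 47, 0.422),  # (seconds, $/hour)
    "c5 8GB/4vCPU":              (3 * 60 + 25, 0.106),
}
for name, (seconds, hourly) in runs.items():
    print(f"{name}: ${seconds / 3600 * hourly:.4f} per median")
# ~$0.020 vs ~$0.006 - the small instance is roughly 3x cheaper per job here.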

Nature paper about the Landsat bare earth models produced by Dale Roberts and John Wilford
https://www.nature.com/articles/s41467-019-13276-1

This thread may find this tweet interesting


Interesting indeed - I was talking to Dale Roberts yesterday and he did say he was doing a Sentinel-2 bare-earth mosaic in the not too distant future.

For ODC they have this: GitHub - opendatacube/datacube-k8s-eks: Deploy a production scale datacube cluster on AWS using EKS

Dale Roberts’ work is done on the NCI - but an AWS deployment is something anyone could do… a la the Pangeo version.

Also investigating this capability: Earth Data Analytics – CSIRO Centre for Earth Observation


Trying out stackstac

I will note that for a single-band compute of approximately 40000 x 20000 pixels, processing 308 time steps took
only slightly longer than 36 time steps did with an older method.

On a 768GB EC2 instance - it was using a bit over half the memory at the end. The bandwidth capabilities are pretty good, too.

so that is pretty great

Basically a terabyte-scale thing.

@RichardScottOZ As we share almost the same target (getting an S2 cube of data), I was looking into stackstac to see if it could work for me as well.

From the documentation it seems the area covered isn’t going to be bigger than a single MGRS (UTM) zone, so there is no word on this; one cube, and that’s it.
As far as I understand, you are dealing with bigger areas spanning multiple MGRS (UTM) zones - how have you dealt with the reprojection?

I have been looking at stackstac the last few days! Definitely recommend checking it out, some impressive performance so far.

It has built-in reprojection in that sense - e.g. I know that SA spans multiple UTM zones, and I can say ‘give me my xarray in epsg:3107’, which is very nice (see the sketch below).
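
Something like this sketch is what I mean; the catalogue-specific bits (the asset name "B02" and the item list) are assumptions - swap in whatever your catalogue uses:

import stackstac

# 'items' is a list/ItemCollection of Sentinel-2 STAC items covering SA,
# e.g. from a pystac-client search (see the paging workaround further down).
stack = stackstac.stack(
    items,
    assets=["B02"],    # blue band only; asset names depend on the catalogue
    epsg=3107,         # GDA94 / SA Lambert - one grid across all the UTM zones
    resolution=10,
    chunksize=2048,
)
print(stack)           # lazy dask-backed xarray with dims (time, band, y, x)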

I ran a 170 or so step time series over all of SA on one 768GB machine yesterday - it took about four hours. 370TB of lazy data, probably clipped down to 200-odd? It was spiking up to half memory use. That was just trying to get one band median out. Pretty good though.

One STAC catalogue problem with how the search currently works is paging: you get a maximum of 10000 results when you shouldn’t, so you have to break the query up yourself (e.g. by date, as sketched below).
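
A sketch of one workaround - split the search by month and concatenate the results. The Earth Search v0 endpoint, collection name, and pystac-client are my assumptions here; the item-listing method is named items() in newer pystac-client releases:

import calendar
import pystac_client

catalog = pystac_client.Client.open("https://earth-search.aws.element84.com/v0")
sa_bbox = [129.0, -38.1, 141.0, -26.0]   # rough South Australia bounding box

items = []
for month in range(1, 13):
    last_day = calendar.monthrange(2020, month)[1]
    search = catalog.search(
        collections=["sentinel-s2-l2a-cogs"],
        bbox=sa_bbox,
        datetime=f"2020-{month:02d}-01/2020-{month:02d}-{last_day}",
        query={"eo:cloud_cover": {"eq": 0}},   # zero-cloud scenes only
    )
    items.extend(search.get_items())   # each monthly query stays under the cap

print(len(items))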

stackstac produced - 1 band (blue) median of zero cloud Sentinel 2 for South Australia in 2020

122GB of downstream output. It got up to using around 20 CPUs and half the memory of a 768GB EC2 machine over 4 hours.
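
For reference, the reduction itself is just an xarray median along time backed by dask - a sketch assuming the 'stack' from the earlier snippet and a single big machine:

from dask.distributed import Client

client = Client()   # local cluster - on a big EC2 box this spreads over all cores

blue = stack.sel(band="B02")
blue = blue.where(blue > 0)                     # treat 0 (Sentinel-2 nodata) as missing
blue_median = blue.median("time", skipna=True)  # lazy until compute()

result = blue_median.compute()                  # this is the multi-hour part
result.to_netcdf("sa_blue_median_2020.nc")      # or write a COG with rioxarray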

Are there best practice tools/workflows for creating stackstac and stac-vrt compatible STAC catalogs?

We would like to mimic the NAIP STAC items in the stackstac/stac-vrt example notebook by @TomAugspurger, which look like:

{'type': 'Feature',
 'geometry': {'coordinates': [[[-80.625, 26.9375],
    [-80.625, 27.0],
    [-80.6875, 27.0],
    [-80.6875, 26.9375],
    [-80.625, 26.9375]]],
  'type': 'Polygon'},
 'properties': {'created': '2021-02-19T17:39:53Z',
  'updated': '2021-02-19T17:39:53Z',
  'providers': [{'name': 'USDA Farm Service Agency',
    'roles': ['producer', 'licensor'],
    'url': 'https://www.fsa.usda.gov/programs-and-services/aerial-photography/imagery-programs/naip-imagery/'}],
  'gsd': 0.6,
  'datetime': '2019-12-15T00:00:00Z',
  'naip:state': 'fl',
  'proj:transform': [0.6, 0.0, 530802.0, 0.0, -0.6, 2986692.0, 0.0, 0.0, 1.0],
  'naip:year': '2019',
  'proj:shape': [12240, 11040],
  'proj:bbox': [530802.0, 2979348.0, 537426.0, 2986692.0],
  'proj:epsg': 26917},
 'id': 'fl_m_2608003_ne_17_060_20191215_20200113',
 'bbox': [-80.689721, 26.935508, -80.622775, 27.001976],
 'stac_version': '1.0.0-beta.2',
 'assets': {'image': {'title': 'RGBIR COG tile',
   'href': 'https://naipeuwest.blob.core.windows.net/naip/v002/fl/2019/fl_60cm_2019/26080/m_2608003_ne_17_060_20191215.tif',
   'type': 'image/tiff; application=geotiff; profile=cloud-optimized',
   'roles': ['data'],
   'eo:bands': [{'name': 'Red', 'common_name': 'red'},
    {'name': 'Green', 'common_name': 'green'},
    {'name': 'Blue', 'common_name': 'blue'},
    {'name': 'NIR', 'common_name': 'nir', 'description': 'near-infrared'}]}}}

The goal is to produce STAC items for a collection of USGS 1m DEM data at:
s3://prd-tnm/StagedProducts/Elevation/1m/Projects

(Here’s a notebook displaying one of these tiles, just for example:
Jupyter Notebook Viewer)

stac-vrt — stac-vrt 1.0.1 documentation (stac-vrt.readthedocs.io) lists a few fields that I think apply to both stac-vrt and stackstac:

proj:epsg
proj:shape
proj:bbox
proj:transform

There are a couple of other bits of metadata I haven’t figured out yet (see here).
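
One way to fill those fields in is to read them straight off each COG with rasterio and attach them to a pystac Item - a sketch under the assumption that each DEM tile is a single-asset COG; the item id, href, and acquisition date here are placeholders:

import datetime
import pystac
import rasterio
from rasterio.warp import transform_bounds

def dem_item(href: str, item_id: str) -> pystac.Item:
    with rasterio.open(href) as src:
        bounds = src.bounds
        # WGS84 footprint for the item geometry/bbox
        w, s, e, n = transform_bounds(src.crs, "EPSG:4326", *bounds)
        item = pystac.Item(
            id=item_id,
            geometry={"type": "Polygon",
                      "coordinates": [[[w, s], [e, s], [e, n], [w, n], [w, s]]]},
            bbox=[w, s, e, n],
            datetime=datetime.datetime(2020, 1, 1),   # placeholder acquisition date
            properties={},
        )
        # The projection fields stac-vrt/stackstac rely on, read from the COG itself
        item.properties["proj:epsg"] = src.crs.to_epsg()
        item.properties["proj:shape"] = [src.height, src.width]
        item.properties["proj:bbox"] = list(bounds)
        item.properties["proj:transform"] = list(src.transform)   # 9-element affine
        item.add_asset(
            "image",
            pystac.Asset(href=href,
                         media_type=pystac.MediaType.COG,
                         roles=["data"]),
        )
    return item

stactools (mentioned below) is the more standard route, but something like this covers the minimum that stackstac and stac-vrt need.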


On 3DEP specifically, AI for Earth is onboarding it into Azure and our STAC endpoint. Add support for USGS 3DEP (formerly NED) items and collections by gadomski · Pull Request #81 · stac-utils/stactools (github.com) is an open PR to generate STAC items. I don’t have a timeline yet, but let me know if you’re interested in using it.

I should write a dedicated post on stactools, but that’s aiming to be a standard library for generating STAC items for scenes from many datasets. Then multiple data providers can all use the same tools to generate STAC items to ensure consistency in metadata.

The next thing is: what is the most performant way to go from this to downstream ML - transpose/reshape/stack possibilities, etc. (a sketch of the usual pattern is at the end of this post). There seem to be multiple issues to deal with in the parallel version of this activity.

Possibly another thread, as more general.

e.g. Dask reshape size error · Issue #7496 · dask/dask · GitHub
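
On the reshape question above: for per-pixel ML the usual pattern is to collapse the spatial dims into one samples axis and keep bands as the feature columns - a sketch, assuming a dask-backed (band, y, x) composite such as the median computed earlier. This stacking step is exactly where dask reshape/rechunking issues like the one linked above can bite:

import dask.array as da

# 'composite' is a dask-backed (band, y, x) DataArray, e.g. a median over time.
table = (
    composite
    .stack(sample=("y", "x"))       # flatten pixels into a single samples dimension
    .transpose("sample", "band")    # rows = pixels, columns = band features
)

X = table.data                      # dask array of shape (n_pixels, n_bands)
X = X[~da.isnan(X).any(axis=1)]     # drop all-nodata pixels before fitting a model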