Last year, one of our projects was this: [Copa de Cobre | Unearthed]
Basically: take a Sentinel satellite mosaic and make a geological map out of it, largely with machine learning.
“So, the scale of the problem: our Sentinel-2 mosaic of the geologically interesting part of Peru is 132572 columns by 157588 rows by 10 spectral bands, each a 2-byte spectral reflectance from the deep blue to the shortwave infrared. That's around half a terabyte. We have similar-scale problems with states of Australia, which are of comparable size, or multiples thereof.”
I would be interested in people's thoughts on developing a less laborious workflow, so that analysts/consultants can spend more time on investigation, modelling and science rather than on data wrangling.
Starting with a bounding box, for example for South Australia (or a polygon outline), and having the data automatically retrieved, e.g. to S3, to some specification: most cloud-free, most vegetation-free, or whatever else might be feasible, those being two obvious criteria for geology. A "tell it to start on Friday and see where you are on Monday" sort of thing. A sketch of what that retrieval step might look like is below.
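As a concrete starting point, something like the following could do the retrieval. This is a minimal sketch, assuming the Element 84 Earth Search STAC API and the pystac-client/stackstac stack; the bounding box, date range and bucket name are placeholders, and the median composite is only a crude stand-in for a proper cloud/vegetation-free mosaic:

```python
import pystac_client
import stackstac

# Earth Search hosts Sentinel-2 L2A as COGs on AWS (assumed endpoint).
catalog = pystac_client.Client.open("https://earth-search.aws.element84.com/v1")

bbox = [129.0, -38.1, 141.0, -26.0]  # rough South Australia extent (placeholder)
search = catalog.search(
    collections=["sentinel-2-l2a"],
    bbox=bbox,
    datetime="2023-01-01/2023-12-31",
    query={"eo:cloud_cover": {"lt": 10}},  # prefer near-cloud-free scenes
)
items = search.item_collection()

# Lazily stack the scenes into one dask-backed array; nothing is downloaded yet.
# A common CRS is needed because the scenes span multiple UTM zones.
data = stackstac.stack(
    items,
    assets=["red", "green", "blue", "nir"],
    epsg=3577,       # Australian Albers
    resolution=10,
)

# Crude cloud-robust mosaic: per-pixel median over the time axis.
mosaic = data.median(dim="time")

# Write straight to object storage (hypothetical bucket; needs s3fs installed;
# in practice some non-serialisable coords may need dropping first).
mosaic.to_dataset(name="reflectance").to_zarr("s3://my-bucket/sa-mosaic.zarr", mode="w")
```

Kick that off on a dask cluster on Friday, and that is exactly the "see where you are on Monday" workflow.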
Then you end up with raster data that has to become machine-learning data. Which is maybe OK at this scale, but not necessarily if we wanted to work on all of Brazil. Mentioning this sort of thing to AWS data scientists/engineers has caused their eyes to widen too, even when referring back to Pangeo. The reshaping itself is simple enough, as sketched below.
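For the raster-to-ML-table step, the lazy reshape is straightforward in xarray. A sketch, assuming the dask-backed `mosaic` from the previous snippet with dims ("band", "y", "x"):

```python
# Flatten the spatial dims so each pixel becomes one sample row.
table = (
    mosaic.stack(sample=("y", "x"))  # dims become ("band", "sample")
          .transpose("sample", "band")
)

# Dask array of shape (n_pixels, n_bands), still lazy; in practice you would
# also mask nodata/ocean pixels before feeding this to a learner.
X = table.data
print(X.shape, X.chunks)
```

The reshape is free; the real question is what happens when a learner starts pulling those hundreds of gigabytes through it.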
If you make all this Zarr, for example: Zarr in what layout, in what store? Given the numbers above, a naive chunking can mean many millions of objects in a standard Zarr store, which is not ideal when an S3 LIST request returns at most 1000 keys. What's the best chunk sizing? Some back-of-envelope numbers follow.
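Some back-of-envelope arithmetic (my numbers, one plausible chunking, not a recommendation) suggests the object-count problem largely disappears with reasonably fat chunks:

```python
import math

cols, rows, bands, dtype_bytes = 132572, 157588, 10, 2

chunk_y = chunk_x = 2048  # tune to the access pattern; all 10 bands per chunk
n_chunks = math.ceil(rows / chunk_y) * math.ceil(cols / chunk_x)
chunk_mb = chunk_y * chunk_x * bands * dtype_bytes / 1e6

print(f"{n_chunks} chunks of ~{chunk_mb:.0f} MB each")
# -> 5005 chunks of ~84 MB each
```

Five thousand ~84 MB objects is entirely comfortable for S3; the millions-of-files scenario only arises with very small chunks (say 256×256, one band per chunk). Consolidated metadata (Zarr v2) or sharding (Zarr v3) would cut the key count further, though I'd treat those as options to evaluate rather than givens.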
So: suggestions for training and inference, please? The latter is the painful part if geologists want high-resolution maps: at the 10 m resolution of Sentinel-2 data, full-coverage inference is a multibillion-prediction problem, heavy in both time and compute. One blockwise approach is sketched below.
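For the inference side, blockwise prediction over the chunked array keeps memory bounded and parallelises naturally. A sketch, assuming a fitted scikit-learn-style classifier (`model` is hypothetical) and a dask array `arr` of shape (bands, y, x) with the band axis in a single chunk:

```python
import dask.array as da
import numpy as np

def predict_block(block, model):
    """Predict one class per pixel for a single (bands, y, x) chunk."""
    b, h, w = block.shape
    features = block.reshape(b, -1).T            # (pixels, bands)
    labels = model.predict(features).astype(np.uint8)
    return labels.reshape(h, w)

labels = da.map_blocks(
    predict_block,
    arr,
    model=model,
    drop_axis=0,      # the band axis is consumed by the classifier
    dtype=np.uint8,
)
labels.to_zarr("s3://my-bucket/geology-map.zarr")  # placeholder bucket
```

With the ~84 MB chunks above, that's roughly 5000 independent tasks of ~4 million predictions each, so no single worker ever holds the whole map.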
Can you do all that in object storage without too big an overhead penalty? Minor per-request delays multiplied by several hundred billion operations isn't good.
Has anyone been working on multibillion-sample clustering problems?
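For concreteness, one incremental approach: fit on a sample of chunks, then assign labels blockwise. A sketch, assuming scikit-learn's MiniBatchKMeans and the lazy pixel table `X` from the reshaping snippet; the cluster count, sample size and nodata handling are all placeholders:

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

km = MiniBatchKMeans(n_clusters=12, batch_size=100_000, random_state=0)

# Fit incrementally on a random subset of row-chunks, not every pixel.
rng = np.random.default_rng(0)
n_sample = min(50, X.numblocks[0])
for i in rng.choice(X.numblocks[0], size=n_sample, replace=False):
    chunk = X.blocks[i].compute()          # pull one chunk into memory
    km.partial_fit(np.nan_to_num(chunk))   # crude nodata handling

# Labelling all ~20 billion pixels is then the same blockwise map as the
# inference sketch above, with km.predict in place of model.predict.
```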
Thanks very much,
Richard