Hi everyone,
Firstly, I’d like to say it’s a pleasure to be back after a year working outside of geospatial (I know, what was I thinking?).
In my previous (geospatial data science) role, I worked on-prem (sigh), accessing GeoTIFFs that had been downloaded from scihub and indexed into datacube. So at least I got to work with xarray and dask, which is largely how I discovered this wonderful community. A chunk of my work there centred on developing a package to intersect such raster data with vector ground-truth data for ML training and prediction.
I’m now with a very early-stage start-up with little or no legacy code. There’s been some use of Google Earth Engine (GE), but I’m not a massive fan. It doesn’t feel like the right data/platform solution. It does have a tonne of datasets available, and it does state that it supports PB-scale workloads, but I get the feeling it’s more popular with non-production types who simply want to get on and visually explore a bunch of data. I’m also particularly interested in Sentinel-5(P) data. My impression is that GE provides its own L3 product (I think on a coarser 10 km grid?) rather than the ESA L2 product (latest processing gives 3.5 km by 5.5 km). And, whilst GE does provide a bunch of other data such as soil texture and pH, it’s a bit opaque where these data originate and what their true properties are.
My expectation is that we’re going to be AWS-based. That’s my main background as well. I’m aware people use the AWS-provided Sentinel 2 data that’s freely available in S3 (I think it’s not even requester-pays?). I’m also aware of STACs, and I recall seeing a demo of stackstac. Indeed, there’s a STAC demo for accessing Sentinel 2 data. But whilst there’s this for Sentinel 2, I haven’t seen a similar openly available S3-based archive on AWS for Sentinel 5(P).
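For concreteness, the sort of STAC query I have in mind looks something like the following. It’s only a minimal sketch using the Python standard library: the Earth Search endpoint URL, the `sentinel-2-l2a` collection id, the bounding box, the dates and the helper names `build_search_body` and `search` are all my assumptions for illustration, not anything official. In practice I’d expect to use pystac-client and stackstac rather than raw HTTP.

```python
import json
import urllib.request

# Assumptions: Element 84's public Earth Search STAC API and its Sentinel-2 L2A
# collection id. Swap in whichever STAC endpoint/collection you actually use.
STAC_API_URL = "https://earth-search.aws.element84.com/v1"
COLLECTION = "sentinel-2-l2a"


def build_search_body(collections, bbox, start, end, max_cloud=None, limit=10):
    """Build the JSON body for a STAC API item search (POST /search)."""
    if len(bbox) != 4:
        raise ValueError("bbox must be (west, south, east, north)")
    body = {
        "collections": list(collections),
        "bbox": list(bbox),
        "datetime": f"{start}/{end}",  # closed interval of RFC 3339 timestamps
        "limit": limit,
    }
    if max_cloud is not None:
        # Relies on the STAC API Query extension; the server must support it.
        body["query"] = {"eo:cloud_cover": {"lt": max_cloud}}
    return body


def search(api_url, body, timeout=30):
    """POST the search body and return the matching items (first page only)."""
    request = urllib.request.Request(
        f"{api_url}/search",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)["features"]


# Hypothetical area of interest and date range, purely for illustration.
body = build_search_body(
    [COLLECTION],
    bbox=(-1.6, 53.7, -1.4, 53.9),
    start="2023-06-01T00:00:00Z",
    end="2023-06-30T23:59:59Z",
    max_cloud=20,
)
print(json.dumps(body, indent=2))
# items = search(STAC_API_URL, body)  # uncomment to run; needs network access
```

Each returned item should carry asset links to the underlying files, which is what I’d then want to hand to something like stackstac to get an xarray object. What I’m missing is the equivalent entry point for Sentinel 5(P).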
So my main thrust here is really twofold. First, how to find a good listing of available remote sensing/geospatial datasets for the main satellites (so certainly all the Sentinels, and Landsat I guess), ideally also covering other datasets such as soil pH or soil type. Second, how best to go about creating an interface to them that works with xarray and the “usual” tech stack. GE does at least provide a catalog. There’s a listing of sorts on AWS Earth, but it’s not super user-friendly (and doesn’t seem to mention Sentinel 5).
Is my perception of the use case of GE valid?
Is it worth pulling data from the GE catalog (even from AWS??) just for the convenience of a comprehensive index of varied datasets?
If you were setting up a modern data stack on AWS that leveraged xarray, dask, Jupyter etc. (on the analysis end), how would you go about setting up the data back end? What are good/best practices there? Pulling data from scihub as and when you want it and ingesting it into datacube? Or is datacube really just for when you’ve pulled data down locally and want to index it? Or leveraging other sources?
I’ve seen references to Titiler. Is this a good route?
So basically, I’d welcome any good/best practices, advice, noob guides, or suggested resources for finding and setting up data sources.
Thanks for listening (!) and any advice,
Guy