Data catalogs (and a bit of data engineering such as datacube, STAC) and Google Earth Engine

Hi everyone,

Firstly, I’d like to say it’s a pleasure to be back after a year working outside of geospatial (I know, what was I thinking?) :wink:

In my previous (geospatial data science) role, I worked on-prem (sigh) accessing geoTIFFs that had been downloaded from scihub and indexed into datacube. So at least I got to work with xarray and dask. This is largely how I discovered this wonderful community. A chunk of my work there centred on developing a package to intersect such raster data with vector ground truth data for ML training and prediction.

I’m now with a very early stage start-up with little or no legacy code. There’s been some use of Google Earth Engine (GE), but I’m not a massive fan. It doesn’t feel like the right data/platform solution. It does have a tonne of datasets available, and it does state it supports PB-scale workloads, but I get the feeling it’s more popular with non-production types simply wanting to get on and visually explore a bunch of data. I’m also particularly interested in Sentinel-5(P) data. As far as I can tell, GE provides its own L3 product (I think on a coarser 10 km grid?) rather than the ESA L2 product (latest processing gives 3.5 km by 5.5 km). And, whilst GE does provide a bunch of other data such as soil texture and pH, it’s a bit opaque where these data originate and what their true properties are.

My expectation is that we’re going to be AWS-based. That’s my main background as well. I’m aware people use the AWS-provided Sentinel 2 data that’s freely available in S3 (I think it’s not even requester-pays?). I’m also aware of STACs (and I recall seeing a demo of stackstac). Indeed, there’s a STAC demo for accessing Sentinel 2 data. But whilst there’s this for Sentinel 2, I haven’t seen a similar openly available, S3-based data archive on AWS for Sentinel 5(P).

So my main thrust here is really twofold. First, how to find a good listing of available remote sensing/geospatial datasets for the main satellites (so certainly all the Sentinels, and Landsat I guess), ideally also covering other datasets such as soil pH or soil type. Second, how best to go about creating an interface to them that works with xarray and the “usual” tech stack. GE does at least provide a catalog. There’s a listing of sorts on AWS Earth but it’s not super user friendly (and doesn’t seem to mention Sentinel 5).

Is my perception of the use case of GE valid?
Is it worth pulling data from the GE catalog (even from AWS??) just for the convenience of a comprehensive index of varied datasets?
If you were setting up a modern data stack on AWS that leveraged xarray, dask, Jupyter etc (on the analysis end), how would you go about setting up the data back end? What are good/best practices there? Pulling data from scihub as and when you wanted it and ingesting into datacube? Or is datacube really just for when you’ve pulled data down locally and want to index it? Or leveraging other sources?
I’ve seen references to Titiler. Is this a good route?

So basically, any good/best practice, advice, noob guides, suggested resources for information on finding and setting up data sources?

Thanks for listening (!) and any advice,

The AWS Sentinel-5P listing is “Sentinel-5P Level 2” on the Registry of Open Data on AWS. Another possible source is Azure/Microsoft Planetary Computer.

Yes, go STAC! It sounds like you’re a bit familiar with Open DataCube (ODC). If you haven’t already, I’d encourage you to read the GitHub issue “clarification on difference between this library and stackstac?” (Issue #54 in opendatacube/odc-stac), which highlights the main design differences between odc-stac and stackstac.
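To make the STAC idea a bit more concrete, here is a minimal sketch of what a STAC API search looks like underneath the client libraries. It only uses the Python standard library and builds the JSON body for a `POST /search` request. The endpoint (Element 84's Earth Search), the `sentinel-2-l2a` collection id, the bounding box and the helper names `build_search` and `search` are all assumptions for illustration, not something from this thread.

```python
import json
import urllib.request

# Assumption: Element 84's Earth Search endpoint and its Sentinel-2 L2A
# collection id; swap in whichever STAC API and collection you actually use.
STAC_API = "https://earth-search.aws.element84.com/v1"
COLLECTION = "sentinel-2-l2a"


def build_search(bbox, start, end, collection=COLLECTION, limit=10):
    """Build the JSON body for a STAC API `POST /search` request."""
    if len(bbox) != 4:
        raise ValueError("bbox must be (min_lon, min_lat, max_lon, max_lat)")
    return {
        "collections": [collection],
        "bbox": list(bbox),
        "datetime": f"{start}/{end}",  # closed interval of RFC 3339 timestamps
        "limit": limit,
    }


def search(body, api=STAC_API):
    """POST the search body and return the matching items (GeoJSON features)."""
    request = urllib.request.Request(
        f"{api}/search",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)["features"]


body = build_search(
    bbox=(-1.5, 52.0, -1.0, 52.5),  # hypothetical area of interest
    start="2023-06-01T00:00:00Z",
    end="2023-06-30T23:59:59Z",
)
print(json.dumps(body, indent=2))
# items = search(body)  # needs network access; each item lists its asset hrefs
```

In practice you would probably let pystac-client issue the search and hand the resulting items to odc-stac or stackstac to get an xarray object, but the request they send is essentially the body above.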

You’re asking for a lot :laughing: Maybe these STAC browsers will help:

Before we get carried away on this, just wanted to clarify - are you looking at doing regular remote sensing visualization/analysis type work, or machine learning type stuff?


Oh thank you @weiji14 !! You’ve already clearly given me some good pointers; a steer towards something (STAC) and an article discussing differences between libraries is very much the sort of response I was hoping for. I’ll look forward to reading that.

Thanks for the Sentinel 5P AWS link. How did you know about or find that?? I (naively?) went to the AWS Earth registry of open data and plugged in “Sentinel 5” and got “currently 0 matching dataset” as a result! I just noticed the “explore the catalog” button, but that leads you to a page that doesn’t seem easily searchable; it just gives some basic filtering and browsing functionality?

The simple answers are yes and yes. Well maybe, probably. :wink: Our current primary focus is on nitrogen: nitrogen applied to fields as fertilizer, nitrogen emitted back into the atmosphere or lost through ground water etc. Whilst I’m “all about” the ML, I’m firmly in the “ML where it’s relevant” camp. So I’d say there’ll be a strong element of regular RS/analysis, but the infra definitely needs to support subsequent ML (or modelling of some sort) because a number of the outcomes of interest aren’t directly measured.

The trick is to search with a dash. I.e. type in “Sentinel-5P” and not “Sentinel 5P” :laughing:

Great, glad to see another person who’s remote-sensing-first, ML-second! I’m gonna let others chime in with other ideas first, but will make one shameless plug for this blog post, “Enabling GPU-native analytics with Xarray and kvikIO”, which may or may not convince you to store your data in Zarr.

Just a quick note that we have Sentinel-5P (the level 2 product you mention) on Azure through the Planetary Computer. That’s just the NetCDF files currently. In the future, we’ll have STAC items for those NetCDF files, and we should be adding a higher-level L3 product as COGs (with STAC items) as well.

You can find a full list of the datasets we host at Planetary Computer.

I don’t know whether to smack my own head or that of the person who wrote that search term parser! Yes I found it exactly as you said. :roll_eyes:

I’ve heard good things about Zarr. I’ve previously saved output geospatial product as netCDF because it was a handy way to process predictions, and update the output, in chunks before then generating a geoTIFF from the output layer (sounds kinda clunky, but the end consumer at that place wanted the product as a geoTIFF). I’ve also found parquet worked great for pulling out raster data intersected with ground truth for subsequent ML training.
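For anyone wondering what “raster data intersected with ground truth” boils down to, here is a minimal sketch in plain Python. It assumes a north-up raster with square pixels and point ground truth in the same CRS; the tiny raster, the coordinates, the labels and the helper names `point_to_pixel` and `sample_points` are all made up for illustration. Real work would use xarray or rasterio, and the resulting records would then be written out to Parquet (not shown here).

```python
import math


def point_to_pixel(x, y, origin_x, origin_y, pixel_size):
    """Map a map coordinate to (row, col) in a north-up raster.

    origin_x / origin_y are the coordinates of the raster's top-left corner.
    """
    col = math.floor((x - origin_x) / pixel_size)
    row = math.floor((origin_y - y) / pixel_size)
    return row, col


def sample_points(raster, points, origin_x, origin_y, pixel_size):
    """Return one record per labelled point that falls inside the raster."""
    n_rows, n_cols = len(raster), len(raster[0])
    records = []
    for point in points:
        row, col = point_to_pixel(
            point["x"], point["y"], origin_x, origin_y, pixel_size
        )
        # Points outside the raster extent are skipped.
        if 0 <= row < n_rows and 0 <= col < n_cols:
            records.append(
                {"id": point["id"], "label": point["label"], "value": raster[row][col]}
            )
    return records


# Hypothetical 3x3 raster, 10 m pixels, top-left corner at (500000, 4000030).
raster = [
    [0.1, 0.2, 0.3],
    [0.4, 0.5, 0.6],
    [0.7, 0.8, 0.9],
]
points = [
    {"id": "a", "x": 500005.0, "y": 4000025.0, "label": "wheat"},   # top-left pixel
    {"id": "b", "x": 500025.0, "y": 4000005.0, "label": "barley"},  # bottom-right pixel
    {"id": "c", "x": 500100.0, "y": 4000005.0, "label": "maize"},   # outside the raster
]
records = sample_points(raster, points, 500000.0, 4000030.0, 10.0)
print(records)
```

Each record is a flat row of identifier, label and pixel value, which is why a columnar format like Parquet suits the training-table side of things.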

We (NASA GES DISC) also host Sentinel 5P data on the cloud:

For working with the NetCDFs in the cloud, also check out kerchunk.
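As a rough idea of how that goes, here is a minimal sketch, assuming `fsspec`, `kerchunk`, `xarray` and (for `s3://` URLs) `s3fs` are installed. kerchunk scans a NetCDF4/HDF5 file once and records where each chunk lives, so xarray can then read it through the Zarr engine without converting the file. The helper names `build_references` and `reference_open_kwargs` are hypothetical, and the bucket/key in the commented usage is a placeholder, not a real path.

```python
def build_references(url, storage_options=None):
    """Scan one remote NetCDF4/HDF5 file and return kerchunk references.

    The libraries are imported lazily so the helper below can be used
    (and inspected) without kerchunk installed.
    """
    import fsspec
    from kerchunk.hdf import SingleHdf5ToZarr

    storage_options = storage_options or {}
    with fsspec.open(url, "rb", **storage_options) as f:
        return SingleHdf5ToZarr(f, url).translate()


def reference_open_kwargs(refs, remote_protocol="s3", remote_options=None):
    """Keyword arguments for `xarray.open_dataset` on a kerchunk reference set."""
    return {
        "engine": "zarr",
        "backend_kwargs": {
            "consolidated": False,
            "storage_options": {
                "fo": refs,  # the reference dict (or a path to it as JSON)
                "remote_protocol": remote_protocol,
                "remote_options": remote_options or {},
            },
        },
    }


# Demonstrate the keyword arguments with an empty, placeholder reference set.
kwargs = reference_open_kwargs({"version": 1, "refs": {}}, remote_options={"anon": True})
print(kwargs["backend_kwargs"]["storage_options"]["remote_protocol"])

# Usage against a real file (needs network access; the path is a placeholder):
# import xarray as xr
# refs = build_references("s3://<bucket>/<key>.nc", {"anon": True})
# ds = xr.open_dataset(
#     "reference://",
#     group="PRODUCT",  # Sentinel-5P L2 files keep the main variables in this group
#     **reference_open_kwargs(refs, remote_options={"anon": True}),
# )
```

The references can be saved as JSON and reused, so the scan only has to happen once per file.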