Running Pangeo with Sentinel-2 datasets on Google Cloud

Hi,

We are trying Pangeo with the Sentinel-2 imagery available on Google Cloud, and our Pangeo instance is running in GCP. However, we are not sure whether the sat-search and intake-stac packages are best suited for our use case. The sat-search documentation says it is known to work with the AWS endpoint for Sentinel-2 data, but there is no mention of how it can be used with Google endpoints. The Sentinel-2 dataset can be searched via BigQuery, but it would be good to know if there are other or faster ways of accessing, searching, and processing it on Google Cloud. Can you please advise which packages would work best for us, and whether sat-search can or should be used in our case?

Thanks!


Welcome Sunaina and thanks for the interesting question!

I’m hoping that @scottyhq can give some advice on this.

Hi @Sunaina_Gupta,

Welcome to the Discourse! I’m curious to hear more details of your use case. The key for sat-search and intake-stac is that STAC catalogs exist for the data you’re interested in. I don’t believe a STAC catalog exists for https://cloud.google.com/storage/docs/public-datasets/sentinel-2

If you’re able to move this particular analysis to AWS (feel free to experiment on https://aws-uswest2.pangeo.io/), Element84 has put together a nice API endpoint, https://earth-search.aws.element84.com/v0, which currently exposes the following collections: sentinel-s2-l2a, sentinel-s2-l1c, sentinel-s2-l2a-cogs, and landsat-8-l1-c1.
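For reference, a minimal sat-search query against that endpoint might look like the sketch below; the bbox, date range, and cloud-cover threshold are placeholders you’d swap for your AOI:

```python
# Minimal sat-search query against the Element84 earth-search v0 API.
# The bbox, dates, and cloud-cover filter below are placeholders.
from satsearch import Search

search = Search(
    url="https://earth-search.aws.element84.com/v0",
    collections=["sentinel-s2-l2a-cogs"],
    bbox=[33.0, -3.0, 35.0, -1.0],          # lon_min, lat_min, lon_max, lat_max
    datetime="2020-06-01/2020-06-30",
    query={"eo:cloud_cover": {"lt": 20}},   # STAC query extension filter
)
print(f"{search.found()} scenes found")
items = search.items()                       # satstac ItemCollection
```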

We recently released a new version of Intake-STAC (0.3) with an example notebook for accessing the S2 L2A COGs, which are nice for surface-property analysis because atmospheric corrections have already been applied: https://github.com/intake/intake-stac/blob/master/examples/aws-earth-search.ipynb
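Loading one of those search results through intake-stac then looks roughly like this (a sketch assuming the `items` object from a sat-search query like the one above; the band name is just an example):

```python
# Sketch: wrap sat-search results in an Intake catalog and lazily load a band.
# Assumes `items` comes from a sat-search query like the one above.
import intake

catalog = intake.open_stac_item_collection(items)
item_id = list(catalog)[0]                  # pick the first scene
da = catalog[item_id]["B04"].to_dask()      # red band as a lazy, dask-backed DataArray
print(da)
```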

Also, the COG format is better suited to large-scale and distributed computing, and I’m assuming that is ultimately what you’re going for. Here is a good blog post on that topic: https://medium.com/@VincentS/do-you-really-want-people-using-your-data-ec94cd94dc3f
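To make the COG point concrete: because COGs are internally tiled with overviews, a client can issue HTTP range requests for just the window it needs instead of downloading the whole scene. A rough sketch (the URL is illustrative, not a real scene path):

```python
# Sketch: windowed read from a COG over HTTP -- only the bytes covering the
# requested window are fetched, thanks to internal tiling.
# The URL below is illustrative; substitute a real scene path.
import rasterio
from rasterio.windows import Window

url = "https://sentinel-cogs.s3.us-west-2.amazonaws.com/sentinel-s2-l2a-cogs/.../B04.tif"
with rasterio.open(url) as src:
    print(src.block_shapes[0], src.overviews(1))        # tile shape + overview levels
    chip = src.read(1, window=Window(0, 0, 512, 512))   # reads a few tiles, not the file
```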


Hello @Sunaina_Gupta, @scottyhq mentioned all the major points I think.

The issue is that there isn’t a STAC API for any of the Google datasets that I know of. sat-search needs a STAC-compliant API, and Earth-Search is only an index of some AWS datasets.

But also, the AWS Sentinel-2 data are COGs, rather than the JP2K they are on GCP. They are also not in a requester-pays bucket, so you could pull the data from AWS and process on GCP without any egress costs to you. But it’s not ideal, and it’s actually not a cool thing to do: AWS makes the data available for free to encourage users to use their platform for processing, and they would be absorbing all the egress costs.

I also don’t know what the status of the Sentinel-2 L2A data is on GCP. It’s pretty new, with historical processing only recently completed, and I don’t know if GCP has it all mirrored.

I see a couple options:
1 - Create a STAC index and API for the Google Cloud datasets. This is a bunch of work, although some of it may be underway.
2 - Assuming GCP does have it all mirrored, you could set up a proxy API that queries Earth-search and then rewrites the asset URLs to point to GCP locations, since you should be able to construct the path from the scene ID (see the sketch below).
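For option 2, the core of the rewrite could be as simple as parsing the MGRS tile out of the earth-search scene ID and mapping it onto the GCS bucket layout. A hypothetical sketch (the scene-ID format is an assumption, and the exact .SAFE product directory under the returned prefix would still need a bucket listing or an index lookup):

```python
# Hypothetical sketch: map an earth-search scene ID to the GCS tile prefix.
# Assumes IDs like "S2A_36MYB_20200701_0_L2A"; the exact .SAFE product
# directory under this prefix still has to be discovered by listing.
def gcs_tile_prefix(scene_id: str) -> str:
    mgrs = scene_id.split("_")[1]                 # e.g. "36MYB"
    utm, band, square = mgrs[:2], mgrs[2], mgrs[3:]
    return f"gs://gcp-public-data-sentinel-2/tiles/{utm}/{band}/{square}/"

print(gcs_tile_prefix("S2A_36MYB_20200701_0_L2A"))
# gs://gcp-public-data-sentinel-2/tiles/36/M/YB/
```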


@Sunaina_Gupta - it sounds like the easiest option is to just move your compute environment to AWS, no?

Hi @scottyhq, @geoskeptic,

We’re using Sentinel imagery (S2 L1C) for mapping land-cover types in East Africa. Even though the AWS Sentinel-2 L2A data can be accessed without any egress cost, the S2 L1C product is in a requester-pays bucket and probably requires AWS authentication (I imagine the read would look something like the untested sketch below).
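Something like this, with AWS credentials configured in the environment and GDAL’s request-payer flag set; the object path is illustrative:

```python
# Untested sketch: reading from the requester-pays L1C bucket needs AWS
# credentials plus the request-payer flag (the read is billed to us).
# The object path below is illustrative.
import rasterio

with rasterio.Env(AWS_REQUEST_PAYER="requester"):
    with rasterio.open("s3://sentinel-s2-l1c/tiles/36/M/YB/2020/7/1/0/B04.jp2") as src:
        band = src.read(1)
```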

Google Earth Engine has a catalog entry for the S2 L1C product (Sentinel-2 MSI: MultiSpectral Instrument, Level-1C, COPERNICUS/S2), but I’ve not been able to use it with sat-search.


@rabernat Yes, the easiest option would be to move to AWS. However, our preferred cloud platform for this project is GCP, and our compute engines/clusters will be on GCP as well. Given that, accessing the datasets on Google Cloud Platform makes more sense from both performance and cost perspectives.


Another option would be to work with the GCP Sentinel-2 data but roll your own functions to search it. The way the data are stored and catalogued is described here (@scottyhq already posted this link): https://cloud.google.com/storage/docs/public-datasets/sentinel-2
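Since BigQuery came up in your original question, the public index table can serve as that search layer. A minimal sketch, assuming the google-cloud-bigquery client and the public bigquery-public-data.cloud_storage_geo_index.sentinel_2_index table (the tile, dates, and cloud-cover threshold are placeholders):

```python
# Sketch: search the public BigQuery index of the GCS Sentinel-2 bucket.
# Tile, dates, and cloud-cover threshold are placeholders.
from google.cloud import bigquery

client = bigquery.Client()
query = """
    SELECT granule_id, sensing_time, cloud_cover, base_url
    FROM `bigquery-public-data.cloud_storage_geo_index.sentinel_2_index`
    WHERE mgrs_tile = '36MYB'
      AND sensing_time BETWEEN TIMESTAMP('2020-06-01') AND TIMESTAMP('2020-06-30')
      AND cloud_cover < 20
    ORDER BY sensing_time
"""
for row in client.query(query).result():
    print(row.granule_id, row.base_url)   # base_url is the gs:// path to the .SAFE
```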

In general, it makes sense to do your processing wherever the big data live. In this case, the AWS data appear to have several technical advantages (STAC-compliant catalog, COGs, existence of a L2A corrected product). It may make sense to do some processing in AWS and then transfer the outputs of that step to GCP for integration with the rest of your platform.

One solution could be to use Dask Gateway to launch a Dask cluster in AWS from GCP! This is demonstrated here:
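In outline, connecting looks something like this sketch (the gateway address is a placeholder for wherever the AWS gateway is deployed):

```python
# Sketch: connect from anywhere (e.g. a GCP notebook) to a Dask Gateway
# running in AWS. The gateway address below is a placeholder.
from dask_gateway import Gateway

gateway = Gateway("https://your-aws-gateway.example.com")  # hypothetical address
cluster = gateway.new_cluster()
cluster.scale(20)
client = cluster.get_client()   # Dask client backed by workers in AWS
```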

That works but is a bit hard to set up in terms of credentials. An easier route might be to use Coiled to launch your Dask cluster in AWS from GCP:
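Roughly, assuming a Coiled account with AWS as the backend (the worker count is a placeholder):

```python
# Sketch: launch an AWS-backed Dask cluster via Coiled from a GCP machine.
# Assumes a Coiled account configured for AWS; worker count is a placeholder.
import coiled
from dask.distributed import Client

cluster = coiled.Cluster(n_workers=10)   # provisions Dask workers in AWS
client = Client(cluster)
```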


Thanks for your useful suggestions @geoskeptic. I do have some questions about this point:

I understand your point about egress costs. But I feel that framing it as “not a cool thing to do” is confusing. It is certainly not cool to blow up the cloud bills of other academic and research organizations who provide public data on their own very limited budgets. This has happened to us in Pangeo, motivating us to put most of our own data into requester-pays mode. But what exactly is our responsibility to Amazon itself, one of the richest and most powerful companies in the world? Jeff Bezos is worth $200 billion. Are scientists working on sustainable development with a shoestring budget ethically obligated to help Amazon save money? This is not to disparage AWS or any of the great people who work there and collaborate closely with us; I’m just saying that it feels weird to frame this as the user’s ethical responsibility rather than as a business transaction.

My understanding has been that, if AWS makes data public, people can feel free to do whatever they like with it. If AWS doesn’t like this, they have a clear option: turn on requester-pays, which forces people into their platform.

A slightly more nuanced argument might be: if too many people do this, AWS may reconsider their public dataset policies and stop hosting these valuable datasets, to the detriment of everyone. Even for this case, I’d be curious to know whether cross-cloud egress costs are significant in terms of their decision making.

@geoskeptic As per Hasan’s message, I think he was able to access the L2A data from AWS (without specifying any AWS credentials) but not the L1C data, even though both are listed under a requester-pays bucket on the Sentinel-2 page of the Registry of Open Data on AWS, which seems confusing.


Thanks @rabernat. Appreciate your response and suggestions. We will look into these options.

Thanks @scottyhq for sharing this info. It’s good to know!

Thanks… will give that a shot shortly.