Wednesday November 2nd 2022: Jupyter book tutorials demonstrating xarray-based workflows for cloud-hosted remote sensing data

This week the Pangeo Showcase welcomes Emma Marshall to talk about xarray pathways for cloud-hosted remote sensing assets in Jupyter. See the announcement below.

Meeting Logistics
Title: Jupyter book tutorials demonstrating xarray-based workflows for cloud-hosted remote sensing data
Invited Speaker: Emma Marshall, University of Utah
Twitter: @EmMar_22 | GitHub: e-marshall | ORCID: 0000-0001-6348-977X
Deepak Cherian (ORCiD: 0000-0002-6861-8734)
Scott Henderson (ORCiD: 0000-0003-0624-4965)
Jessica Scheick (ORCiD: 0000-0002-3421-4459)

When: Wednesday November 2nd 12PM EDT
Where: Launch Meeting - Zoom
Recent advances in satellite imagery availability, cloud-computing resources and open-source software represent exciting developments in earth science and climate research. Access to computing resources and storage are significant bottlenecks and barriers to participation that hinder efforts to undo historical legacies of exclusion in the sciences. Transitioning to cloud-based workflows has the potential to drastically increase efficiency and broaden scientific participation, two important objectives on the path to understanding and preparing mitigation strategies for a changing climate.

With evolving computational tools comes the associated need for detailed, accessible documentation and educational resources to increase usership of these tools and datasets. This work presents Jupyter Book tutorials focusing on datasets relevant to cryospheric research: Inter-Mission Time Series of Land Ice Velocity and Elevation (ITS_LIVE) glacier surface velocity and Sentinel-1 radiometric terrain-corrected (RTC) backscatter data. The tutorials demonstrate accessing and interacting with cloud-hosted remote sensing datasets on platforms such as Amazon Web Services (AWS) and Microsoft Planetary Computer (PC) using the open-source Python package xarray. We focus on complex, novel, real-world datasets that carry considerable scientific value. The tutorials were developed with an emphasis on accessible, explanatory text that includes solutions to commonly encountered errors, step-by-step descriptions of xarray functionality, and ways to incorporate xarray tools to improve common scientific pipelines. Scaling educational resources related to remote sensing datasets, cloud-computing resources and scientific data analysis is critical in order to leverage the potential of these resources and realize their benefit. This work provides broadly applicable, accessible examples and establishes a framework through which future examples may be developed.

Relevant material:


  • 5-15 minutes - Community showcase
  • 5-15 minutes - Q&A / Community check-in
  • 20-35 minutes - Agenda and Open discussion

I thought this was a fascinating talk. Thanks @e-marshall!

I’m interested in following up on the discussion about “what’s the best way to create a stack of L2 raster images?” Emma’s solution involved creating a GDAL VRT. @TomAugspurger mentioned stackstac. I also just discovered GeoWombat:

GeoWombat provides utilities to process geospatial and time series of raster data at scale. Easily process Landsat, Sentinel, Planetscope or RGB data and others.

…which seems aimed at a similar space.

Is there a canonical workflow or dataset here? Can we do some coordinated evaluation of these different solutions?

Also cc to @scottyhq, who I know is interested in this topic.

“what’s the best way to create a stack of L2 raster images”

In my (biased) opinion, STAC is the clear winner here.

The primary thing to figure out when converting a collection of rasters to a datacube is how they relate to each other spatio-temporally. In order to build the datacube, you need to know what area each raster covers in space and time.

There are multiple ways to get the spatio-temporal extents:

  1. Open each file (using rioxarray / rasterio / whatever) and check the values in the GeoTIFF headers. This is slow with many files.
  2. Use some external metadata provider, like STAC.
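As a rough illustration of the second option, here is a minimal sketch of deriving the datacube's extent from metadata alone. It assumes each raster is described by a STAC-like dictionary with a `bbox` (west, south, east, north) and a `properties.datetime` string. The item values and the `cube_extent` helper are hypothetical, made up for this example, and no raster file is opened.

```python
from datetime import datetime

# Hypothetical STAC-like records: metadata only, no raster file is opened.
items = [
    {"id": "scene_a", "bbox": [86.0, 27.5, 87.0, 28.5],
     "properties": {"datetime": "2021-06-01T00:00:00Z"}},
    {"id": "scene_b", "bbox": [86.5, 27.0, 87.5, 28.0],
     "properties": {"datetime": "2021-06-13T00:00:00Z"}},
    {"id": "scene_c", "bbox": [86.0, 27.5, 87.0, 28.5],
     "properties": {"datetime": "2021-06-13T00:00:00Z"}},
]


def parse_time(value):
    """Parse an ISO 8601 timestamp, accepting a trailing 'Z' for UTC."""
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def cube_extent(items):
    """Return the union bounding box and time range covered by the items."""
    wests, souths, easts, norths = zip(*(item["bbox"] for item in items))
    times = sorted(parse_time(item["properties"]["datetime"]) for item in items)
    return {
        "bbox": [min(wests), min(souths), max(easts), max(norths)],
        "start": times[0],
        "end": times[-1],
        # Scenes sharing a timestamp would land on the same time step.
        "n_times": len(set(times)),
    }


extent = cube_extent(items)
print(extent["bbox"], extent["start"], extent["end"], extent["n_times"])
```

The cost here scales with the number of metadata records rather than with the number of files that have to be opened over the network.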

Once you have the extents, there are several ways to get to an xarray data structure:

  1. Just use rioxarray.open_rasterio on the individual files (loading the extents “at runtime”) and merge / concatenate
  2. Build a VRT once ahead of time (slow), and read it in with rioxarray.open_rasterio (fast)
  3. STAC → VRT → xarray (TomAugspurger/stac-vrt on GitHub prototyped this; GDAL now has the STACIT driver, "Spatio-Temporal Asset Catalog Items", in its documentation)
  4. STAC → xarray DataArray with stackstac.
  5. STAC → xarray.Dataset with odc.stac.

See the GitHub issue "clarification on difference between this library and stackstac?" (Issue #54, opendatacube/odc-stac) for a comparison of stackstac and odc.stac. But the key thing is that they’re able to build the datacube without opening a single file.

It’s been a while since I looked at geowombat, but it seems to have STAC support now (see "Streaming data from cloud sources" in the GeoWombat 2.0.17 documentation; it seems to use stackstac internally).

I think the simplest, fastest, and most flexible option for users is going to be via STAC, using something like stackstac or odc.stac. This does, however, require STAC metadata, which might not be available, though the tooling here is getting better and maybe it is feasible for users to generate their own STAC metadata for their collection of rasters.


“what’s the best way to create a stack of L2 raster images”

I agree STAC is preferred. VRT (xml) and STAC (json) are essentially catalogs of raster images. But focusing on Emma’s case of ~100-1000 TIFs, the question becomes, what’s the recommended way to generate a catalog? It’s just much easier right now to run gdalbuildvrt -separate stack.vrt images/*tif than to generate STAC.
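For anyone scripting this step, here is a minimal sketch of assembling the same `gdalbuildvrt -separate` call from Python. The `vrt_command` and `build_vrt` helpers are hypothetical names introduced for illustration, and the sketch assumes GDAL's command-line tools are on the PATH. The file list is sorted explicitly because Python's `glob` does not guarantee an order, and the band order in the VRT follows the order of the input files.

```python
import glob
import subprocess


def vrt_command(pattern, out_path="stack.vrt"):
    """Return the gdalbuildvrt argument list for the files matching pattern."""
    # Sort so that bands appear in a predictable (alphabetical) order.
    files = sorted(glob.glob(pattern))
    if not files:
        raise FileNotFoundError(f"no files match {pattern!r}")
    # -separate places each input file into its own band of the VRT.
    return ["gdalbuildvrt", "-separate", out_path, *files]


def build_vrt(pattern, out_path="stack.vrt"):
    """Run gdalbuildvrt; assumes the GDAL command-line tools are installed."""
    subprocess.run(vrt_command(pattern, out_path), check=True)
    return out_path


# Example (requires GDAL and matching files):
# build_vrt("images/*tif")
```

The resulting stack.vrt can then be read with rioxarray.open_rasterio, as in the VRT option listed earlier in the thread. If the filenames do not sort chronologically, the band order will not be chronological either.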

An important aspect of the VRT is that it is a single file. As far as I know, to easily generate STAC with correct PROJ information you can use rio-stac (developmentseed/rio-stac on GitHub: "Create STAC item from raster datasets"). But that creates 1 JSON file per TIF. You then need to coerce these into a single file (consolidated metadata!) so that Xarray only needs to open a single file to understand the structure of the data cube. More discussion on that in "Best practices for loading *static* STAC catalogs" (Discussion #86, gjoseph92/stackstac on GitHub).

So I think a tool like rio stacify stack.geojson images/*tif that returns a FeatureCollection could be really useful?
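The consolidation step could look something like the sketch below. It assumes one STAC item JSON file per TIF already exists (for example, produced by rio-stac) and merges them into a single GeoJSON FeatureCollection. The `consolidate_items` helper and the file paths are hypothetical, and this is not how any existing tool is implemented.

```python
import glob
import json


def consolidate_items(pattern, out_path):
    """Merge per-raster STAC item files into one GeoJSON FeatureCollection."""
    features = []
    for path in sorted(glob.glob(pattern)):
        with open(path) as f:
            item = json.load(f)
        # A STAC item is a GeoJSON Feature; skip nothing silently.
        if item.get("type") != "Feature":
            raise ValueError(f"{path} does not look like a STAC item")
        features.append(item)
    collection = {"type": "FeatureCollection", "features": features}
    # Write to a different extension than the inputs (e.g. .geojson) so a
    # re-run with the same pattern does not pick up the output file.
    with open(out_path, "w") as f:
        json.dump(collection, f)
    return collection


# Example (paths are placeholders):
# consolidate_items("items/*.json", "stack.geojson")
```

A single file like this is the "consolidated metadata" idea: a reader only has to fetch one document to learn the extents of every raster in the stack.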
