Overview
I am a n00b when it comes to traditional GIS raster processing. I’m trying to better understand the available tools in this space.
I’m doing this through what seems to be a very typical workflow: compute a monthly median RGB image from Sentinel 2 data over a 0.5 x 0.5 degree region.
In this post, I compare the same workflow using odc.stac and stackstac. The two give slightly different results, and I’d like to understand why a bit better. More generally, as a user, it’s confusing to have two packages with such similar functionality in our ecosystem. While choice is good for users, it’s also good when we can align on a single set of best practices.
Step 1: STAC Search
I’m finding the assets that I want using STAC. This is not controversial. This seems to be the way that everyone is accessing this sort of data.
from datetime import datetime, timedelta
import xarray as xr
import pystac_client
import stackstac
import odc.stac
from odc.geo.geobox import GeoBox
from dask.diagnostics import ProgressBar
from rasterio.enums import Resampling
dx = 3/3600 # 90m resolution
epsg = 4326
# random location in the middle of the Amazon
bounds = (-64.0, -9.0, -63.5, -8.5)
minx, miny, maxx, maxy = bounds
geom = {
'type': 'Polygon',
'coordinates': [[
[minx, miny],
[minx, maxy],
[maxx, maxy],
[maxx, miny],
[minx, miny]
]]
}
year = 2020
month = 1
start_date = datetime(year, month, 1)
end_date = start_date + timedelta(days=31)
date_query = start_date.strftime("%Y-%m-%d") + "/" + end_date.strftime("%Y-%m-%d")
items = pystac_client.Client.open(
"https://earth-search.aws.element84.com/v1"
).search(
intersects=geom,
collections=["sentinel-2-l2a"],
datetime=date_query,
limit=100,
).item_collection()
print(len(items), "scenes found")
This returns 52 scenes.
Step 2A: Load Data with odc.stac
odc.stac
way
# define a geobox for my region
geobox = GeoBox.from_bbox(bounds, crs=f"epsg:{epsg}", resolution=dx)
# lazily combine items
ds_odc = odc.stac.load(
items,
bands=["scl", "red", "green", "blue"],
chunks={'time': 5, 'x': 600, 'y': 600},
geobox=geobox,
resampling="bilinear"
)
# actually load it
with ProgressBar():
ds_odc.load()
This took 68 seconds on my VM in AWS US-East-1
# define a mask for valid pixels (non-cloud)
def is_valid_pixel(data):
# include only vegetated, not_vegitated, water, and snow
return ((data > 3) & (data < 7)) | (data==11)
ds_odc['valid'] = is_valid_pixel(ds_odc.scl)
ds_odc.valid.sum("time").plot()
# compute the masked median
rgb_median = (
ds_odc[['red', 'green', 'blue']]
.where(ds_odc.valid)
.to_dataarray(dim="band")
.transpose(..., "band")
.median(dim="time")
)
(rgb_median / rgb_median.max() * 2).plot.imshow(rgb="band", figsize=(10, 8))
Step 2B: Load Data with `stackstac
ds_ss = stackstac.stack(
items,
bounds=bounds,
assets=["scl", "red", "green", "blue"],
epsg=epsg,
resolution=(dx, dx),
chunksize=(5, -1, 600, 600),
resampling=Resampling.bilinear
).reset_coords(drop=True)
A big difference in API design is that odc.stac
returns an xarray.Dataset
, with each band in a different variable, while stackstac
returns an xarray.DataArray
, with all the bands combined into a single array. stackstac
also defaults to float 64 for this data, while odc.stac
uses much cheaper integer dtypes.
# actually load
with ProgressBar():
ds_ss.load()
This took 66 seconds, nearly identical to odc.stac.
The mask looks nearly identical
valid_ss = is_valid_pixel(ds_ss[:, 0])
valid_ss.sum("time").plot()
However, the final image looks a bit different
rgb_median_ss = (
ds_ss[:, 1:]
.where(valid_ss)
.transpose(..., "band")
.median(dim="time")
)
(rgb_median_ss / rgb_median_ss.max() * 2).plot.imshow(rgb="band", figsize=(10, 8))
Comparison
Both packages took about the same time to process the data. But there were subtle differences in what they came up with.
They made different choices about what coordinates to use.
print(ds_ss.x.values[:5])
# [-64. -63.99916667 -63.99833333 -63.9975 -63.99666667]
print(ds_odc.longitude.values[:5])
# [-63.99958333 -63.99875 -63.99791667 -63.99708333 -63.99625 ]
And most importantly, the images that they produced look different, despite using ostensibly the same parameters.
I’d love if someone more knowledgeable could help me understand what’s happening here under the hood a bit better.
I’d also love it if someone could point out a way that I could be doing this better or more efficiently.