We’re exploring the Planetary Computer Hub for data processing. One requirement is a file format that’s easy to explore interactively, for quality checks and the like, through software like QGIS. COG / plain GeoTIFF seems excellent for this, whereas zarr is harder to explore interactively. Please correct me if I’m wrong!
Thanks to Tom Augspurger for suggesting zarr and xcog. xcog could be part of the solution in combination with a VRT.
Based on rioxarray’s documentation, we explored parallel COG / TIFF storage using dask.
Using rxr.open_rasterio() and ds.rio.to_raster(), we routed operations through the Planetary Hub’s gateway cluster. However, the dask dashboard doesn’t show the last part of the writing step: instead, the system sits idle for a while before the TIFF write completes.
For reference, our process resembles [this example from the pc docs], with an active dask client
on a gateway cluster. In our case, the data comes from a COG in blob storage. We’re seeking a format that supports dimensions - input as well as output - like {"band": 10, "x": 100_000, "y": 100_000}.
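For scale, a back-of-envelope calculation of that array’s size (assuming a 2-byte dtype, which the post doesn’t specify) shows why staging the whole output on a single node is impractical:

```python
# Rough size of the target array; the dtype is an assumption,
# the dimensions come from the requirement above.
shape = {"band": 10, "x": 100_000, "y": 100_000}
bytes_per_pixel = 2  # e.g. uint16
total_bytes = shape["band"] * shape["x"] * shape["y"] * bytes_per_pixel
print(total_bytes / 1e9)  # 200.0 -- i.e. ~200 GB, far beyond a hub node's memory
```

At that size, any approach that funnels the full result through one machine’s memory or disk becomes the bottleneck.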
import rioxarray as rxr
from adlfs import AzureBlobFileSystem

fs = AzureBlobFileSystem(**kwargs)  # storage-account credentials etc.
file_url = fs.url(blob_name)
ds = rxr.open_rasterio(file_url, chunks=True, lock=False)
I’ve observed that the write is parallelized when using rioxarray with the relevant dask args (dask.array.store), but the actual writing to disk is handled by the JupyterHub server node rather than being distributed across the gateway cluster workers. Our goal is to parallelize the writing of COG chunks to Azure Blob Storage.
The code snippet below highlights the issue: the data passes through a buffer, in memory on the hub or over the network. This setup implies a close dependency between the hub and the gateway cluster. Our intention is to leverage the gateway more effectively, minimizing network I/O costs and the dependency on the hub.
Relevant code snippet from the example from the pc docs:
import io

with io.BytesIO() as buffer:
    ndvi.rio.to_raster(buffer, driver="COG")
    buffer.seek(0)
    blob_client = container_client.get_blob_client("ndvi-wb.tif")
    blob_client.upload_blob(buffer, overwrite=True)
While investigating, I came across the “Put Block List” operation in Microsoft’s docs. However, I’m not sure how relevant it is for the use case above.
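For what it’s worth, azure-storage-blob exposes Put Block / Put Block List as stage_block() and commit_block_list() on BlobClient. A sketch of the ID bookkeeping is below; the staging calls themselves are shown as comments because they need real credentials. Note that this only assembles bytes in order, so each worker would have to know its chunk’s byte range in the final COG in advance:

```python
import base64

def block_id(index: int) -> str:
    # Azure block IDs must be base64 strings of equal length within a blob;
    # zero-padding the index before encoding guarantees both properties.
    return base64.b64encode(f"{index:08d}".encode()).decode()

# Sketch, assuming azure-storage-blob:
#
#   from azure.storage.blob import BlobBlock
#   blob_client.stage_block(block_id(i), chunk_bytes)        # one call per worker
#   blob_client.commit_block_list(
#       [BlobBlock(block_id(i)) for i in range(n_blocks)]    # once, at the end
#   )
```

Staged blocks are independent uploads, so the per-worker stage_block() calls can run in parallel on the gateway cluster; only the final commit needs to happen in one place.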
Any advice?
Parallel discussion @ Parallel COG / TIFF Storage · microsoft/PlanetaryComputer · Discussion #257 · GitHub