We’re exploring the Planetary Computer Hub for data processing. One requirement is a file format that’s easy to explore interactively, for quality checks and the like, through software like QGIS. COG / plain GeoTIFF seems excellent for this, whereas zarr is harder to explore interactively. Please correct me if I’m wrong!
Thanks to Tom Augspurger for suggesting zarr and xcog. xcog could be part of the solution in combination with a VRT.
Based on rioxarray’s documentation, we explored parallel COG / TIFF storage using dask.
Using rxr.open_rasterio() and ds.rio.to_raster(), we routed operations through the Planetary Hub’s gateway cluster. However, the dask dashboard doesn’t show the last part of the writing step: instead, the system sits idle for a while before the TIFF write completes.
For reference, our process resembles [this example from the pc docs], with an active dask client
on a gateway cluster. In our case, the data comes from a COG in blob storage. We’re seeking a format that supports dimensions - input as well as output - like {"band": 10, "x": 100_000, "y": 100_000}.
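For scale, a back-of-envelope calculation of that array’s size (assuming a 2-byte dtype, which the post doesn’t specify) shows why staging the whole output on a single node is impractical:

```python
# Rough size of the target array; the dtype is an assumption,
# the dimensions come from the requirement above.
shape = {"band": 10, "x": 100_000, "y": 100_000}
bytes_per_pixel = 2  # e.g. uint16
total_bytes = shape["band"] * shape["x"] * shape["y"] * bytes_per_pixel
print(total_bytes / 1e9)  # 200.0 -- i.e. ~200 GB, far beyond a hub node's memory
```

At that size, any approach that funnels the full result through one machine’s memory or disk becomes the bottleneck.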
import rioxarray as rxr
from adlfs import AzureBlobFileSystem

fs = AzureBlobFileSystem(**kwargs)  # storage-account credentials etc.
file_url = fs.url(blob_name)
ds = rxr.open_rasterio(file_url, chunks=True, lock=False)
I’ve observed that the write is parallelized when using rioxarray with the relevant dask args (dask.array.store), but the actual writing to disk is handled by the JupyterHub server node rather than being distributed across the gateway cluster workers. Our goal is to parallelize the writing of COG chunks to Azure Blob Storage.
The code snippet below highlights the issue: the data passes through a buffer, in memory on the hub or over the network. This setup implies a close dependency between the hub and the gateway cluster. Our intention is to leverage the gateway more effectively, minimizing network I/O costs and the dependency on the hub.
Relevant code snippet from the example from the pc docs:
import io

with io.BytesIO() as buffer:
    ndvi.rio.to_raster(buffer, driver="COG")
    buffer.seek(0)
    blob_client = container_client.get_blob_client("ndvi-wb.tif")
    blob_client.upload_blob(buffer, overwrite=True)
While investigating, I came across the “Put Block List” operation in Microsoft’s docs. However, I’m not sure how relevant it is for the use case above.
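For what it’s worth, azure-storage-blob exposes Put Block / Put Block List as stage_block() and commit_block_list() on BlobClient. A sketch of the ID bookkeeping is below; the staging calls themselves are shown as comments because they need real credentials. Note that this only assembles bytes in order, so each worker would have to know its chunk’s byte range in the final COG in advance:

```python
import base64

def block_id(index: int) -> str:
    # Azure block IDs must be base64 strings of equal length within a blob;
    # zero-padding the index before encoding guarantees both properties.
    return base64.b64encode(f"{index:08d}".encode()).decode()

# Sketch, assuming azure-storage-blob:
#
#   from azure.storage.blob import BlobBlock
#   blob_client.stage_block(block_id(i), chunk_bytes)        # one call per worker
#   blob_client.commit_block_list(
#       [BlobBlock(block_id(i)) for i in range(n_blocks)]    # once, at the end
#   )
```

Staged blocks are independent uploads, so the per-worker stage_block() calls can run in parallel on the gateway cluster; only the final commit needs to happen in one place.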
Any advice?
Parallel discussion @ Parallel COG / TIFF Storage · microsoft/PlanetaryComputer · Discussion #257 · GitHub