Optimizing reads from COGs

I’m developing an app that displays multiband COGs representing subsurface layers, stored in Azure blobs, in map views and cross sections. The files can be larger than memory, so I load them lazily in chunks.

I’m using rioxarray to open the COG, and I handle switching between overview levels based on the map-view figure extents. All very good.

But I would also like to display an arbitrary transect through the raster layers, and therefore I need to query x,y points along this transect.
Due to the COG’s blocksize, it appears that rioxarray downloads entire blocks of data at a time, which is 100-1000x more data than I need for the transect (I only need a sparsely sampled line through each block). I was wondering if it would be quicker to query the pixels individually instead.

I have tried to open the dataset with gdal.Open(), but this seems to load the data eagerly.

Is there a way to query individual pixels from the raster without having to load entire blocks?

Any advice is appreciated (alternative approaches as well).


Good question @ncvangilse. Links to the specific data you are using and sharing some code could help identify possible options.

As far as I’m aware, the minimal unit for GDAL reads is a “tile” or “block”. Most COGs are compressed, so even if you want the value of just one pixel you need at least two HTTP GET requests: one to read the metadata (default = 16 KB) and a second to fetch the data from a block (also a default minimum of 16 KB). Even for a single pixel you must read those bytes into RAM, decompress the block, and only then retrieve your specific pixel value. Note that not all COGs are equal; differences in blocksize, dtype, and compression can matter a lot for read performance.
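
If it’s useful, here is a quick way to see what your file’s block layout implies for single-pixel reads (a rough sketch; the URL is just a placeholder for your Azure blob):

    import numpy as np
    import rasterio

    # Placeholder URL -- substitute your own COG, reachable via /vsicurl/
    url = "/vsicurl/https://example.blob.core.windows.net/container/layers.tif"

    with rasterio.open(url) as src:
        blocky, blockx = src.block_shapes[0]             # (height, width) of band-1 blocks
        itemsize = np.dtype(src.dtypes[0]).itemsize
        block_bytes = blockx * blocky * itemsize         # uncompressed bytes per block
        print(f"blocksize {blockx}x{blocky}, dtype {src.dtypes[0]}, "
              f"compression {src.compression}")
        print(f"~{block_bytes / 1024:.0f} KiB decompressed per block, "
              f"even if you only want one pixel from it")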

GDAL is great because it can handle anything, but best performance generally requires fine-tuning via environment variables and configuration settings (try experimenting with GDAL Virtual File Systems (compressed, network hosted, etc…): /vsimem, /vsizip, /vsitar, /vsicurl, … — GDAL documentation, Configuration options — GDAL documentation).
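
For remote COGs behind /vsicurl/, a handful of those settings tend to matter most. Something like the following can be wrapped around your rioxarray calls; the values here are just starting points to experiment with, not recommendations:

    import rasterio
    import rioxarray

    # Sketch of GDAL tuning for HTTP range reads; adjust and benchmark for your data.
    env = rasterio.Env(
        GDAL_DISABLE_READDIR_ON_OPEN="EMPTY_DIR",   # don't list the directory/container on open
        CPL_VSIL_CURL_ALLOWED_EXTENSIONS=".tif",    # skip probing for sidecar files
        GDAL_HTTP_MERGE_CONSECUTIVE_RANGES="YES",   # coalesce adjacent range requests
        GDAL_INGESTED_BYTES_AT_OPEN="32768",        # pull more header bytes in the first GET
        VSI_CACHE="TRUE",                           # cache fetched blocks in memory
    )

    url = "https://example.blob.core.windows.net/container/layers.tif"  # placeholder
    with env:
        arr = rioxarray.open_rasterio(url, chunks=True)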

Another option is to push forward with a non-GDAL read-optimized package like GitHub - geospatial-jeff/aiocogeo: Asynchronous cogeotiff reader. I think that was a really promising prototype!

Hi @scottyhq, thanks for your reply.
I’m currently working on a proprietary dataset, but I will see if I can find or make a sample I can share. Do you know a good place to host it publicly?

I have also tried to figure out whether I could just derive the byte addresses for the pixels and make the HTTP range requests myself, but that route has a long way to go for me.
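
Roughly what I had in mind was reading the TileOffsets/TileByteCounts tags from the IFD and issuing the range request myself, something like the sketch below (the URL is a placeholder). The compressed tile still has to be decompressed and any predictor undone, which is where it gets involved for me:

    import fsspec
    import requests
    import tifffile

    url = "https://example.blob.core.windows.net/container/layers.tif"  # placeholder

    # Read only the TIFF metadata over HTTP to find where each tile lives.
    with fsspec.open(url, "rb") as f:
        page = tifffile.TiffFile(f).pages[0]          # full-resolution IFD
        offsets = page.tags["TileOffsets"].value
        counts = page.tags["TileByteCounts"].value

    # Fetch the raw bytes of one tile with a single HTTP range request.
    tile_index = 0                                     # the tile covering the pixel of interest
    start, length = offsets[tile_index], counts[tile_index]
    resp = requests.get(url, headers={"Range": f"bytes={start}-{start + length - 1}"})
    compressed_tile = resp.content                     # still DEFLATE/LZW-compressed at this point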

In that research I found aiocogeo, which looks promising, but I see that the last commit was 3 years ago. Its dependencies are out of date and required some conda environment hacking just to pip install, but I now have a running version. Unfortunately there appear to be quite a few bugs, and it is not even able to initialize the COGReader. (Some bugs are easy to fix, but currently I am stuck where it tries to build the geotransform: self.ifds[0].ModelPixelScaleTag is None, while aiocogeo expects it to be an iterable.)

Here is a screenshot of my prototype app that allows the user to click points on the map (left) and instantly see the cross section:

My current pseudo code to extract the cross sections:

Initialization:
  First, detect the optimal chunk size for dask:
   - chunks = dict(x=cog_blockx, y=cog_blocky, band=-1)
  
  Open the COG as arr:
   - arr = rioxarray.open_rasterio(tif_file, chunks=chunks)

map._on_points_changed_callback():
  - Extract the cross section endpoints as p1 and p2
  - get x,y coordinates for cross section as linspace from p1 to p2 with n=number of x-pixels in the cross section
  - get the cross section data by:
    arr.interp(dict(x=x, y=y), method='linear', assume_sorted=True)

It is this last step where I would like to avoid downloading entire blocks when I only need a handful of individual pixels.
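
For reference, here is a more concrete version of that callback (the path, chunk sizes, and coordinates are placeholders). The x/y samples are wrapped in DataArrays sharing a "distance" dim so interp samples point-by-point along the line rather than over the whole x-y grid:

    import numpy as np
    import rioxarray
    import xarray as xr

    arr = rioxarray.open_rasterio("layers.tif", chunks=dict(band=-1, x=512, y=512))

    def cross_section(arr, p1, p2, n):
        """Sample arr along the line p1 -> p2 at n points; p1 and p2 are (x, y) tuples."""
        xs = xr.DataArray(np.linspace(p1[0], p2[0], n), dims="distance")
        ys = xr.DataArray(np.linspace(p1[1], p2[1], n), dims="distance")
        # Shared "distance" dim => pointwise interpolation along the transect.
        return arr.interp(x=xs, y=ys, method="linear")

    section = cross_section(arr, p1=(450_000.0, 6_200_000.0), p2=(460_000.0, 6_210_000.0), n=500)

Dask still has to load every block the line touches, which is exactly the overhead I would like to avoid.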

I’ll second that the fundamental limit is the chunking in the target files; there’s nothing you can do about that in most cases. In some rare cases a chunk might have substructure (inner compressed blocks) that could in principle be accessed directly, but there is no good tooling for this.

Then your question becomes one of API convenience and access style. GDAL, cogeotiff, and my own kerchunk all load chunks by byte range, potentially concurrently and with parallel decoding, but with different implementations. None of that will particularly help your memory-footprint problem!

Awesome comments! It’s cool to get in touch with guys like you who know the ins and outs so well!

I suppose this means that to optimize the cross-section queries I would write the tifs using a blocksize of 16 (which I think is the minimum for GeoTIFF)? Do you see any downside to this? Do you have experience with how much overhead this will cost?
In case there is a significant overhead to using a blocksize of 16, do you know if it is possible to build the tif with different blocksizes, so the main arrays could have a blocksize of 16 (for a light I/O footprint for cross sections) while the overviews have coarser blocksizes (for efficient map-view queries)?

If you need to rechunk them anyway, you might choose zarr instead, so that you can have chunks along the band dimension that are bigger than 1.
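
A minimal sketch of what I mean (the output path and spatial chunk sizes are placeholders); with all bands in one chunk, a vertical profile at any x, y becomes a single chunk read:

    import rioxarray

    # Re-chunk a (band, y, x) GeoTIFF into a zarr store; band=-1 keeps all bands in one chunk.
    arr = rioxarray.open_rasterio("layers.tif", chunks=dict(band=-1, x=256, y=256))
    arr.to_dataset(name="layers").to_zarr("layers.zarr", mode="w")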

Geotiff has the benefit of being compatible with most software packages on the market for subsurface modelling and GIS - I think the adoption of zarr is somewhat lower?

In THIS forum, adoption of zarr is high :). But indeed not so elsewhere.


For sharing example files with remote access, you might be able to just use GitHub: GitHub - scottyhq/share-a-raster: template repository for sharing a COG

There is an example there of obscuring proprietary files using rio-faux, but for performance questions like this it’s probably best to use the actual data, so even a public subset would be helpful.

Note that on the Pangeo JupyterHub there is also a “Scratch Bucket” to which large datasets can be temporarily uploaded so that anyone on the JupyterHub can access them, so it’s a ‘semi-public’ alternative.

I suppose this means that to optimize the cross-section queries I would write the tifs using a blocksize of 16 (which I think is the minimum for GeoTIFF)? Do you see any downside to this? Do you have experience with how much overhead this will cost?

You could definitely try it! If you are only exposing the data to users via an app it might make sense, but if you eventually want to allow people to do other things with this data, the more standard 256x256 blocksize would probably be best. There is always going to be a tradeoff, I think, between data flexibility and performance.
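
If you do want to experiment with the split blocksizes you asked about, something along these lines with the GDAL Python bindings should set it up (the paths are placeholders, and it’s worth checking whether the result still validates as a strict COG; GDAL_TIFF_OVR_BLOCKSIZE controls the overview tile size, default 128):

    from osgeo import gdal

    src = "layers_source.tif"      # placeholder
    dst = "layers_block16.tif"     # placeholder

    # Full-resolution IFD with 16x16 tiles (TIFF tile sizes must be multiples of 16).
    gdal.Translate(dst, src,
                   creationOptions=["TILED=YES", "BLOCKXSIZE=16", "BLOCKYSIZE=16",
                                    "COMPRESS=DEFLATE"])

    # Build internal overviews with a coarser tile size.
    gdal.SetConfigOption("GDAL_TIFF_OVR_BLOCKSIZE", "256")
    ds = gdal.Open(dst, gdal.GA_Update)
    ds.BuildOverviews("AVERAGE", [2, 4, 8, 16])
    ds = None   # flush and close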

I found aiocogeo, which looks promising, but I see that the last commit was 3 years ago

Just needs someone with enough familiarity and motivation to pick it back up! :) Sounds like it’s not currently in a good state, but for read-only applications I think a tool like this could really make things snappier. Supporting aiocogeo was an idea on the stackstac roadmap, but it hasn’t been a priority…