Decode GeoTIFF to GPU memory

Sharing this blog post on speeding up Cloud-optimized GeoTIFF (COG) reads using a CUDA GPU library called nvTIFF that I’ve been playing around with for the past month. Hopefully it’ll be useful for folks struggling to use GDAL effectively, or those who don’t want to convert from COG → Format X because … data duplication.

Excerpt from the blog post:

Preliminary benchmark results from reading a 318MB Sentinel-2 True-Colour Image (TCI) Cloud-optimized GeoTIFF file (S2A_37MBV_20241029_0_L2A) with DEFLATE compression:

The top row shows the GPU-based nvTIFF+nvCOMP taking about 0.35 seconds (~900MB/s throughput), compared to the CPU-based GDAL taking 1.05 seconds (~300MB/s throughput) and image-tiff taking 1.75 seconds (~180MB/s throughput). These Cloud-optimized GeoTIFF reads were done from a local disk rather than over the network.

Do take this 3x speedup of nvTIFF over GDAL with a grain of salt. On one hand, I probably haven’t optimized the GDAL or nvTIFF code that much yet. On the other hand, I haven’t even included the CPU → GPU transfer cost that would come into play if I were using this in a machine learning workflow!
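
For a rough sense of what that transfer would add, here’s a back-of-the-envelope sketch (not from the blog post) that times a host → device copy of an array the same size as the decoded TCI, assuming CuPy and a CUDA GPU are available:

import time
import numpy as np
import cupy as cp

# Same shape as the decoded 10980x10980 3-band uint8 TCI image (~362MB).
host_array = np.empty((3, 10980, 10980), dtype=np.uint8)

t0 = time.perf_counter()
device_array = cp.asarray(host_array)  # copy host -> device
cp.cuda.Stream.null.synchronize()      # wait for the (possibly async) copy to finish
print(time.perf_counter() - t0)        # the extra cost a CPU-based reader would pay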

For those interested, the code is currently in an experimental PR here. Hoping to polish things up in the next few months, but appreciate any feedback!


Awesome, really appreciate all this detail! Could you please run this in Python and report the timing, so I can gauge what’s comparable on your system?

from osgeo import gdal
gdal.UseExceptions()

import time
t0 = time.time()
ds = gdal.Open("TCI.tif")  ## in an empty directory, or set GDAL_DISABLE_READDIR_ON_OPEN=EMPTY_DIR
d = ds.ReadRaster()
t1 = time.time()
print(t1 - t0)  ## about 2 seconds for me

type(d)
#<class 'bytearray'>
len(d)
#361681200
ds.RasterXSize * ds.RasterYSize * ds.RasterCount
#361681200

I can get the same timing in R by parallelizing the read of blocks (16 CPUs), but on my system 2s seems to be the best I can get, and increasing the block size (multiples of the native 1024) had no impact. (I have better disk perf on another system but don’t have access to that right now.) I’m excited to explore this Rust code.

Awesome! I’m curious - if I understand correctly, we already have zarr-to-GPU, and we have VirtualiZarr, so could we just use zarr-to-GPU + VirtualiZarr as a runtime translation layer to achieve the same thing that this GeoTIFF-to-GPU code does?

Also, FWIW, using the LIBERTIFF driver provides quite a bit more perf:

from osgeo import gdal
gdal.UseExceptions()

ds = gdal.OpenEx("TCI.tif", allowed_drivers = ["LIBERTIFF"], open_options = ["NUM_THREADS=16"])
d = ds.ReadRaster()

Here are the timings @Michael_Sumner :laughing:

Standard GTiff driver (GDAL 3.10.3)

from osgeo import gdal
import time
import os

gdal.UseExceptions()
os.environ["GDAL_DISABLE_READDIR_ON_OPEN"] = "EMPTY_DIR"

# %%
%%timeit
t0 = time.perf_counter()
ds = gdal.Open("benches/TCI.tif")
d = ds.ReadRaster()
t1 = time.perf_counter()
print(t1 - t0)
# 1.29 s ± 37.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

1.29s (Python) - 1.05s (Rust) means about 0.25s of extra overhead from Python.

LiberTIFF driver (GDAL 3.11.0)

# %%
%%timeit
t0 = time.perf_counter()
ds = gdal.OpenEx("benches/TCI.tif", allowed_drivers = ["LIBERTIFF"], open_options = ["NUM_THREADS=16"])
d = ds.ReadRaster()
t1 = time.perf_counter()
print(t1 - t0)
# 192 ms ± 2.68 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

At ~0.2s, LiberTIFF is about 0.15s faster than nvTIFF’s 0.35s! Guess I’ve got some work to do (I still need to benchmark true CPU → GPU timings). My guess is that for small COGs, GDAL+GTiff/LiberTIFF might be performant enough, but larger COGs could benefit from nvTIFF’s GPU-based decoding. I’ll run the numbers to verify.

Edit: I will note though, as mentioned in the blog post, that multi-threaded GDAL+LiberTIFF will clash with PyTorch multiprocessing, so there’s still value in off-loading decoding to the GPU instead of staying on the CPU. Single-threaded LiberTIFF takes 879 ms ± 3.18 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) on my laptop, so the gap there is 0.88s (LiberTIFF, 1 thread) - 0.35s (nvTIFF) = 0.53s.
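
To illustrate the clash: in a PyTorch DataLoader with num_workers > 0, each worker is its own process, so GDAL-internal threads multiply across workers and compete for the same cores. Below is a minimal sketch of the usual workaround (single-threaded decoding per worker); it assumes GDAL’s Python bindings with NumPy support, and the COGDataset class is made up for illustration:

import torch
from osgeo import gdal
from torch.utils.data import DataLoader, Dataset

gdal.UseExceptions()

class COGDataset(Dataset):
    def __init__(self, paths):
        self.paths = paths

    def __len__(self):
        return len(self.paths)

    def __getitem__(self, idx):
        # Open inside the worker process (GDAL Dataset handles don't pickle),
        # and keep decoding single-threaded so N workers don't spawn N x 16 threads.
        ds = gdal.OpenEx(
            self.paths[idx],
            allowed_drivers=["LIBERTIFF"],
            open_options=["NUM_THREADS=1"],
        )
        array = ds.ReadAsArray()  # numpy array of shape (bands, rows, cols)
        return torch.from_numpy(array)

loader = DataLoader(COGDataset(["benches/TCI.tif"]), batch_size=1, num_workers=4)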

That’s what I’ve been wondering for years since this post (whether we could use kerchunk back then, or VirtualiZarr now, to do direct-to-GPU reads).

My understanding is that we would need zarr-python/VirtualiZarr to support these GPU-native libs:

                              CPU                                           GPU
TIFF metadata/IFD decoding    async-tiff (Rust) + virtual-tiff (Python)     ?
Decompression                 numcodecs (Python/Cython)                     nvCOMP (C++)

The ? is the key part. I’m proposing that nvTIFF is the more direct way of reading COGs to the GPU. The VirtualiZarr way would go through kvikio.zarr.GDSStore, and if that works, it could in theory be faster since it uses cuFile. Sadly, nvTIFF doesn’t actually use cuFile yet, but I think it’s only a matter of time.
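
For comparison, here’s a rough sketch of what that kvikio route could look like, assuming the zarr-python 2.x-era kvikio API and an already-materialised Zarr store on local disk ("TCI.zarr" is a placeholder, not something benchmarked in this thread):

import cupy
import zarr
import kvikio.zarr

# File reads go through cuFile/GDS where the hardware and driver support it.
store = kvikio.zarr.GDSStore("TCI.zarr")

# meta_array=cupy.empty(()) asks zarr to hand back CuPy arrays instead of NumPy.
z = zarr.open_array(store, mode="r", meta_array=cupy.empty(()))

gpu_data = z[:]  # chunks end up in GPU memory; decompression stays on the CPU
                 # unless the store was written with GPU-aware (nvCOMP-backed) codecs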

My hot take is that reading L2 GeoTIFF data to the GPU shouldn’t need to rely on Zarr or wait for the GeoZarr spec. Also, virtualizarr is Python-only for now, and I do think we should be building something that is cross-language compatible, which GDAL+LiberTIFF is doing for CPU workflows; I’m hoping that Rust bindings to nvTIFF will play that role for GPU workflows.


That seems right.

Reasonable.

I think this is totally orthogonal.

In what sense is that cross-language compatible? That you can bind to it from other low-level languages?

Yes, writing these I/O libraries in C/Rust allows us to create bindings for Python/R/JavaScript (WebAssembly)/etc. See e.g. what Arrow/GeoArrow has done for tabular data.

Besides cross-language, I’m also keen on getting cross-device compatibility working, and as mentioned here, I’m pushing for DLPack to be the standard in-memory tensor format that will allow data exchange between Intel CPUs/CUDA GPUs/AMD ROCm/Apple Silicon/etc. This will enable better ‘separation of storage and compute’ (Zarr is almost exclusively tied to zarr-python/xarray; same with GeoTIFF and GDAL), because then you can store data in any format that can go into DLPack, and use whatever compute engine that reads from DLPack (PyTorch/JAX/MLX/etc.) to run your algorithms.
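
As a concrete (if trivial) example of what that exchange looks like, here’s a minimal DLPack round-trip between CuPy and PyTorch, assuming both are installed with CUDA support; the array is just a stand-in for a decoded COG tile:

import cupy as cp
import torch

gpu_array = cp.zeros((3, 1024, 1024), dtype=cp.uint8)  # stand-in for a decoded tile on the GPU

torch_tensor = torch.from_dlpack(gpu_array)  # zero-copy: a PyTorch view of the CuPy memory
back_to_cupy = cp.from_dlpack(torch_tensor)  # and back again, still without copying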


virtualizarr is Python-only for now, and I do think we should be building something that is cross-language compatible

I mean maybe we should have rust-powered virtualizarr parsers… :grin:

From your blog post:

I’m proposing we build composable pieces to handle every layer of decoding a Cloud-optimized GeoTIFF:

The first two steps are general enough to be used by other data formats (Zarr, HDF5, etc.); it is only the third step, TIFF tag metadata parsing, that requires custom logic.

But the last step is exactly what a VirtualiZarr Parser for TIFF is meant to do! And that approach is not restricted to COGs (which zero people outside of the geospatial community use or ever will use). It effectively isolates the absolute minimum amount of code that needs to be format-specific (the parser). That’s what I imagine full composability would look like.

I’m probably missing something but couldn’t you do something like:

  • Parse the COG’s TIFF metadata in Python using virtualizarr
  • Now use zarr-python / zarrs + numcodecs / nvCOMP to read and decompress the actual bytes, either on CPU or GPU (using DLPack)

And if you wanted all the logic to be cross-language compatible, the only part left to do is to port the virtualizarr parser to Rust/C, which would presumably be fairly straightforward if you had started by wrapping a Rust-powered TIFF parser like async-tiff.

I think we’re both tackling the problem (GeoTIFF-to-GPU parsing) from two ends, and eventually things will converge :twisted_rightwards_arrows:. VirtualiZarr is approaching it from a protocol-based Python abstraction: what you have in Parser is essentially what’s called a trait in Rust, the difference being that Python does runtime checks (to see if you conform to the protocol) whereas Rust does compile-time enforcement. I’m tackling things from the lower-level implementation side, which is either nvTIFF (CUDA GPU-based), async-tiff (CPU-based, which I’ve also got a foot in), or whatever custom parser logic still needs to exist to read the GeoTIFF format.
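
To make the protocol-vs-trait point concrete in Python terms (Parser here is a made-up stand-in, not VirtualiZarr’s actual class):

from typing import Protocol, runtime_checkable

@runtime_checkable
class Parser(Protocol):
    def parse(self, path: str) -> dict:
        """Return chunk byte ranges and codec metadata for one file."""
        ...

class TIFFParser:
    # No inheritance needed: having a matching parse() method is enough,
    # and conformance is only checked at runtime (e.g. via isinstance),
    # whereas a Rust trait would be enforced at compile time.
    def parse(self, path: str) -> dict:
        return {"chunks": {}, "codec": "deflate"}

assert isinstance(TIFFParser(), Parser)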

Protocols are tricky to define, and I do have a lot of respect for how things have evolved from kerchunk’s JSON-based format, to Parquet, to VirtualiZarr’s Parser protocol! I think we’ve more or less settled on a network protocol (fsspec/object_store) and a buffer/bytes decompression protocol (numcodecs/compress trait); it’s that last mile of parsing custom n-dimensional file formats (HDF5/TIFF/Zarr/etc.) that I see VirtualiZarr trying to solve. I think you’re doing a good job for CPU-based parsing in Python at the moment, but we might need more work (in the future) to support cross-device (CUDA/ROCm/Metal) and cross-language (Python/R/JavaScript/etc.) use, which is where I’m going with DLPack.