Decode GeoTIFF to GPU memory

Sharing this blog post on speeding up Cloud-optimized GeoTIFF (COG) reads using a CUDA GPU library called nvTIFF that I’ve been playing around with for the past month. Hopefully it’ll be useful for folks struggling to use GDAL effectively, or those who don’t want to convert from COG → Format X because … data duplication.

Excerpt from the blog post:

Preliminary benchmark results from reading a 318MB Sentinel-2 True-Colour Image (TCI) Cloud-optimized GeoTIFF file (S2A_37MBV_20241029_0_L2A) with DEFLATE compression:

The top row shows the GPU-based nvTIFF+nvCOMP taking about 0.35 seconds (~900MB/s throughput), compared to the CPU-based GDAL taking 1.05 seconds (~300MB/s throughput) and image-tiff taking 1.75 seconds (~180MB/s throughput). These Cloud-optimized GeoTIFF reads were done from a local disk rather than over the network.

Do take this 3x speedup of nvTIFF over GDAL with a grain of salt. On one hand, I probably haven’t optimized the GDAL or nvTIFF code that much yet. On the other hand, I haven’t even included the CPU → GPU transfer cost that would come into play if I were using this in a machine learning workflow!
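
For a rough sense of what that transfer would add, here’s a back-of-the-envelope sketch (not from the blog post) that times a host → device copy of an array the same size as the decoded TCI, assuming CuPy and a CUDA GPU are available:

import time
import numpy as np
import cupy as cp

# Same shape as the decoded 10980x10980 3-band uint8 TCI image (~362MB).
host_array = np.empty((3, 10980, 10980), dtype=np.uint8)

t0 = time.perf_counter()
device_array = cp.asarray(host_array)  # copy host -> device
cp.cuda.Stream.null.synchronize()      # wait for the (possibly async) copy to finish
print(time.perf_counter() - t0)        # the extra cost a CPU-based reader would pay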

For those interested, the code is currently in an experimental PR here. Hoping to polish things up in the next few months, but appreciate any feedback!


Awesome, really appreciate all this detail! Could you please run this in Python and report the timing, so I can gauge what’s comparable on your system?

from osgeo import gdal
gdal.UseExceptions()

import time
t0 = time.time()
ds = gdal.Open("TCI.tif")  ## in an empty directory, or set GDAL_DISABLE_READDIR_ON_OPEN=EMPTY_DIR
d = ds.ReadRaster()
t1 = time.time()
print(t1 - t0)  ## about 2 seconds for me

type(d)
#<class 'bytearray'>
len(d)
#361681200
ds.RasterXSize * ds.RasterYSize * ds.RasterCount
#361681200

I can get the same timing in R by parallelizing the read of blocks (16 CPUs), but on my system 2s seems to be the best I can get, and increasing the block size (multiples of the native 1024) had no impact. (I have better disk perf on another system but don’t have access to that right now.) I’m excited to explore this Rust code.

Awesome! I’m curious - if I understand correctly, we already have zarr-to-GPU, and we have VirtualiZarr, so could we just use zarr-to-GPU + VirtualiZarr as a runtime translation layer to achieve the same thing that this GeoTIFF-to-GPU code does?

Also, FWIW, using the LIBERTIFF driver provides quite a bit more perf:

from osgeo import gdal
gdal.UseExceptions()

ds = gdal.OpenEx("TCI.tif", allowed_drivers = ["LIBERTIFF"], open_options = ["NUM_THREADS=16"])
d = ds.ReadRaster()

Here are the timings @Michael_Sumner :laughing:

Standard GTiff driver (GDAL 3.10.3)

from osgeo import gdal
import time
import os

gdal.UseExceptions()
os.environ["GDAL_DISABLE_READDIR_ON_OPEN"] = "EMPTY_DIR"

# %%
%%timeit
t0 = time.perf_counter()
ds = gdal.Open("benches/TCI.tif")
d = ds.ReadRaster()
t1 = time.perf_counter()
print(t1 - t0)
# 1.29 s ± 37.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

1.29s (Python) - 1.05s (Rust) means about 0.25s of extra overhead from Python.

LiberTIFF driver (GDAL 3.11.0)

# %%
%%timeit
t0 = time.perf_counter()
ds = gdal.OpenEx("benches/TCI.tif", allowed_drivers = ["LIBERTIFF"], open_options = ["NUM_THREADS=16"])
d = ds.ReadRaster()
t1 = time.perf_counter()
print(t1 - t0)
# 192 ms ± 2.68 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

At ~0.2s, LiberTIFF is about 0.15s faster than nvTIFF’s 0.35s! Guess I’ve got some work to do (I still need to benchmark true CPU → GPU timings). My guess is that for small COGs, GDAL+GTiff/LiberTIFF might be performant enough, but larger COGs could benefit from nvTIFF’s GPU-based decoding. I’ll run the numbers to verify.

Edit: I will note though, as mentioned in the blog post, that multi-threaded GDAL+LiberTIFF will clash with PyTorch multiprocessing, so there’s still value in off-loading decoding to the GPU instead of staying on the CPU. Single-threaded LiberTIFF takes 879 ms ± 3.18 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) on my laptop, so the gap there is 0.88s (LiberTIFF, 1 thread) - 0.35s (nvTIFF) = 0.53s.
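
To illustrate the clash: in a PyTorch DataLoader with num_workers > 0, each worker is its own process, so GDAL-internal threads multiply across workers and compete for the same cores. Below is a minimal sketch of the usual workaround (single-threaded decoding per worker); it assumes GDAL’s Python bindings with NumPy support, and the COGDataset class is made up for illustration:

import torch
from osgeo import gdal
from torch.utils.data import DataLoader, Dataset

gdal.UseExceptions()

class COGDataset(Dataset):
    def __init__(self, paths):
        self.paths = paths

    def __len__(self):
        return len(self.paths)

    def __getitem__(self, idx):
        # Open inside the worker process (GDAL Dataset handles don't pickle),
        # and keep decoding single-threaded so N workers don't spawn N x 16 threads.
        ds = gdal.OpenEx(
            self.paths[idx],
            allowed_drivers=["LIBERTIFF"],
            open_options=["NUM_THREADS=1"],
        )
        array = ds.ReadAsArray()  # numpy array of shape (bands, rows, cols)
        return torch.from_numpy(array)

loader = DataLoader(COGDataset(["benches/TCI.tif"]), batch_size=1, num_workers=4)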

That’s what I’ve been wondering for years since this post (whether we could use kerchunk back then, or VirtualiZarr now, to do direct-to-GPU reads).

My understanding is that we would need zarr-python/VirtualiZarr to support these GPU-native libs:

                              CPU                                           GPU
TIFF metadata/IFD decoding    async-tiff (Rust) + virtual-tiff (Python)     ?
Decompression                 numcodecs (Python/Cython)                     nvCOMP (C++)

The ? is the key part. I’m proposing that nvTIFF is the more direct way of reading COGs to the GPU. The VirtualiZarr way would go through kvikio.zarr.GDSStore, and if that works, it could in theory be faster since it uses cuFile. Sadly, nvTIFF doesn’t actually use cuFile yet, but I think it’s only a matter of time.
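
For comparison, here’s a rough sketch of what that kvikio route could look like, assuming the zarr-python 2.x-era kvikio API and an already-materialised Zarr store on local disk ("TCI.zarr" is a placeholder, not something benchmarked in this thread):

import cupy
import zarr
import kvikio.zarr

# File reads go through cuFile/GDS where the hardware and driver support it.
store = kvikio.zarr.GDSStore("TCI.zarr")

# meta_array=cupy.empty(()) asks zarr to hand back CuPy arrays instead of NumPy.
z = zarr.open_array(store, mode="r", meta_array=cupy.empty(()))

gpu_data = z[:]  # chunks end up in GPU memory; decompression stays on the CPU
                 # unless the store was written with GPU-aware (nvCOMP-backed) codecs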

My hot take is that reading L2 GeoTIFF data to the GPU shouldn’t need to rely on Zarr or wait for the GeoZarr spec. Also, virtualizarr is Python-only for now, and I do think we should be building something that is cross-language compatible, which GDAL+LiberTIFF is doing for CPU workflows; I’m hoping that Rust bindings to nvTIFF will play that role for GPU workflows.


That seems right.

Reasonable.

I think this is totally orthogonal.

In what sense is that cross-language compatible? That you can bind to it from other low-level languages?

Yes, writing these I/O libraries in C/Rust allows us to create bindings for Python/R/JavaScript (WebAssembly)/etc. See e.g. what Arrow/GeoArrow has done for tabular data.

Besides cross-language, I’m also keen on getting cross-device compatibility working, and as mentioned here, I’m pushing for DLPack to be the standard in-memory tensor format that will allow data exchange between Intel CPUs/CUDA GPUs/AMD ROCm/Apple Silicon/etc. This will enable better ‘separation of storage and compute’ (Zarr is almost exclusively tied to zarr-python/xarray; same with GeoTIFF and GDAL), because then you can store data in any format that can go into DLPack, and use whatever compute engine that reads from DLPack (PyTorch/JAX/MLX/etc.) to run your algorithms.
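
As a concrete (if trivial) example of what that exchange looks like, here’s a minimal DLPack round-trip between CuPy and PyTorch, assuming both are installed with CUDA support; the array is just a stand-in for a decoded COG tile:

import cupy as cp
import torch

gpu_array = cp.zeros((3, 1024, 1024), dtype=cp.uint8)  # stand-in for a decoded tile on the GPU

torch_tensor = torch.from_dlpack(gpu_array)  # zero-copy: a PyTorch view of the CuPy memory
back_to_cupy = cp.from_dlpack(torch_tensor)  # and back again, still without copying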


virtualizarr is Python-only for now, and I do think we should be building something that is cross-language compatible

I mean maybe we should have rust-powered virtualizarr parsers… :grin:

From your blog post:

I’m proposing we build composable pieces to handle every layer of decoding a Cloud-optimized GeoTIFF:

The first two steps are general enough to be used by other data formats (Zarr, HDF5, etc.); it is only the third step, TIFF tag metadata parsing, that requires custom logic.

But the last step is exactly what a VirtualiZarr Parser for TIFF is meant to do! And that approach is not restricted to COGs (which zero people outside of the geospatial community use or ever will use). It effectively isolates the absolute minimum amount of code that needs to be format-specific (the parser). That’s what I imagine full composability would look like.

I’m probably missing something but couldn’t you do something like:

  • Parse the COG’s TIFF metadata in Python using virtualizarr
  • Now use zarr-python / zarrs + numcodecs / nvCOMP to read and decompress the actual bytes, either on CPU or GPU (using DLPack)

And if you wanted all the logic to be cross-language compatible, the only part left to do is to port the virtualizarr parser to Rust/C, which would presumably be fairly straightforward if you had started by wrapping a Rust-powered TIFF parser like async-tiff.

I think we’re both tackling the problem (GeoTIFF-to-GPU parsing) from two ends, and eventually things will converge :twisted_rightwards_arrows:. VirtualiZarr is approaching it from a protocol-based Python abstraction: what you have in Parser is essentially what’s called a trait in Rust, the difference being that Python does runtime checks (to see if you conform to the protocol) whereas Rust does compile-time enforcement. I’m tackling things from the lower-level implementation side, which is either nvTIFF (CUDA GPU-based), async-tiff (CPU-based, which I’ve also got a foot in), or whatever custom parser logic still needs to exist to read the GeoTIFF format.
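
To make the protocol-vs-trait point concrete in Python terms (Parser here is a made-up stand-in, not VirtualiZarr’s actual class):

from typing import Protocol, runtime_checkable

@runtime_checkable
class Parser(Protocol):
    def parse(self, path: str) -> dict:
        """Return chunk byte ranges and codec metadata for one file."""
        ...

class TIFFParser:
    # No inheritance needed: having a matching parse() method is enough,
    # and conformance is only checked at runtime (e.g. via isinstance),
    # whereas a Rust trait would be enforced at compile time.
    def parse(self, path: str) -> dict:
        return {"chunks": {}, "codec": "deflate"}

assert isinstance(TIFFParser(), Parser)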

Protocols are tricky to define, and I do have a lot of respect for how things have evolved from kerchunk’s JSON-based format, to Parquet, to VirtualiZarr’s Parser protocol! I think we’ve more or less settled on a network protocol (fsspec/object_store) and a buffer/bytes decompression protocol (numcodecs/compress trait); it’s that last mile of parsing custom n-dimensional file formats (HDF5/TIFF/Zarr/etc.) that I see VirtualiZarr trying to solve. I think you’re doing a good job for CPU-based parsing in Python at the moment, but we might need more work (in the future) to support cross-device (CUDA/ROCm/Metal) and cross-language (Python/R/JavaScript/etc.) use, which is where I’m going with DLPack.