NVIDIA recently released cuTile, which essentially lets you describe chunk (tile)-level parallelism in Python code and have the GPU execute it for you. See cutile-python.
I’m not really a GPU person, but could we use this with parallel computation frameworks like Cubed to get GPU parallelism at massive scale for scientific array workloads?
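For fellow non-GPU people, one way to picture "chunk-level parallelism": instead of writing per-element kernels, you describe what happens to each tile of the array. Here's a hand-rolled equivalent with plain CuPy (not the cuTile API, which I haven't used; the sizes and computation are just illustrative). My understanding is that cuTile lets you author the per-tile part as a proper kernel instead of looping from the host:

```python
import numpy as np
import cupy as cp  # assumes a CUDA-capable GPU with CuPy installed

# One big host array, processed tile by tile on the GPU by hand.
data = np.random.default_rng(0).random((8_000, 8_000))
out = np.empty_like(data)
tile = 2_000  # tile edge length, chosen to fit comfortably in GPU memory

for i in range(0, data.shape[0], tile):
    for j in range(0, data.shape[1], tile):
        block = cp.asarray(data[i:i + tile, j:j + tile])   # host -> device
        result = cp.sqrt(block) + 1.0                       # per-tile computation
        out[i:i + tile, j:j + tile] = cp.asnumpy(result)    # device -> host
```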
It’s not brand new; there was a SciPy talk on it.
If I’m not mistaken, cubed would need to rely on its backend_array_api (e.g. CuPy) having its algorithms (e.g. matrix multiplication) rewritten in cuTile to take advantage of this.
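To make the backend part concrete: cubed dispatches its array operations to whichever Array API namespace it is configured with. If I remember the docs correctly (the environment variable name below is from memory, so double-check it), switching from NumPy to CuPy is roughly:

```python
import os

# Point cubed at CuPy before importing it; env var name recalled from the
# cubed docs, so verify against the current documentation.
os.environ["CUBED_BACKEND_ARRAY_API_MODULE"] = "cupy"

import cubed.array_api as xp

a = xp.ones((1_000, 1_000), chunks=(500, 500))  # chunks are now CuPy arrays
```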
I’m also keeping an eye on CubeCL (Rust) and Modular (Mojo), which have similar ‘tile’/‘cube’ abstractions that are not tied to CUDA, i.e. they also support AMD/ROCm and Apple/Metal ecosystem GPUs. All of these pretty much use some MLIR dialect under the hood to perform SIMD (single instruction, multiple data) operations, which I don’t fully understand yet, but it’s something I’m keen to experiment with next year.
Would love to eventually see full GPU zarr read/decode/resampling/reprojection/rechunking/etc/encode/write pipelines and maybe cuTile can help the middle bits.
I haven’t had a chance to try cuTile yet, but that matches my understanding: cuTile is an alternative (IMO, more natural) way to author kernels that operate on in-(GPU)-memory ndarrays. It could complement cubed, especially for operations like map_blocks.
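That’s the pairing I had in mind too. A rough sketch of what it could look like (cubed’s map_blocks mirrors dask’s; the body of gpu_block_fn is just a CuPy stand-in for where a cuTile-authored kernel would slot in, and the spec values are made up):

```python
import cubed
import cubed.array_api as xp
import cupy as cp

def gpu_block_fn(block):
    """Per-chunk GPU computation; a cuTile kernel would replace the CuPy calls."""
    x = cp.asarray(block)   # chunk -> device
    y = cp.sqrt(x) + 1.0    # the actual kernel work
    return cp.asnumpy(y)    # device -> host for downstream stages

spec = cubed.Spec(work_dir="tmp", allowed_mem="2GB")
a = xp.ones((4_000, 4_000), chunks=(1_000, 1_000), spec=spec)
b = cubed.map_blocks(gpu_block_fn, a, dtype=a.dtype)
result = b.compute()
```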