Chunk-level GPU parallelism using cutile-python?

NVIDIA recently released CUDA Tile, which basically lets you specify chunk-level (tile-level) parallelism in Python code and have the GPU just do it. See cutile-python.

I’m not really a GPU person, but could we use this with parallel computation frameworks like Cubed to get GPU parallelism at massive scale for scientific array workloads?

tagging the GPU crew @TomAugspurger @weiji14 @Negin_Sobhani


It’s not really new new, there was a SciPy talk on it :snake:

If I’m not mistaken, for cubed to take advantage of this, the array library behind its backend_array_api (e.g. CuPy) would need to rewrite its algorithms (e.g. matrix multiplication) in cuTile.
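To make "rewriting matrix multiplication in tiles" a bit more concrete, here is a minimal pure-Python sketch of the tiling idea. It does not use cuTile, CuPy, or a GPU at all; the function name `tiled_matmul` and the `tile` parameter are hypothetical and only there for illustration. The point is that each output tile is computed independently, which is the unit of work a tile-based GPU programming model would hand to a block.

```python
def tiled_matmul(a, b, tile=2):
    """Multiply a (n x k) by b (k x m), both lists of lists, one tile at a time.

    Hypothetical illustration only: plain Python, no GPU, no cuTile API.
    """
    n, k, m = len(a), len(b), len(b[0])
    c = [[0.0] * m for _ in range(n)]
    # Each (i0, j0) pair identifies one output tile. These tiles do not
    # depend on each other, so on a GPU they could run in parallel.
    for i0 in range(0, n, tile):
        for j0 in range(0, m, tile):
            # Walk along the shared dimension one tile at a time,
            # accumulating partial products into the output tile.
            for k0 in range(0, k, tile):
                for i in range(i0, min(i0 + tile, n)):
                    for j in range(j0, min(j0 + tile, m)):
                        acc = c[i][j]
                        for kk in range(k0, min(k0 + tile, k)):
                            acc += a[i][kk] * b[kk][j]
                        c[i][j] = acc
    return c


a = [[1, 2, 3], [4, 5, 6]]
b = [[7, 8], [9, 10], [11, 12]]
product = tiled_matmul(a, b, tile=2)
```

The tile size does not change the result, only how the work is grouped; edge tiles are handled by clamping the loop bounds with `min`.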

I’m also keeping an eye on CubeCL (Rust) and Modular (Mojo), which have similar ‘tile’/‘cube’ abstractions that are not tied to CUDA, i.e. they support AMD/ROCm and Apple/Metal ecosystem GPUs. All of these are pretty much using some MLIR dialect under the hood to perform some sort of SIMD (single instruction, multiple data) operation, which I don’t fully get yet, but is something I’m keen on experimenting with next year :smiley:

That’s how I understand it too from Bryce’s talk.

To me the more exciting bit is CUB exposed to python ergonomically! cuda.cccl.parallel API Reference — CUDA Core Compute Libraries

This could pair well with full GPU support in zarr-python, if zarr-python could find a home for NVIDIA-based codecs like Zstd Codec on the GPU by akshaysubr · Pull Request #2863 · zarr-developers/zarr-python · GitHub.

Would love to eventually see full GPU zarr read/decode/resampling/reprojection/rechunking/etc/encode/write pipelines and maybe cuTile can help the middle bits.


I haven’t had a chance to try cuTile yet, but that matches my understanding: cuTile is an alternative (IMO, more natural) way to author kernels that operate on in-(GPU)-memory ndarrays. It could complement cubed, especially operations like map_blocks.
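As a rough picture of how a per-chunk kernel could slot into a map_blocks-style operation, here is a small stand-in sketch using only the Python standard library. Everything in it is hypothetical: `split_into_chunks`, `per_chunk_kernel`, and `map_chunks` are made-up names, not cubed or cuTile APIs, and the "kernel" is plain Python rather than GPU code. The assumption is simply that the framework handles chunking and scheduling, while the per-chunk function is where a GPU kernel would run on one in-memory chunk.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(data, chunk_size):
    """Split a 1-D list into consecutive chunks (the last one may be shorter)."""
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]


def per_chunk_kernel(chunk):
    """Placeholder for a kernel that processes one in-memory chunk.

    In a real pipeline this is where a GPU kernel would be launched;
    here it is ordinary Python so the sketch runs anywhere.
    """
    return [x * x + 1 for x in chunk]


def map_chunks(func, data, chunk_size, max_workers=4):
    """Apply func to every chunk independently, then stitch results together."""
    chunks = split_into_chunks(data, chunk_size)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map returns results in the same order as the input chunks.
        results = list(pool.map(func, chunks))
    return [x for chunk in results for x in chunk]


result = map_chunks(per_chunk_kernel, list(range(10)), chunk_size=4)
```

The split between "who schedules chunks" and "what runs inside a chunk" is the design point: swapping `per_chunk_kernel` for a GPU-backed function would leave the outer loop unchanged.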

> Would love to eventually see full GPU zarr read/decode/resampling/reprojection/rechunking/etc/encode/write pipelines and maybe cuTile can help the middle bits.

GPU-Accelerated Zarr | Tom's Blog might be worth reading if you / others are interested in GPU-accelerated Zarr workloads.
