NVIDIA recently released cuTile, which essentially lets you describe chunk (tile)-level parallelism in Python code and have the GPU execute it for you. See cutile-python.
I’m not really a GPU person, but could we use this with parallel computation frameworks like Cubed to get GPU parallelism at massive scale for scientific array workloads?
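For fellow non-GPU people, one way to picture "chunk-level parallelism": instead of writing per-element kernels, you describe what happens to each tile of the array. Here's a hand-rolled equivalent with plain CuPy (not the cuTile API, which I haven't used; the sizes and computation are just illustrative). My understanding is that cuTile lets you author the per-tile part as a proper kernel instead of looping from the host:

```python
import numpy as np
import cupy as cp  # assumes a CUDA-capable GPU with CuPy installed

# One big host array, processed tile by tile on the GPU by hand.
data = np.random.default_rng(0).random((8_000, 8_000))
out = np.empty_like(data)
tile = 2_000  # tile edge length, chosen to fit comfortably in GPU memory

for i in range(0, data.shape[0], tile):
    for j in range(0, data.shape[1], tile):
        block = cp.asarray(data[i:i + tile, j:j + tile])   # host -> device
        result = cp.sqrt(block) + 1.0                       # per-tile computation
        out[i:i + tile, j:j + tile] = cp.asnumpy(result)    # device -> host
```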
It’s not brand new; there was a SciPy talk on it.
If I’m not mistaken, cubed would need to rely on its backend_array_api (e.g. CuPy) having its algorithms (e.g. matrix multiplication) rewritten in cuTile to take advantage of this.
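To make the backend part concrete: cubed dispatches its array operations to whichever Array API namespace it is configured with. If I remember the docs correctly (the environment variable name below is from memory, so double-check it), switching from NumPy to CuPy is roughly:

```python
import os

# Point cubed at CuPy before importing it; env var name recalled from the
# cubed docs, so verify against the current documentation.
os.environ["CUBED_BACKEND_ARRAY_API_MODULE"] = "cupy"

import cubed.array_api as xp

a = xp.ones((1_000, 1_000), chunks=(500, 500))  # chunks are now CuPy arrays
```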
I’m also keeping an eye on CubeCL (Rust) and Modular (Mojo), which have similar ‘tile’/‘cube’ abstractions that are not tied to CUDA, i.e. they also support AMD/ROCm and Apple/Metal ecosystem GPUs. All of these pretty much use some MLIR dialect under the hood to perform SIMD (single instruction, multiple data) operations, which I don’t fully understand yet, but it’s something I’m keen to experiment with next year.
Would love to eventually see full GPU zarr read/decode/resampling/reprojection/rechunking/etc/encode/write pipelines and maybe cuTile can help the middle bits.
I haven’t had a chance to try cuTile yet, but that matches my understanding: cuTile is an alternative (IMO, more natural) way to author kernels that operate on in-(GPU)-memory ndarrays. It could complement cubed, especially for operations like map_blocks.
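That’s the pairing I had in mind too. A rough sketch of what it could look like (cubed’s map_blocks mirrors dask’s; the body of gpu_block_fn is just a CuPy stand-in for where a cuTile-authored kernel would slot in, and the spec values are made up):

```python
import cubed
import cubed.array_api as xp
import cupy as cp

def gpu_block_fn(block):
    """Per-chunk GPU computation; a cuTile kernel would replace the CuPy calls."""
    x = cp.asarray(block)   # chunk -> device
    y = cp.sqrt(x) + 1.0    # the actual kernel work
    return cp.asnumpy(y)    # device -> host for downstream stages

spec = cubed.Spec(work_dir="tmp", allowed_mem="2GB")
a = xp.ones((4_000, 4_000), chunks=(1_000, 1_000), spec=spec)
b = cubed.map_blocks(gpu_block_fn, a, dtype=a.dtype)
result = b.compute()
```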