Posts by Christian Gilli

Deep Dive Into 4-Wave Interleave FP8 GEMM

Our previous two posts in this GEMM optimization series covered Matrix Core instructions and 8-wave ping-pong FP8 GEMM design. Here we discuss another algorithm design introduced by HipKittens - 4-wave interleave, which further improves the performance of the 8-wave ping-pong implementation. For the most complete understanding, we recommend reading this post alongside the source code.

Read more ...