Posts by Haocong Wang

Avoiding LDS Bank Conflicts on AMD GPUs Using CK-Tile Framework

LDS (Local Data Share) bank conflicts are a common performance bottleneck in GPU kernel development. Composable Kernel (CK-Tile), a kernel development framework for AMD GPUs, addresses them at the framework level. Composable Kernel for ROCm is used to build portable, high-performance kernels for accelerated computing, such as HPC, deep learning, and LLM training and inference workloads. In this blog, we show you how to detect, analyze, and eliminate LDS bank conflicts using CK-Tile. A GEMM kernel serves as the classic example for analyzing how threads interact with LDS during both reads and writes. Starting with a naïve memory layout, we evaluate its bank conflict behavior, explore mitigation techniques such as padding, and finally demonstrate how an XOR-based swizzle transformation achieves a bank-conflict-free design.
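To make the three layouts concrete, here is a minimal sketch in plain Python, not CK-Tile code. It uses a simplified model with assumed parameters: 32 banks of one 4-byte word each, a 32 x 32 tile of 4-byte elements, and 32 threads that each read one element of the same column. The function names are hypothetical, and real hardware issues LDS accesses in phases that this model ignores.

```python
from collections import Counter

NUM_BANKS = 32  # assumption: 32 banks, each one 4-byte word wide
TILE = 32       # assumption: 32 x 32 tile of 4-byte elements


def naive(row, col):
    # Row-major layout: word offset of element (row, col).
    return row * TILE + col


def padded(row, col):
    # One extra padding word per row shifts each row by one bank.
    return row * (TILE + 1) + col


def xor_swizzled(row, col):
    # Same footprint as the naive layout, but the column index is
    # XOR-ed with the row index before the element is stored.
    return row * TILE + (col ^ row)


def worst_conflict(layout, col=0):
    """Largest number of threads that hit the same bank when
    thread t accesses element (t, col), i.e. a column access."""
    banks = Counter(layout(row, col) % NUM_BANKS for row in range(TILE))
    return max(banks.values())


results = {f.__name__: worst_conflict(f) for f in (naive, padded, xor_swizzled)}
print(results)  # {'naive': 32, 'padded': 1, 'xor_swizzled': 1}
```

In this model, padding removes the conflict at the cost of one extra word of LDS per row, while the XOR swizzle removes it without growing the tile.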

Read more ...


From Theory to Kernel: Implement FlashAttention-v2 with CK-Tile

In our previous blog, Hands on with CK Tile, we walked through how to build a basic GEMM kernel using CK-Tile. In this blog, we go a step further and implement a fused kernel: the FlashAttention (FA)-v2 forward kernel. Figure 1 provides an overview of the kernel execution and data movement that occur while a single thread block computes its block of the output matrix. The subsequent sections explain in detail how to implement this using CK-Tile.

Read more ...