Haocong Wang

Haocong is a member of the Composable Kernel team, where he is the technical leader for kernel performance optimization. Since 2022, he has contributed to AMDGPU and ROCm across GPU generations from RDNA2 and CDNA1 through RDNA4 and CDNA4.

Haocong specializes in developing and optimizing operator performance on AMD GPUs. He led the development and optimization of multi-precision matrix multiplication operators on the MI300 platform, significantly boosting AMD GPU performance for AI inference workloads. He is dedicated to unlocking hardware potential through high-performance operators built with Composable Kernel (CK), accelerating user workloads to deliver tangible customer value. His research interests focus on creating sustainable and versatile GPU programming paradigms, abstracting complex hardware concepts, and optimizing interactions with compilers.

Posts by Haocong Wang

July 25, 2025

Avoiding LDS Bank Conflicts on AMD GPUs Using CK-Tile Framework

This blog shows how CK-Tile’s XOR-based swizzle optimizes shared memory access in GEMM kernels on AMD GPUs by eliminating LDS bank conflicts.

https://rocm.blogs.amd.com/software-tools-optimization/lds-bank-conflict/README.html

May 21, 2025

From Theory to Kernel: Implement FlashAttention-v2 with CK-Tile

Learn how to implement FlashAttention-v2 with CK-Tile: minimize memory overhead, maximize compute efficiency, and scale on AMD GPUs.

https://rocm.blogs.amd.com/software-tools-optimization/ck-tile-flash/README.html