Posts by George Wang
From Theory to Kernel: Implement FlashAttention-v2 with CK-Tile
- 21 May 2025
In our previous blog, Hands-On with CK-Tile, we walked through how to build a basic GEMM kernel using CK-Tile. In this blog, we further explore the implementation of a fused kernel, specifically the FlashAttention (FA)-v2 forward kernel. Figure 1 provides an overview of the FlashAttention kernel executions and data movements that occur while computing a single thread block of the output matrix. Each of the subsequent sections explains in detail how to implement this using CK-Tile.
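The heart of the FA-v2 forward pass is an online softmax: each query tile streams over K/V tiles while maintaining a running row maximum and normalizer, so the full score matrix is never materialized. A minimal NumPy sketch of that computation for a single head (illustrative only; the CK-Tile kernel adds tiling to LDS/registers, pipelining, and vendor-optimized policies):

```python
import numpy as np

def flash_attention_fwd(Q, K, V, tile=32):
    """FlashAttention-2-style tiled forward pass (single head).

    Each block of query rows loops over K/V tiles, keeping a running
    max (m) and softmax denominator (l) so the full seq x seq score
    matrix is never formed.
    """
    seq, d = Q.shape
    scale = 1.0 / np.sqrt(d)
    O = np.zeros_like(Q)
    for qs in range(0, seq, tile):
        q = Q[qs:qs + tile]                      # query tile
        m = np.full(q.shape[0], -np.inf)         # running row max
        l = np.zeros(q.shape[0])                 # running softmax denom
        acc = np.zeros((q.shape[0], d))          # unnormalized output
        for ks in range(0, seq, tile):
            s = scale * (q @ K[ks:ks + tile].T)  # score tile
            m_new = np.maximum(m, s.max(axis=1))
            p = np.exp(s - m_new[:, None])       # rescaled probabilities
            corr = np.exp(m - m_new)             # rescale earlier partials
            l = l * corr + p.sum(axis=1)
            acc = acc * corr[:, None] + p @ V[ks:ks + tile]
            m = m_new
        O[qs:qs + tile] = acc / l[:, None]       # final normalization
    return O
```

The rescaling factor `corr` is what lets FA-v2 defer normalization to the very end of the K/V loop, which is the key difference from the v1 formulation.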
Accelerate DeepSeek-R1 Inference: Integrate AITER into SGLang
- 16 May 2025
To achieve optimized LLM performance on GPUs, high-performance AI operators/kernels are critical. AMD recently announced AITER, a centralized repository designed to accelerate AI workloads by providing a unified collection of high-performance AI operators. It serves as a comprehensive hub for customer operator requests, supporting diverse needs across private, public, or custom frameworks. With both C++ and Python APIs, AITER enables developers to focus on operator development while offering flexible backend kernel implementations in Triton, CK, or assembly. AITER supports inference kernels, training kernels, GEMMs, and communication kernels, allowing flexibility across different kernel-framework pairings and architectural constraints. In this blog, we provide a comprehensive, step-by-step, hands-on guide to integrating AITER operators into SGLang for DeepSeek-R1. SGLang is a fast serving framework for large language and vision-language models. For DeepSeek-R1, SGLang incorporates MLA (Multi-Head Latent Attention) optimizations and supports FP8 precision (specifically the W8A8 format). These enhancements make it possible to identify target modules that can be replaced with AITER-optimized implementations, improving overall efficiency and performance. AITER integration delivers significant performance improvements across the entire inference pipeline while maintaining full functional equivalence with the original architecture.
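At its core, this kind of integration is operator substitution: route a module's compute path to an optimized kernel when the inputs are supported, and fall back to the reference path otherwise, preserving functional equivalence. A schematic Python sketch of the pattern; all names here are hypothetical placeholders, not the actual AITER or SGLang APIs:

```python
# Schematic operator-substitution pattern. The function names below are
# hypothetical stand-ins, NOT real AITER/SGLang entry points.

def make_dispatching_gemm(reference_gemm, optimized_gemm, supported):
    """Wrap a reference op so calls are routed to an optimized kernel
    whenever the inputs match what the kernel supports."""
    def gemm(a, b):
        if supported(a, b):
            return optimized_gemm(a, b)   # e.g. an AITER-backed kernel
        return reference_gemm(a, b)       # fall back to the original path
    return gemm

# Toy stand-ins that record which path ran, to show the dispatch.
calls = []
def ref(a, b):  calls.append("ref");  return a * b
def fast(a, b): calls.append("fast"); return a * b

# Pretend the "optimized kernel" only supports sizes divisible by 8.
gemm = make_dispatching_gemm(ref, fast, supported=lambda a, b: a % 8 == 0)
```

Because both paths compute the same result, the substitution can be applied module by module and validated against the original outputs at each step.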
Step-Video-T2V Inference with xDiT on AMD Instinct MI300X GPUs
- 15 May 2025
The Stepfun Step-Video-T2V is a 30B parameter state-of-the-art text-to-video (T2V) model capable of generating high-quality videos of up to 204 frames. As video generation advances toward Artificial General Intelligence (AGI), such models play a key role in automating and democratizing video creation. In this blog, we introduce Step-Video-T2V with xDiT running efficiently out-of-the-box on multi-GPU systems powered by AMD Instinct™ MI300X, leveraging high-bandwidth memory and ROCm™ for fast, scalable video generation.
Unleash Full GPU Potential: Overlap Communication and Computation with Triton-Distributed
- 06 May 2025
In distributed computing, AI workloads demand both massive parallelism and efficient data movement. A primary challenge lies in effectively overlapping computation with communication to maximize performance. GPUs are excellent at crunching numbers, but their full potential often remains untapped because of inter-GPU communication latency: compute units sit idle for long stretches while waiting for data to arrive from other nodes. In this blog, we show how you can use the Triton-Distributed framework to generate kernels that overlap communication and computation, achieving performance that can rival highly optimized libraries.
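The overlap idea can be illustrated host-side with a double-buffered pipeline: while tile i is being computed, the transfer for tile i+1 is already in flight. A minimal Python sketch using a worker thread as a stand-in for an asynchronous inter-GPU copy (conceptual only; Triton-Distributed expresses this on the device with communication primitives and signals):

```python
import concurrent.futures as cf
import time

def fetch(i):
    """Stand-in for an asynchronous inter-GPU transfer of tile i."""
    time.sleep(0.01)                         # simulated transfer latency
    return list(range(i * 4, i * 4 + 4))

def compute(tile):
    """Stand-in for the GPU computation on one tile."""
    time.sleep(0.01)                         # simulated compute time
    return sum(tile)

def pipeline(n_tiles):
    """Double-buffered loop: prefetch tile i+1 while computing tile i,
    so transfer latency is hidden behind useful work."""
    results = []
    with cf.ThreadPoolExecutor(max_workers=1) as pool:
        pending = pool.submit(fetch, 0)              # start first transfer
        for i in range(n_tiles):
            tile = pending.result()                  # wait for tile i
            if i + 1 < n_tiles:
                pending = pool.submit(fetch, i + 1)  # next copy in flight
            results.append(compute(tile))            # overlaps the copy
    return results
```

With perfect overlap, total time approaches max(compute, transfer) per tile instead of their sum, which is exactly the gap the generated kernels aim to close.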
Hands-On with CK-Tile: Develop and Run Optimized GEMM on AMD GPUs
- 15 April 2025
Composable Kernel (CK-Tile) for ROCm is used to build portable, high-performance kernels for accelerated computing, e.g., HPC, deep learning, and LLM training and inference workloads. The CK-Tile APIs provide vendor-optimized kernels such as GEMM, batched GEMM, fused MHA, fused MoE, SmoothQuant, and element-wise kernels, among many others. This blog focuses on creating the most commonly used GEMM kernel, incorporating a vendor-optimized kernel pipeline and policies, and covers key CK-Tile concepts for quick learning.
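The structure such a GEMM kernel builds on is tiling: each workgroup owns one tile of C and accumulates over K-dimension slices staged through fast memory. A NumPy sketch of that loop nest (illustrative of the decomposition only; the CK-Tile pipeline and policies govern the actual data movement and scheduling on the GPU):

```python
import numpy as np

def tiled_gemm(A, B, mt=32, nt=32, kt=32):
    """C = A @ B computed tile by tile: each (mt x nt) tile of C is
    owned by one 'workgroup', which loops over kt-wide K slices."""
    M, K = A.shape
    K2, N = B.shape
    assert K == K2, "inner dimensions must match"
    C = np.zeros((M, N))
    for m in range(0, M, mt):                # one iteration ~ one workgroup
        for n in range(0, N, nt):
            acc = np.zeros((min(mt, M - m), min(nt, N - n)))
            for k in range(0, K, kt):        # mainloop over K tiles
                acc += A[m:m + mt, k:k + kt] @ B[k:k + kt, n:n + nt]
            C[m:m + mt, n:n + nt] = acc      # epilogue: write the tile
    return C
```

The tile sizes `mt`, `nt`, `kt` are the knobs a real pipeline tunes against register file, LDS capacity, and memory bandwidth.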
Unlock Peak Performance on AMD GPUs with Triton Kernel Optimizations
- 10 April 2025
Triton is a domain-specific programming language designed to simplify GPU programming for high-performance tasks, particularly in AI applications. It provides an open-source environment that enables users to write high-level Triton code with greater productivity compared to Nvidia CUDA or AMD HIP. The Triton compiler translates Triton code into optimized GPU instructions, effectively compiling tensor operations into low-level GPU code. It achieves high efficiency through multiple optimization passes and by leveraging the underlying architecture of the GPU. To optimize GPU performance, it is important to have a solid understanding of the Triton compiler and the role it plays in kernel performance. In this blog, we take a deep dive into the AMD Triton compiler, introduce Triton kernel compilation, and provide insights on how to write efficient Triton kernels.
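Triton's programming model launches a grid of "programs", each handling a block of elements with masked loads and stores; this is the abstraction the compiler lowers to GPU instructions. A small NumPy emulation of that model for vector addition (a conceptual sketch of the semantics, not actual Triton code):

```python
import numpy as np

BLOCK_SIZE = 8

def add_program(pid, x, y, out):
    """Emulates one Triton program instance: compute this block's
    offsets, mask out-of-range lanes, load, add, store."""
    offsets = pid * BLOCK_SIZE + np.arange(BLOCK_SIZE)
    mask = offsets < x.size                  # guard the ragged last block
    out[offsets[mask]] = x[offsets[mask]] + y[offsets[mask]]

def vector_add(x, y):
    out = np.empty_like(x)
    grid = (x.size + BLOCK_SIZE - 1) // BLOCK_SIZE   # ceil-div launch grid
    for pid in range(grid):                  # on a GPU these run in parallel
        add_program(pid, x, y, out)
    return out
```

In real Triton, `pid` comes from `tl.program_id` and the masked accesses from `tl.load`/`tl.store`; the compiler's optimization passes decide how each block maps onto the hardware.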
GEMM Kernel Optimization For AMD GPUs
- 06 February 2025
Matrix multiplication underlies critical computational pathways in AI, with General Matrix Multiplication (GEMM) operations serving as performance-critical kernels in neural network architectures. From fully connected layers to convolutions and transformer attention mechanisms, GEMMs consume substantial computational and memory resources in large language models (LLMs). This blog explores GEMM optimization techniques for AMD GPUs, demonstrating methodologies to significantly enhance computational efficiency and performance scaling.
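Whether a given GEMM is compute-bound or memory-bound follows from its arithmetic intensity (FLOPs per byte moved); a quick Python check of the standard estimate, assuming fp16 operands and one ideal read/write per element:

```python
def gemm_arithmetic_intensity(M, N, K, bytes_per_elem=2):
    """FLOPs per byte for C[M,N] = A[M,K] @ B[K,N]: 2*M*N*K FLOPs
    against one read of A and B plus one write of C (ideal caching)."""
    flops = 2 * M * N * K
    bytes_moved = bytes_per_elem * (M * K + K * N + M * N)
    return flops / bytes_moved

# Large square GEMMs are compute-bound; skinny decode-time GEMMs in
# LLM inference move ~1 byte per FLOP and are memory-bound instead.
big = gemm_arithmetic_intensity(4096, 4096, 4096)   # ~1365 FLOPs/byte
skinny = gemm_arithmetic_intensity(1, 4096, 4096)   # < 1 FLOP/byte
```

This single number largely dictates which optimization techniques pay off: tiling and data reuse for the compute-bound shapes, bandwidth and layout tricks for the memory-bound ones.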