Posts by Bruce Xue

Accelerate DeepSeek-R1 Inference: Integrate AITER into SGLang

High-performance AI operators and kernels are critical to achieving optimized LLM performance on GPUs. AMD recently announced AITER, a centralized repository designed to accelerate AI workloads by providing a unified collection of high-performance AI operators. It serves as a comprehensive hub for customer operator requests, supporting diverse needs across private, public, and custom frameworks. With both C++ and Python APIs, AITER lets developers focus on operator development while offering flexible backend kernel implementations in Triton, CK, or assembly. AITER covers inference kernels, training kernels, GEMM, and communication kernels, allowing flexibility across different kernel-framework pairings and architectural constraints.

In this blog we provide a comprehensive, step-by-step, hands-on guide to integrating AITER operators into SGLang for DeepSeek-R1. SGLang is a fast serving framework for large language and vision-language models. For DeepSeek-R1, SGLang incorporates MLA (Multi-Head Latent Attention) optimizations and supports FP8 precision (specifically the W8A8 format). These optimizations make it possible to identify target modules that can be replaced with AITER-optimized implementations, improving overall efficiency and performance. AITER integration delivers significant performance improvements across the entire inference pipeline while maintaining full functional equivalence with the original architecture.
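The integration pattern the post describes, swapping a framework's default operator implementation for an optimized kernel while preserving numerical equivalence, can be sketched generically. The snippet below is an illustrative sketch in plain Python, not real AITER or SGLang code: the names `default_rmsnorm`, `optimized_rmsnorm`, and the `OPS` registry are hypothetical stand-ins for a framework's operator dispatch table.

```python
import math

# Reference RMSNorm as a serving framework might define it
# (hypothetical stand-in, not the real SGLang implementation).
def default_rmsnorm(x, weight, eps=1e-6):
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms * w for v, w in zip(x, weight)]

# Drop-in replacement, standing in for an AITER-optimized kernel.
# Any substitute must stay numerically equivalent to the original.
def optimized_rmsnorm(x, weight, eps=1e-6):
    inv_rms = 1.0 / math.sqrt(math.fsum(v * v for v in x) / len(x) + eps)
    return [v * inv_rms * w for v, w in zip(x, weight)]

# Operator registry: integration amounts to re-pointing the name
# the framework dispatches through at the optimized implementation.
OPS = {"rmsnorm": default_rmsnorm}

def use_optimized_ops():
    OPS["rmsnorm"] = optimized_rmsnorm

x = [1.0, -2.0, 3.0, 0.5]
w = [1.0, 1.0, 1.0, 1.0]
baseline = OPS["rmsnorm"](x, w)
use_optimized_ops()
replaced = OPS["rmsnorm"](x, w)
# Functional equivalence check, mirroring the equivalence claim above.
assert all(abs(a - b) < 1e-9 for a, b in zip(baseline, replaced))
```

In the actual integration, the same idea applies at module level: the target modules identified in SGLang's DeepSeek-R1 path are rebound to AITER kernels, with output equivalence verified against the originals.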
