Posts by Ning Zhang

Step-3 Deployment Simplified: A Day 0 Developer’s Guide on AMD Instinct™ GPUs

04 September 2025

Today’s large language models (LLMs) still face high decoding costs for long-context reasoning tasks. Step-3 is a 321B-parameter open-source vision-language model (VLM) designed with hardware-aware model–system co-design to minimize decoding costs. With strong support from the open-source community—especially SGLang and Triton—AMD is excited to bring Step-3 to our Instinct™ GPU accelerators.


Unlock Peak Performance on AMD GPUs with Triton Kernel Optimizations

10 April 2025

Triton is a domain-specific programming language designed to simplify GPU programming for high-performance tasks, particularly in AI applications. It provides an open-source environment in which users can write high-level Triton code more productively than with NVIDIA CUDA or AMD HIP. The Triton compiler translates Triton code into optimized GPU instructions, effectively compiling tensor operations into low-level GPU code. It achieves high efficiency through multiple optimization passes and by leveraging the underlying architecture of the GPU. To optimize GPU performance, it is important to have a solid understanding of the Triton compiler and the role it plays in kernel performance. In this blog, we take a deep dive into the AMD Triton compiler, introduce Triton kernel compilation, and provide insights on how to write efficient Triton kernel code.


GEMM Kernel Optimization For AMD GPUs

06 February 2025

Matrix multiplication underlies critical computational pathways in AI, with General Matrix Multiplication (GEMM) operations serving as performance-critical kernels in neural network architectures. From fully connected layers to convolutions and transformer attention mechanisms, GEMMs consume substantial computational and memory resources in large language models (LLMs). This blog explores GEMM optimization techniques for AMD GPUs, demonstrating methodologies to significantly enhance computational efficiency and performance scaling.
