Andy Luo


Andy Luo is Director of AI Application Engineering in the Artificial Intelligence group at AMD, where he leads a team that optimizes AI performance and contributes to open source communities. He holds a Master’s degree in Electrical Engineering from Fudan University. His interests lie in enabling AI developers across different ML frameworks, libraries, and toolkits.

Posts by Andy Luo

January 21, 2026

ROCm Becomes a First-Class Platform in the vLLM Ecosystem

ROCm is now a first-class vLLM platform: official wheels + Docker, stronger CI, and faster LLM & multimodal inference on AMD Instinct GPUs.

https://rocm.blogs.amd.com/software-tools-optimization/vllm-omni/README.html

January 02, 2026

Accelerating Multimodal Inference in vLLM: The One-Line Optimization for Large Multimodal Models

Learn how to optimize multimodal model inference with batch-level data parallelism for vision encoders in vLLM, achieving up to 45% throughput gains on AMD MI300X.

https://rocm.blogs.amd.com/software-tools-optimization/vllm-dp-vision/README.html

December 16, 2025

MoE Training Best Practices on AMD GPUs

Learn how to optimize Mixture-of-Experts (MoE) model training on AMD Instinct GPUs with ROCm. Maximize your AI training performance now!

https://rocm.blogs.amd.com/software-tools-optimization/primus-moe-package/README.html

November 24, 2025

The vLLM MoE Playbook: A Practical Guide to TP, DP, PP and Expert Parallelism

Learn how to combine TP, DP, PP, and EP for MoE models. Discover proven strategies to maximize performance on your vLLM deployments.

https://rocm.blogs.amd.com/software-tools-optimization/vllm-moe-guide/README.html

November 12, 2025

Practical, Fault‑Robust Distributed Inference for DeepSeek on AMD MI300X

Learn how a small-radius expert parallel design with prefill–decode disaggregation enables scalable, fault-isolated LLM inference on AMD Instinct™ MI300X clusters.

https://rocm.blogs.amd.com/software-tools-optimization/wide-ep-deepseek/README.html

November 04, 2025

Stability at Scale: AMD’s Full‑Stack Platform for Large‑Model Training

Primus streamlines LLM training on AMD GPUs with unified configs, multi-backend support, preflight validation, and structured logging.

https://rocm.blogs.amd.com/software-tools-optimization/primus-SaFE/README.html

September 30, 2025

Matrix Core Programming on AMD CDNA™3 and CDNA™4 architecture

This blog post explains how to use Matrix Cores on the CDNA3 and CDNA4 architectures, with a focus on low-precision data types such as FP16, FP8, and FP4.

https://rocm.blogs.amd.com/software-tools-optimization/matrix-cores-cdna/README.html

September 19, 2025

An Introduction to Primus-Turbo: A Library for Accelerating Transformer Models on AMD GPUs

Primus streamlines training on AMD ROCm, from fine-tuning to massive pretraining on MI300X GPUs: faster, safer, and easier to debug.

https://rocm.blogs.amd.com/software-tools-optimization/primus-large-models/README.html

September 11, 2025

Efficient LLM Serving with MTP: DeepSeek V3 and SGLang on AMD Instinct GPUs

This blog shows you how to speed up LLM inference with Multi-Token Prediction in DeepSeek V3 and SGLang on AMD Instinct GPUs.

https://rocm.blogs.amd.com/software-tools-optimization/mtp/README.html

August 05, 2025

Day 0 Developer Guide: Running the Latest Open Models from OpenAI on AMD AI Hardware

Day 0 support across our AI hardware ecosystem, from our flagship AMD Instinct™ MI355X and MI300X GPUs to AMD Radeon™ AI PRO R700 GPUs and AMD Ryzen™ AI Processors.

https://rocm.blogs.amd.com/ecosystems-and-partners/openai-day-0/README.html

July 07, 2025

vLLM V1 Meets AMD Instinct GPUs: A New Era for LLM Inference Performance

vLLM v1 on AMD ROCm boosts LLM serving with faster TTFT, higher throughput, and optimized multimodal support—ready out of the box.

https://rocm.blogs.amd.com/software-tools-optimization/vllmv1-rocm-llm/README.html

May 20, 2025

AMD Integrates llm-d on AMD Instinct MI300X Cluster For Distributed LLM Serving

https://rocm.blogs.amd.com/artificial-intelligence/llm-d-distributed/README.html

May 01, 2025

Optimizing DeepseekV3 Inference on SGLang Using ROCm Profiling Tools

Dive into kernel-level profiling of DeepseekV3 on SGLang: identify GPU bottlenecks and boost large language model performance using ROCm.

https://rocm.blogs.amd.com/software-tools-optimization/kernel-analysis-deep/README.html

April 28, 2025

Power Up Qwen 3 with AMD Instinct: A Developer’s Day 0 Quickstart

Explore the power of Alibaba's QWEN3 models on AMD Instinct™ MI300X and MI325X GPUs, available from Day 0 with seamless SGLang and vLLM integration.

https://rocm.blogs.amd.com/artificial-intelligence/qwen3-day0-amd/README.html

April 28, 2025

Boosting Llama 4 Inference Performance with AMD Instinct MI300X GPUs

Learn how to boost your Llama 4 inference performance on AMD MI300X GPUs using AITER-optimized kernels and advanced vLLM techniques.

https://rocm.blogs.amd.com/software-tools-optimization/llama4-performance-b/README.html

April 06, 2025

Power Up Llama 4 with AMD Instinct: A Developer’s Day 0 Quickstart

Explore the power of Meta’s Llama 4 multimodal models on AMD Instinct™ MI300X and MI325X GPUs, available from Day 0 with seamless vLLM integration.

https://rocm.blogs.amd.com/artificial-intelligence/llama4-day-0-support/README.html

March 21, 2025

Supercharge DeepSeek-R1 Inference on AMD Instinct MI300X

Learn how to optimize DeepSeek-R1 on AMD MI300X with SGLang, AITER kernels, and hyperparameter tuning for up to 5× throughput and 60% lower latency over Nvidia H200.

https://rocm.blogs.amd.com/artificial-intelligence/DeepSeekR1-Part2/README.html

February 21, 2025

Unlock DeepSeek-R1 Inference Performance on AMD Instinct™ MI300X GPU

This blog introduces the key performance optimizations made to enable DeepSeek-R1 inference.

https://rocm.blogs.amd.com/artificial-intelligence/DeepSeekR1_Perf/README.html

January 29, 2025

Best practices for competitive inference optimization on AMD Instinct™ MI300X GPUs

Learn how to optimize large language model inference using vLLM on AMD's MI300X GPUs for enhanced performance and efficiency.

https://rocm.blogs.amd.com/artificial-intelligence/LLM_Inference/README.html