AI - Software Tools & Optimizations

AI - Software Tools & Optimizations#

The vLLM MoE Playbook: A Practical Guide to TP, DP, PP and Expert Parallelism

Learn how to combine TP, DP, PP, and EP for MoE models. Discover proven strategies to maximize performance on your vLLM deployments.

November 24, 2025 by Pin Siang Tan, Hongxia Yang, Peng Sun, Andy Luo, Jun Kang Chow, Ye Hur Cheong, Tun Jian Tan

Practical, Fault‑Robust Distributed Inference for DeepSeek on AMD MI300X

Learn how a small-radius expert parallel design with prefill–decode disaggregation enables scalable, fault-isolated LLM inference on AMD Instinct™ MI300X clusters.

November 12, 2025 by Peng Sun, Andy Luo, Gilbert Lei, Lingpeng Jin, Carlus Huang, Duyi Wang, Mingzhi Liu, Di Tian, Bill He, Jun Chen, Yutong Wu, Jiahao Zhou, Niko Ma

Stability at Scale: AMD’s Full‑Stack Platform for Large‑Model Training

Primus streamlines LLM training on AMD GPUs with unified configs, multi-backend support, preflight validation, and structured logging.

November 04, 2025 by Chaojun Hou, Lei Wei, Liz Li, Yao Fu, Andy Luo, Zhenyu Gu

High-Accuracy MXFP4, MXFP6, and Mixed-Precision Models on AMD GPUs

Learn to leverage AMD Quark for efficient MXFP4/MXFP6 quantization on AMD Instinct accelerators with high accuracy retention.

October 29, 2025 by Lin Zhao, Felix Marty, Spandan Tiwari, Wei Luo, Bowen Bao, Xinjun Niu, Zhaofeng Zhang, Haoyang Li, Ke Wang, Ashish Sirasao

ROCm 7.9 Technology Preview: ROCm Core SDK and TheRock Build System

Introduce ROCm Core SDK, and learn to install and build ROCm components easily using TheRock.

October 20, 2025 by Dominic Widdows, Janet Tseng, Scott Todd, Chris Sosa, Saad Rahim

Gumiho: A New Paradigm for Speculative Decoding — Earlier Tokens in a Draft Sequence Matter More

Gumiho boosts LLM inference with early-token accuracy, blending serial + parallel decoding for speed, accuracy, and ROCm-optimized deployment.

October 14, 2025 by Jinze Li, Yixing Xu, Xuanwu Yin, Dong Li, Emad Barsoum

GEMM Tuning within hipBLASLt– Part 2

Learn how to use hipblaslt-bench for offline GEMM tuning in hipBLASLt—benchmark, save, and apply custom-tuned kernels at runtime.

October 09, 2025 by Chia Hung, YangWen Huang, Carson Liao

GPU Partitioning Made Easy: Pack More AI Workloads Using AMD GPU Operator

What’s New in AMD GPU Operator: Learn About GPU Partitioning and New Kubernetes Features

October 01, 2025 by Alireza Sariaslani

Matrix Core Programming on AMD CDNA™3 and CDNA™4 architecture

This blog post explains how to use Matrix Cores on CDNA3 and CDNA4 architecture, with a focus on low-precision data types such as FP16, FP8, and FP4

September 30, 2025 by Amanzhol Salykov, Andy Luo, Carlus Huang, Peng Sun

An Introduction to Primus-Turbo: A Library for Accelerating Transformer Models on AMD GPUs

Primus streamlines training on AMD ROCm, from fine-tuning to massive pretraining on MI300X GPUs—faster, safer, and easier to debug

September 19, 2025 by Xiaobo Chen, Wen Xie, Liz Li, Yao Fu, Andy Luo, Zhenyu Gu

Efficient LLM Serving with MTP: DeepSeek V3 and SGLang on AMD Instinct GPUs

This blog will show you how to speed up LLM inference with Multi-Token Prediction in DeepSeek V3 & SGLang on AMD Instinct GPUs

September 11, 2025 by Chang Liu, Andy Luo, Anshul Gupta

GEMM Tuning within hipBLASLt - Part 1

We introduce a hipBLASLt tuning tool that lets developers optimize GEMM problem sizes and integrate them into the library.

September 05, 2025 by YangWen Huang, Carson Liao

Prev Page 1 of 4 Next