Software tools & optimizations

Discover the latest blogs about ROCm software tools, libraries, and performance optimizations to help you get the most out of your AMD hardware.

Accelerating Vector Search: hipVS and hipRAFT on AMD

Learn how hipVS accelerates vector search on AMD Instinct GPUs, with notebook demos for semantic search, RAG, and recommendation systems.

November 13, 2025 by Sukriti Choudhary, Sujin Philip, Kevin Joseph, Fabricio Flores, Eliot Li, Lalith Narasimhan, Phani Vaddadi, Vish Vadlamani

Practical, Fault‑Robust Distributed Inference for DeepSeek on AMD MI300X

Learn how a small-radius expert parallel design with prefill–decode disaggregation enables scalable, fault-isolated LLM inference on AMD Instinct™ MI300X clusters.

November 12, 2025 by Peng Sun, Andy Luo, Gilbert Lei, Lingpeng Jin, Carlus Huang, Duyi Wang, Mingzhi Liu, Di Tian, Bill He, Jun Chen, Yutong Wu, Jiahao Zhou, Niko Ma

Stability at Scale: AMD’s Full‑Stack Platform for Large‑Model Training

Primus streamlines LLM training on AMD GPUs with unified configs, multi-backend support, preflight validation, and structured logging.

November 04, 2025 by Chaojun Hou, Lei Wei, Liz Li, Yao Fu, Andy Luo, Zhenyu Gu

High-Accuracy MXFP4, MXFP6, and Mixed-Precision Models on AMD GPUs

Learn to leverage AMD Quark for efficient MXFP4/MXFP6 quantization on AMD Instinct accelerators with high accuracy retention.

October 29, 2025 by Lin Zhao, Felix Marty, Spandan Tiwari, Wei Luo, Bowen Bao, Xinjun Niu, Zhaofeng Zhang, Haoyang Li, Ke Wang, Ashish Sirasao

Performance Profiling on AMD GPUs - Part 3: Advanced Usage

Part 3 of our GPU profiling series guides beginners through practical steps to identify and optimize kernel bottlenecks using ROCm tools.

October 23, 2025 by Gina Sitaraman, Thomas Gibson, Luka Stanisic, Giacomo Capodaglio, Alessandro Fanfarillo, Asitav Mishra

ROCm 7.9 Technology Preview: ROCm Core SDK and TheRock Build System

Get introduced to the ROCm Core SDK, and learn to install and build ROCm components easily using TheRock.

October 20, 2025 by Dominic Widdows, Janet Tseng, Scott Todd, Chris Sosa, Saad Rahim

Gumiho: A New Paradigm for Speculative Decoding — Earlier Tokens in a Draft Sequence Matter More

Gumiho boosts LLM inference with early-token accuracy, blending serial + parallel decoding for speed, accuracy, and ROCm-optimized deployment.

October 14, 2025 by Jinze Li, Yixing Xu, Xuanwu Yin, Dong Li, Emad Barsoum

GEMM Tuning within hipBLASLt – Part 2

Learn how to use hipblaslt-bench for offline GEMM tuning in hipBLASLt—benchmark, save, and apply custom-tuned kernels at runtime.

October 09, 2025 by Chia Hung, YangWen Huang, Carson Liao

Elevating 3D Scene Rendering with GSplat

A ROCm port of GSplat for GPU-accelerated rasterization in Gaussian splatting.

October 03, 2025 by Deeksha Goplani, Ish Kool, Karthik Kashyap Thatipamula, Marco Grond, Mark Granroth-Wilding, Pier Luigi Dovesi, Shaghayegh Roohi, Vish Vadlamani, Vikas C Sajjan, Phani Vaddadi

GPU Partitioning Made Easy: Pack More AI Workloads Using AMD GPU Operator

What’s New in AMD GPU Operator: Learn About GPU Partitioning and New Kubernetes Features

October 01, 2025 by Alireza Sariaslani

Matrix Core Programming on AMD CDNA™3 and CDNA™4 architecture

This blog post explains how to use Matrix Cores on the CDNA3 and CDNA4 architectures, with a focus on low-precision data types such as FP16, FP8, and FP4.

September 30, 2025 by Amanzhol Salykov, Andy Luo, Carlus Huang, Peng Sun

An Introduction to Primus-Turbo: A Library for Accelerating Transformer Models on AMD GPUs

Primus streamlines training on AMD ROCm, from fine-tuning to massive pretraining on MI300X GPUs: faster, safer, and easier to debug.

September 19, 2025 by Xiaobo Chen, Wen Xie, Liz Li, Yao Fu, Andy Luo, Zhenyu Gu
