Software tools & optimizations - Page 3

Discover the latest blogs about ROCm software tools, libraries, and performance optimizations to help you get the most out of your AMD hardware.

November 12, 2025

Practical, Fault‑Robust Distributed Inference for DeepSeek on AMD MI300X

Learn how a small-radius expert parallel design with prefill–decode disaggregation enables scalable, fault-isolated LLM inference on AMD Instinct™ MI300X clusters.

./software-tools-optimization/wide-ep-deepseek/README.html

November 04, 2025

Stability at Scale: AMD’s Full‑Stack Platform for Large‑Model Training

Primus streamlines LLM training on AMD GPUs with unified configs, multi-backend support, preflight validation, and structured logging.

./software-tools-optimization/primus-SaFE/README.html

October 29, 2025

High-Accuracy MXFP4, MXFP6, and Mixed-Precision Models on AMD GPUs

Learn to leverage AMD Quark for efficient MXFP4/MXFP6 quantization on AMD Instinct accelerators with high accuracy retention.

./software-tools-optimization/mxfp4-mxfp6-quantization/README.html

October 23, 2025

Performance Profiling on AMD GPUs - Part 3: Advanced Usage

Part 3 of our GPU profiling series guides beginners through practical steps to identify and optimize kernel bottlenecks using ROCm tools.

./software-tools-optimization/profiling-guide/advanced/README.html

October 20, 2025

ROCm 7.9 Technology Preview: ROCm Core SDK and TheRock Build System

Meet the ROCm Core SDK, and learn to install and build ROCm components easily using TheRock.

./software-tools-optimization/therock/README.html

October 14, 2025

Gumiho: A New Paradigm for Speculative Decoding — Earlier Tokens in a Draft Sequence Matter More

Gumiho boosts LLM inference with early-token accuracy, blending serial + parallel decoding for speed, accuracy, and ROCm-optimized deployment.

./software-tools-optimization/gumiho/README.html

October 09, 2025

GEMM Tuning within hipBLASLt – Part 2

Learn how to use hipblaslt-bench for offline GEMM tuning in hipBLASLt—benchmark, save, and apply custom-tuned kernels at runtime.

./software-tools-optimization/hipblaslt-offline-tuning-part2/README.html

October 03, 2025

Elevating 3D Scene Rendering with GSplat

A ROCm port of GSplat: GPU-accelerated rasterization of Gaussian splatting.

./software-tools-optimization/gsplat/README.html

October 01, 2025

GPU Partitioning Made Easy: Pack More AI Workloads Using AMD GPU Operator

What’s New in AMD GPU Operator: Learn About GPU Partitioning and New Kubernetes Features

./software-tools-optimization/gpu-operator-partitioning/README.html

September 30, 2025

Matrix Core Programming on AMD CDNA™3 and CDNA™4 architecture

This blog post explains how to use Matrix Cores on the CDNA3 and CDNA4 architectures, with a focus on low-precision data types such as FP16, FP8, and FP4.

./software-tools-optimization/matrix-cores-cdna/README.html

September 19, 2025

An Introduction to Primus-Turbo: A Library for Accelerating Transformer Models on AMD GPUs

Primus streamlines training on AMD ROCm, from fine-tuning to massive pretraining on MI300X GPUs, making it faster, safer, and easier to debug.

./software-tools-optimization/primus-large-models/README.html

September 11, 2025

Efficient LLM Serving with MTP: DeepSeek V3 and SGLang on AMD Instinct GPUs

This blog shows you how to speed up LLM inference with Multi-Token Prediction in DeepSeek V3 and SGLang on AMD Instinct GPUs.

./software-tools-optimization/mtp/README.html
