Software tools & optimizations#
Discover the latest blogs about ROCm software tools, libraries, and performance optimizations to help you get the most out of your AMD hardware.
Performance Profiling on AMD GPUs - Part 3: Advanced Usage
Part 3 of our GPU profiling series guides beginners through practical steps to identify and optimize kernel bottlenecks using ROCm tools
ROCm 7.9 Technology Preview: ROCm Core SDK and TheRock Build System
Introduce ROCm Core SDK, and learn to install and build ROCm components easily using TheRock.
Gumiho: A New Paradigm for Speculative Decoding — Earlier Tokens in a Draft Sequence Matter More
Gumiho boosts LLM inference with early-token accuracy, blending serial + parallel decoding for speed, accuracy, and ROCm-optimized deployment.
GEMM Tuning within hipBLASLt– Part 2
Learn how to use hipblaslt-bench for offline GEMM tuning in hipBLASLt—benchmark, save, and apply custom-tuned kernels at runtime.
Elevating 3D Scene Rendering with GSplat
ROCm Port of GSplat - GPU accelerated rasterization of Gaussian splatting
GPU Partitioning Made Easy: Pack More AI Workloads Using AMD GPU Operator
What’s New in AMD GPU Operator: Learn About GPU Partitioning and New Kubernetes Features
Matrix Core Programming on AMD CDNA™3 and CDNA™4 architecture
This blog post explains how to use Matrix Cores on CDNA3 and CDNA4 architecture, with a focus on low-precision data types such as FP16, FP8, and FP4
An Introduction to Primus-Turbo: A Library for Accelerating Transformer Models on AMD GPUs
Primus streamlines training on AMD ROCm, from fine-tuning to massive pretraining on MI300X GPUs—faster, safer, and easier to debug
Efficient LLM Serving with MTP: DeepSeek V3 and SGLang on AMD Instinct GPUs
This blog will show you how to speed up LLM inference with Multi-Token Prediction in DeepSeek V3 & SGLang on AMD Instinct GPUs
GEMM Tuning within hipBLASLt - Part 1
We introduce a hipBLASLt tuning tool that lets developers optimize GEMM problem sizes and integrate them into the library.
Unleashing AMD Instinct™ MI300X GPUs for LLM Serving: Disaggregating Prefill & Decode with SGLang
Learn how prefill–decode disaggregation improves LLM inference by reducing latency, enhancing throughput, and optimizing resource usage.
AITER-Enabled MLA Layer Inference on AMD Instinct MI300X GPUs
AITER boosts DeepSeek-V3’s MLA on AMD MI300X GPUs with low-rank projections, shared KV paths & matrix absorption for 2× faster inference.