AI Blogs
DP Attention and TBO for DeepSeek-V4 on MI355X
Learn how ATOM improves DeepSeek-V4 inference on AMD Instinct MI355X GPUs with DP Attention scheduling and Two-Batch Overlap.
Faster Kimi-K2.5-W4A8 Decoding with EAGLE3 on AMD Instinct™ MI325X
Add EAGLE3 speculative decoding and three MoE/FMHA kernel-tuning patches to Kimi-K2.5-W4A8 inference on AMD Instinct™ MI325X with SGLang, AITER, and FlyDSL.
A Practical Guide to Running LLMs on AMD Radeon™ GPUs
This guide describes how to run LLMs on AMD Radeon™ GPUs using a range of partner frameworks, tools, and runtimes, with step-by-step setup instructions and performance optimization tips.
Building and Deploying Custom hipBLASLt Libraries on AMD Instinct GPUs
Learn how to manage hipBLASLt environments with custom source builds, RPM/DEB packaging, and version switching on AMD Instinct GPUs.
Comparative Analysis of Scale-Out RoCE Network Traffic Patterns and Loads in Training Large Language Models
Compares RoCE network traffic patterns and loads across GPT-4, Llama 3, DeepSeek-V2, and Grok 4.0 LLM training to guide AI infrastructure design.
Utilizing AMD Schola and UnrealRoboticsLab with AMD ROCm™ Software to Train a Robotic Arm
Learn how to combine MuJoCo physics, Unreal Engine, and Schola to train a 6-DOF robot arm with reinforcement learning on AMD hardware.
ATOMesh: Unlocking AMD Hardware for Scalable LLM Serving
Learn how ATOMesh unlocks scalable LLM serving on AMD Instinct GPUs through distributed inference orchestration and ROCm-native execution.
Technical Dive into AMD's MLPerf Training v6.0 Submission
In this blog, we share the technical details of how we achieved the results in our MLPerf Training v6.0 submission.
Reproducing AMD MLPerf Training v6.0 Submission Result
Learn how to reproduce AMD's MLPerf Training v6.0 submission result.
ATOM: Unlocking Extreme AMD Instinct Inference with Software-Hardware Co-Optimization
A technical walkthrough of ATOM on AMD Instinct GPUs, covering architecture, feature scope, model coverage, and practical benchmark dashboard usage.
Low Kruskal-Rank Adaptation
Learn how Kruskal rank can enhance LoRA by replacing the conventional matrix-rank formulation for more efficient training.
Productionizing TurboQuant on AMD GPUs for KV-Cache-Bound LLM Inference
Productionized TurboQuant 4-bit KV-cache quantization on AMD GPUs via vLLM, with custom kernels and accuracy analysis on agentic workloads.