Recent Posts
A Practical Guide to Running LLMs on AMD Radeon™ GPUs
This guide describes how to run LLMs on AMD Radeon™ GPUs using a range of partner frameworks, tools, and runtimes, with step-by-step setup instructions and performance optimization tips.
Building and Deploying Custom hipBLASLt Libraries on AMD Instinct GPUs
Learn how to manage hipBLASLt environments with custom source builds, RPM/DEB packaging, and version switching on AMD Instinct GPUs.
Comparative Analysis of Scale-Out RoCE Network Traffic Patterns and Loads in Training Large Language Models
Compares RoCE network traffic patterns and loads across GPT-4, Llama 3, DeepSeek-V2, and Grok 4.0 LLM training to guide AI infrastructure design.
Efficient and Portable 3D Explorable World Generation on AMD GPUs
Learn how to run Matrix3D world generation on AMD GPUs more smoothly and efficiently.
Utilizing AMD Schola and UnrealRoboticsLab with AMD ROCm™ Software to Train a Robotic Arm
Learn how to combine MuJoCo physics, Unreal Engine, and Schola to train a 6-DOF robot arm with reinforcement learning on AMD hardware.
ATOMesh: Unlocking AMD Hardware for Scalable LLM Serving
Learn how ATOMesh unlocks scalable LLM serving on AMD Instinct GPUs through distributed inference orchestration and ROCm-native execution.
Technical Dive into AMD's MLPerf Training v6.0 Submission
This post shares the technical details behind the results we achieved in our MLPerf Training v6.0 submission.
Reproducing AMD MLPerf Training v6.0 Submission Result
Learn how to reproduce AMD's MLPerf Training v6.0 submission result.
ATOM: Unlocking Extreme AMD Instinct Inference with Software-Hardware Co-Optimization
A technical walkthrough of ATOM on AMD Instinct GPUs, covering architecture, feature scope, model coverage, and practical benchmark dashboard usage.
Low Kruskal-Rank Adaptation
Learn how Kruskal rank can enhance LoRA by replacing the conventional matrix-rank formulation for more efficient training.
Productionizing TurboQuant on AMD GPUs for KV-Cache-Bound LLM Inference
Describes productionizing TurboQuant 4-bit KV-cache quantization on AMD GPUs via vLLM, with custom kernels and accuracy analysis on agentic workloads.
Dropless MoE Training in JAX with Primus-Turbo
Learn how to train dropless MoE in JAX/MaxText with Primus-Turbo's grouped GEMM and DeepEP all-to-all for faster, more memory-efficient training.