Posted in 2026
Deploy and Customize AMD Solution Blueprints
- 02 April 2026
AMD Solution Blueprints are ready-to-deploy, customizable reference applications built with AMD Inference Microservices (AIMs). They offer a microservice solution for a range of use cases, from standard chat interfaces to agentic frameworks, serving as both starting points for development and example implementations.
Reproducing the AMD MLPerf Inference v6.0 Submission Result
- 01 April 2026
MLPerf Inference v6.0 marked AMD’s fourth round of submissions to MLPerf Inference. This blog provides a step-by-step guide to reproducing AMD’s results on different vendor systems.
AMD Instinct™ GPUs MLPerf Inference v6.0 Submission
- 01 April 2026
The results for the MLPerf Inference v6.0 benchmark were released on April 1, 2026. In this round, AMD showcased the performance of the MI355X system, as well as the capability and versatility of the ROCm software stack.
Training a Robotic Arm Using MuJoCo and JAX on AMD Hardware with ROCm™
- 31 March 2026
Training a robotic arm to pick up an object and place it somewhere else may sound straightforward, but teaching a robot to do this reliably in the real world is one of the harder problems in robotics. Traditional approaches rely on hand-tuned motion planning and carefully scripted control logic, which is brittle and time-consuming to maintain as environments change.
Leveraging AMD AI Workbench to Scale LLM Inference for Optimal Resource Utilization
- 31 March 2026
Explore how autoscaling with AMD Inference Microservices (AIMs) and AMD AI Workbench can automatically scale your resources in response to shifting AI workload demand. AI inference can be computationally intensive, with resource requirements that vary depending on traffic, e.g., the number of inference requests your workload receives at any given time. Autoscaling addresses this by scaling resources up during peak traffic to maintain performance, and scaling them back down during quieter periods to reduce cost and resource consumption.
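The scale-up/scale-down decision described above boils down to a proportional replica count. A minimal sketch of such a rule follows (the function name, target rate, and bounds are illustrative assumptions, not the actual AIM or AI Workbench autoscaler logic):

```python
import math

def desired_replicas(requests_per_sec: float,
                     target_per_replica: float,
                     min_replicas: int = 1,
                     max_replicas: int = 8) -> int:
    """Scale replicas proportionally to load, clamped to [min, max]."""
    if target_per_replica <= 0:
        raise ValueError("target_per_replica must be positive")
    needed = math.ceil(requests_per_sec / target_per_replica)
    return max(min_replicas, min(max_replicas, needed))
```

For example, at 250 requests/s with a target of 100 requests/s per replica, the rule asks for 3 replicas; at zero traffic it falls back to the configured minimum.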
Programming Tensor Descriptors in Composable Kernel (CK)
- 25 March 2026
Writing efficient GPU kernels requires more than knowing the API—it demands a deep understanding of the underlying concepts, from GPU architecture to low-level programming patterns. This blog series demystifies GPU kernel programming on AMD GPUs by breaking down common kernels into their fundamental building blocks. Rather than treating GPU programming as a black box, each blog focuses on a specific concept, starting from first principles and building up to complete implementations with simple, insightful example code. In this blog, you will learn one of the most fundamental concepts in Composable Kernel (CK): the TensorDescriptor—a powerful abstraction for managing multi-dimensional data layouts and transformations. By the end of this series, you will be able to not only understand existing GPU kernels but also design and optimize your own.
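As a taste of the concept, the core job of a tensor descriptor, mapping a multi-dimensional index to a linear memory offset through lengths and strides, can be sketched in a few lines of Python (a conceptual illustration only; CK's actual TensorDescriptor is a C++ compile-time abstraction with a different API):

```python
class TensorDescriptor:
    """Maps an N-D index to a flat memory offset via per-dimension strides."""
    def __init__(self, lengths, strides):
        assert len(lengths) == len(strides)
        self.lengths = lengths
        self.strides = strides

    def offset(self, index):
        # Each coordinate must lie within its dimension's length.
        assert all(0 <= i < n for i, n in zip(index, self.lengths))
        return sum(i * s for i, s in zip(index, self.strides))

# A 4x6 row-major matrix: stride 6 along rows, 1 along columns.
row_major = TensorDescriptor([4, 6], [6, 1])
# The same memory viewed column-major: a layout transform, not a copy.
col_major = TensorDescriptor([6, 4], [1, 6])
```

Swapping the strides reinterprets the same buffer under a different layout, which is the essence of the descriptor transformations the series builds on.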
GROMACS on AMD Instinct GPUs: A Complete Build Guide
- 24 March 2026
Molecular dynamics simulations power breakthroughs in drug discovery, materials science, and computational biology. GROMACS stands as one of the most widely used molecular dynamics engines, and pairing it with AMD’s latest GPU accelerators unlocks exceptional simulation throughput. This guide walks you through installing a complete GROMACS stack with OpenMPI support on AMD MI300X and MI355X systems — whether you’re deploying on bare metal or in containers.
Engineering Qwen-VL for Production: Vision Module Architecture and Optimization Practices
- 24 March 2026
Vision–language models (VLMs) have rapidly evolved from research prototypes into foundational components of modern AI systems, enabling unified reasoning over images, videos, and text. As model scale and application complexity increase, the focus of VLM development has shifted from isolated benchmark performance toward architectural efficiency, multimodal alignment, and production readiness. Within this landscape, Qwen-VL stands out as a practical and extensible vision–language model that emphasizes modular visual encoding, flexible multimodal integration, and scalability in real-world deployments. Rather than treating vision as a peripheral add-on, Qwen-VL adopts a tightly integrated design that allows visual representations to participate deeply in language reasoning, making it particularly well suited for both large-scale inference and domain-specific customization.
Accelerating Kimi-K2.5 on AMD Instinct™ MI300X: Optimizing Fused MoE with FlyDSL
- 24 March 2026
With the recent surge in popularity of OpenClaw [1], its officially recommended model, Kimi-K2.5 [2], has taken the AI community by storm. As developers and researchers flock to this powerful Mixture-of-Experts (MoE) LLM, the need for high-performance inference on cutting-edge hardware has never been more critical.
Edge-to-Cloud Robotics with AMD ROCm: From Data Collection to Real-Time Inference
- 23 March 2026
This blog walks through a full edge-to-cloud robotics AI solution, built entirely on the AMD ecosystem and the Hugging Face LeRobot framework. In case you are not familiar, LeRobot is an open source platform from Hugging Face that provides pre-trained models, datasets, and tools for real-world robotics using PyTorch.
AMD Device Metrics Exporter v1.4.2: Enhanced Observability, Deeper RAS Insights, and Smarter GPU Telemetry for Modern HPC & AI Clusters
- 23 March 2026
Modern GPU‑accelerated systems—whether powering massive AI training workloads or tightly scheduled HPC environments—depend heavily on high‑quality telemetry. Understanding how each GPU behaves under load, how often it hits power or thermal boundaries, and how reliably the hardware performs is central to maintaining performance, diagnosing failures, and tuning systems at scale.
hipBLASLt Online GEMM Tuning
- 19 March 2026
This blog post introduces the integration of hipBLASLt Online GEMM Tuning into LLM frameworks, illustrated through an example implementation of RTP-LLM. Developed by the AMD Quark Team, hipBLASLt Online Tuning provides a user-friendly approach to improving GEMM performance by enabling runtime tuning without requiring additional offline tuning steps.
Utilizing AMD Instinct GPU Accelerators for Weather and Precipitation Forecasting with NeuralGCM
- 19 March 2026
In recent years, the landscape of weather forecasting has evolved tremendously, employing cutting-edge AI technologies to enhance prediction accuracy and speed. In previous blogs, we have demonstrated how to run several state-of-the-art AI weather forecasting models, such as Pangu-Weather, GenCast, and Aurora. Following that, this blog focuses on emerging trends in weather forecasting models, particularly the innovative NeuralGCM, which melds the strengths of General Circulation Models (GCMs) and Machine Learning (ML). We will briefly outline the design of NeuralGCM and its hybrid approach for weather and precipitation forecasting. We will then go through the required environments, installation steps, and the inference process for generating forecasts and creating plots to compare the outputs to the ground truth provided by ERA5 data.
Multi-Node Distributed Inference for Diffusion Models with xDiT
- 18 March 2026
The first two authors (Lehtiranta, Kemppi) contributed equally to this work.
GROMACS Performance on AMD Instinct MI355X
- 13 March 2026
Are you planning a hardware upgrade for your molecular dynamics workflows? In this blog, we benchmark GROMACS on AMD’s latest Instinct MI355X GPU and compare it head-to-head with the MI300X, demonstrating significant throughput improvements that accelerate time-to-results for life-science research. You will see exactly how much faster MI355X runs the standard ADH dodec benchmark across 1 to 8 GPUs. Use these results to make informed decisions about your next HPC deployment.
FP8 GEMM Optimization on AMD CDNA™4 Architecture
- 10 March 2026
This blog post continues our previous blog Matrix Core Programming on AMD CDNA™3 and CDNA™4 Architecture, which introduced Matrix Cores and demonstrated how to use them in HIP kernels.
Getting Started with ComfyUI on AMD Radeon™ RX 9000 Series GPUs
- 09 March 2026
ComfyUI has become a widely adopted and versatile node-based interface for Stable Diffusion and other generative AI models, gaining significant traction within the AI content creation community. Unlike traditional web-based interfaces, ComfyUI provides a node-based workflow system that gives users complete control over their image and video generation pipelines. Its modular architecture allows for complex workflows involving multiple models, LoRAs, ControlNets, and custom processing steps.
Agentic Diagnosis for LLM Training at Scale
- 09 March 2026
In MaxText-Slurm: Production-Grade LLM Training with Built-In Observability, we introduced MaxText-Slurm — an open-source launch system and observability stack for running MaxText LLM training on AMD Instinct GPU clusters. We showed how a unified Prometheus time-series database (TSDB) collects GPU, host, network, and training metrics into a single queryable store, persisted to disk so that no data is lost even if the job crashes.
HPC Coding Agent - Part 3: MCP Tool for Profiling
- 06 March 2026
In this blog, we build an AI agent specialized in profiling and optimizing GPU-accelerated applications within High-Performance Computing (HPC) environments. Using open-source tools, we create a state-of-the-art agent and enhance its profiling capabilities through a custom Model Context Protocol (MCP) server. This server provides the agent with tools to leverage AMD’s profiling utilities for analyzing application performance on AMD GPUs.
Fine-Tuning AI Surrogate Models for Physics Simulations with Walrus on AMD Instinct GPU Accelerators
- 06 March 2026
Physics simulations are used for studying complex systems and are essential where experiments are difficult, expensive, or impossible. In our context, a simulation means numerically solving mathematical equations that are believed to describe a physical system and evolving them forward in time on a computer. They enable controlled exploration of physical behavior for science and engineering, but at a high computational cost, which in most cases increases rapidly with scale. Our focus is on continuum dynamics, where the system is represented by fields such as density, velocity, or temperature, defined on a grid and evolving over time. High-resolution physics simulations are slow to run, sensitive to numerical error and impractical for large parameter spaces. Surrogate models address these limitations by learning to approximate simulation dynamics directly from data. Once trained, they can produce fast predictions at a fraction of the cost, giving researchers the ability to rapidly explore parameter space and generate long rollouts.
Ensemble High-Resolution Weather Forecasting on AMD Instinct GPU Accelerators
- 06 March 2026
Weather prediction is fraught with uncertainty, as is the inference of any real-world phenomenon dependent on physical observations. The consequence is that any estimated current state of the atmosphere, as well as any forecast, carries a level of uncertainty. As such, any weather forecasting model, whether AI or traditional, needs to produce reasonable outputs despite the inherent uncertainty of inputs, and, if possible, quantify the uncertainty of the outputs for the user in some practical fashion.
HPC Coding Agent - Part 2: An MCP Tool for Code Optimization with OpenEvolve
- 04 March 2026
Large language models (LLMs) and LLM-driven agents (AI agents) are trained on massive amounts of data, a considerable portion of which consists of code, and both models and agentic coding services are developed specifically for the purpose of coding. For users who want to optimize their code for certain purposes, for example runtime or memory efficiency, LLMs may produce plausible solutions, but these are often not optimal.
Streamlining Recommendation Model Training on AMD Instinct™ GPUs
- 02 March 2026
Recommendation model training and inference workloads represent a significant portion of computational requirements across industries including e-commerce, social media and content streaming platforms. Unlike LLMs, recommendation models result in complex and often imbalanced communication across GPUs, along with a higher load on the CPU-GPU interconnect. The ROCm training docker [1] now includes essential libraries for recommendation model training. This blog demonstrates the functionality and ease of training recommendation models using ROCm, along with suggestions for improved configuration of these workloads. We also highlight the inherent benefits of the large HBM size on AMD Instinct™ GPUs for recommendation workloads.
MaxText-Slurm: Production-Grade LLM Training with Built-In Observability
- 02 March 2026
Training large language models (LLMs) at scale on GPU clusters is not just a compute problem — it is an operations problem. Launching multi-node distributed training, keeping it running reliably, and diagnosing failures when they happen all require tooling that most training frameworks do not provide. MaxText-Slurm is an open-source launch system and observability stack that bridges this gap for MaxText on AMD Instinct GPU clusters managed by Slurm.
Exploring Use Cases for Scalable AI: Implementing Ray with ROCm 7 Support for Efficient ML Workflows
- 27 February 2026
This blog builds on insights from our previous blog post, which introduced Ray 2.48.0.post0 running on ROCm 6.2 and demonstrated Reinforcement Learning from Human Feedback (RLHF) with verl 0.3.0.post0 and vLLM 0.6.4 on AMD GPUs. In this follow‑up, we introduce Ray 2.51.1 with ROCm 7.0.0, verl 0.6.0, and vLLM 0.11.0.dev, highlighting the new performance benefits and capabilities for large‑scale RLHF workloads.
PyTorch Offline Tuning with TunableOp
- 24 February 2026
In an earlier blog post, we explored how PyTorch TunableOp can potentially accelerate models through online tuning, where PyTorch benchmarks and selects optimal BLAS kernels during model execution. While online tuning is effective, it introduces overhead due to the time needed to execute the ML model end to end. If this is done once, the overhead may be acceptable, but for repeated tuning it may be cost-prohibitive to keep re-running the model.
JAX-AITER: Bringing AMD’s Optimized AI Kernels to JAX on ROCm™
- 24 February 2026
If you’re building large models in JAX on AMD GPUs, you want fast, reliable kernels without spending weeks tuning them yourself. That’s exactly the need that led us to create JAX-AITER.
Getting Started with AMD Resource Manager: Efficient Sharing of AMD Instinct™ GPUs for R&D Teams and AI Practitioners
- 24 February 2026
In this blog, you will learn how to use AMD Resource Manager and its components for centralized AI infrastructure governance. It’s part of the AMD Enterprise AI Suite, a full-stack solution for developing, deploying and running AI workloads on a Kubernetes platform designed to support AMD compute. The AMD Resource Manager provides a user-friendly graphical user interface (GUI) and Command Line Interface (CLI) with a unified control plane that simplifies tasks such as managing compute clusters, user access, monitoring resource utilization, and allocating the right compute quotas to the right projects.
Primus-Pipeline: A More Flexible and Scalable Pipeline Parallelism Implementation
- 23 February 2026
FlyDSL: Expert GPU Kernel Development with the Ease of MLIR Python Native DSL on AMD GPUs
- 20 February 2026
The AMD ROCm™ software ecosystem continues to grow rapidly as developers build new kernels, compilers, and AI frameworks optimized for AMD GPUs. As workloads become more complex and the demand for both performance and agility increases, a clear need has emerged for a modern, flexible, and open GPU kernel authoring framework.
Introducing hipThreads: A C++ - Style Concurrency Library for AMD GPUs
- 19 February 2026
In this blog, you will learn how to accelerate C++ code developed for the CPU to run on AMD GPUs using hipThreads by incrementally porting familiar std::thread patterns to GPU-resident hip::thread code. We walk through a step-by-step SAXPY example, explain key concepts like persistent threads and fibers, and share real performance results to help you evaluate when this model fits your workload.
Unlocking Sparse Acceleration on AMD GPUs with hipSPARSELt
- 17 February 2026
Sparse computation is a cornerstone of modern AI acceleration. As models like LLaMA and DINOv2 ViT-L scale in size and complexity, the demand for efficient matrix operations becomes increasingly critical. To address this, semi-structured sparsity, also known as the 2:4 structured sparsity pattern, has emerged as a powerful optimization technique.
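The 2:4 pattern means that within every contiguous group of four elements along a row, at most two are nonzero, which is what lets the hardware skip half the multiplies. A quick validity check can be written as follows (an illustrative helper, not part of hipSPARSELt):

```python
def is_2_to_4_sparse(row, group=4, max_nonzero=2):
    """Check that every group of `group` elements has at most `max_nonzero` nonzeros."""
    if len(row) % group != 0:
        return False  # rows must tile evenly into groups of four
    return all(
        sum(1 for x in row[i:i + group] if x != 0) <= max_nonzero
        for i in range(0, len(row), group)
    )
```

A row like `[1, 0, 2, 0, 0, 0, 3, 4]` satisfies the constraint, while `[1, 2, 3, 0]` does not, since its single group holds three nonzeros.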
Advanced MXFP4 Quantization: Combining Fine-Tuned Rotations with SmoothQuant for Near-Lossless Compression
- 17 February 2026
As language models continue to grow in popularity, reducing the cost of inference and accelerating model serving have become key challenges. Quantization offers a powerful solution by reducing the model size and leveraging inexpensive math operations, for example, using low-bitwidth formats like OCP MXFP4 (4.25 bits) available in AMD Instinct MI350X and MI355X accelerators.
Adaptive Top-K Selection: Eliminating Performance Cliffs Across All K Values on AMD GPUs
- 17 February 2026
Top-K selection is critical for LLMs and RAG workloads, yet standard Radix Sort implementations often suffer from performance cliffs at small K values due to fixed initialization overheads. In our AITER library (introduced in our previous blog [1]), we originally utilized an 11-bit radix sort for Top-K selection. While this approach excels at scale, we identified a critical efficiency gap for the lightweight filtering often required during modern inference.
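The underlying tradeoff is algorithmic: heap-style selection does roughly n·log(k) work and wins at small k, while sort-based approaches amortize their fixed costs better as k grows. A CPU-side sketch of dispatching on k (the threshold value is a made-up illustration, not the AITER kernel's heuristic):

```python
import heapq

def adaptive_topk(values, k, small_k_threshold=64):
    """Return the k largest values, picking the algorithm based on k."""
    if k >= len(values):
        return sorted(values, reverse=True)
    if k <= small_k_threshold:
        # Heap-based selection: cheap for small k, avoids a full sort.
        return heapq.nlargest(k, values)
    # Large k: a single sort is simpler and amortizes well.
    return sorted(values, reverse=True)[:k]
```

The GPU version faces the same cliff in reverse: a fixed-cost radix sort is superb at scale but pays its full initialization overhead even when k is tiny, which is the gap the adaptive approach closes.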
Elevate Your LLM Inference: Autoscaling with Ray, ROCm 7.0.0, and SkyPilot
- 13 February 2026
This blog explores autoscaling of inference workloads in Ray Serve with a vLLM backend on AMD Instinct™ GPUs for large language models (LLMs). Furthermore, you will learn how to scale beyond a single cluster using SkyPilot, which enables multicloud scaling for Ray Serve. Combined with the AMD ROCm™ software platform, this creates a unified, cloud-agnostic platform that scales distributed LLM inference from single-GPU to multi-cluster deployments.
Reinforcement Learning from Human Feedback on AMD GPUs with verl and ROCm 7.0.0
- 12 February 2026
In our previous blog post, we introduced Volcano Engine Reinforcement Learning for LLMs (verl) 0.3.0.post0 with ROCm 6.2 and vLLM 0.6.4. In this blog post, we will provide you with an overview of verl 0.6.0 with ROCm 7.0.0 and vLLM 0.11.0.dev and its benefits for large-scale reinforcement learning from human feedback (RLHF). You will also learn about the modifications made to optimize verl performance on AMD Instinct™ MI300X GPUs. Next, you will walk through building the Docker image on your system, along with training scripts for single-node and multi-node setups. Lastly, we provide you with verl performance results, focusing on throughput and convergence accuracy achieved on AMD Instinct MI300X GPUs. Follow this guide to get started with verl on AMD Instinct GPUs and accelerate your RLHF training with ROCm-optimized performance.
Solution Blueprints: Accelerating AI Deployment with AMD Enterprise AI
- 11 February 2026
AMD Enterprise AI Suite standardizes the inference layer with AMD Inference Microservices (AIMs), a set of containers for optimized model serving on AMD Instinct™ GPUs with validated profiles and OpenAI-compatible APIs. However, production-grade agentic and generative AI applications need more than inference endpoints. You need document loaders, embedding pipelines, vector databases, RAG logic, agent orchestration, and user interfaces. These components need to be wired together with proper Kubernetes resource definitions, GPU allocation, service discovery, and configuration management. This blog walks through the technical implementation of Solution Blueprints: how they’re structured, how they use Helm application charts for code reuse, and the patterns they demonstrate for multi-container orchestration. While the Enterprise AI Suite Overview covers the platform and the AIMs blog covers inference, this post focuses on application architecture and deployment patterns.
Digital Twins on AMD: Building Robotic Simulations Using Edge AI PCs
- 09 February 2026
Digital twins are becoming a core tool in robotics, automation, and intelligent systems. They provide a virtual representation of a physical system, allowing developers to validate robot behaviors, test motion strategies, and generate datasets before deploying anything in the real world.
Building Robotics Applications with Ryzen AI and ROS 2
- 09 February 2026
This blog showcases how to deploy power-efficient Ryzen AI perception models with ROS 2 - the Robot Operating System. We utilize the Ryzen AI Max+ 395 (Strix-Halo) platform, which is equipped with an efficient Ryzen AI NPU and iGPU. The Ryzen AI CVML Library is used to deploy supported models efficiently on the Ryzen AI platform. All of the code is available on GitHub in the AMD Ryzers repository and was originally presented at ROSCon’25.
Resilient Large-Scale Training: Integrating TorchFT with TorchTitan on AMD GPUs
- 08 February 2026
Training large AI models on AMD GPUs demands unwavering stability and robust fault-tolerance capabilities at cluster scale. Yet today’s ROCm-based multi-node GPU deployments often rely on brittle checkpoint-and-restart mechanisms to recover from failures. This approach wastes precious compute cycles and slows down training as model sizes and cluster scales grow. To address these challenges, we integrated PyTorch’s native fault-tolerance framework—TorchFT—with the TorchTitan training framework on AMD’s Primus-SaFE Kubernetes platform, achieving resilient, checkpoint-less training at hundred-GPU scale. This blog builds upon our previous work on the Primus ecosystem—for background on the platform architecture, see our earlier posts on Primus-SaFE, the Primus training framework, and training large models with Primus.
Accelerating Graph Layout with AI and ROCm on AMD GPUs
- 06 February 2026
Learn how easy it is to implement established graph algorithms, and deploy them on AMD GPUs with immediate performance improvements, using AI as a coding partner!
Micro-World: First AMD Open-Source World Models for Interactive Video Generation
- 05 February 2026
World models aim to simulate aspects of the real world, enabling more effective training and exploration of AI agents and ultimately paving the way toward richer forms of digital life. Games can be viewed as another form of world simulation, and their data is relatively easy to collect and annotate, making them a natural playground for building and studying world models. GameNGen [1] has demonstrated the potential of this direction, while works such as GameFactory [2], Matrix-Game [3], and Hunyuan-GameCraft [4] further showcase strong performance in game-oriented world modeling. However, these projects are either fully closed-source or release only partial components (typically inference-only), which limits reproducibility and community-driven progress.
Foundations of Molecular Generation with GP-MoLFormer on AMD Instinct MI300X Accelerators
- 03 February 2026
Nearly every technological breakthrough we celebrate begins with a material that did not exist before someone imagined it. Modern computing rests on engineered semiconductors, energy storage depends on carefully designed electrolytes, and sustainable technologies increasingly rely on alternatives to scarce or environmentally costly rare earth elements. Designing such materials with specific properties at scale is one of the most challenging and consequential problems in science.
Debugging NaN Results in CK Tile GEMM: A rocgdb Detective Story
- 30 January 2026
When developing high-performance GPU kernels, subtle bugs can lead to catastrophic failures like NaN (Not-a-Number) outputs. This post chronicles our journey of debugging a tricky NaN issue in AMD’s Composable Kernel (CK) Tile GEMM implementation using rocgdb. What started as mysterious NaN outputs ended with discovering a single-character typo that corrupted the data distribution.
ROCm 7.2: Smarter, Faster, and More Scalable for Modern AI Workloads
- 22 January 2026
Modern AI workloads demand more than raw compute—they require a tightly integrated software stack that can extract maximum performance, scale efficiently across systems, and operate reliably in production environments. With the latest ROCm 7.2 release, we’re delivering a broad set of optimizations and software enhancements designed to improve developer productivity, runtime performance, and enterprise readiness.
Nitro-AR: A Compact AR Transformer for High-Quality Image Generation
- 22 January 2026
Recent years have witnessed remarkable progress in image generation, driven by two major modeling paradigms: diffusion-based models and autoregressive (AR) models. Building upon our previously released Nitro-E, a light-weight diffusion model for fast image synthesis, this blog explores a complementary direction, applying the architecture in an AR framework.
LLM Inference Optimization Using AMD GPU Partitioning
- 22 January 2026
As AI and HPC workloads grow in complexity and scale, there’s a rising need for precise GPU resource management, robust memory isolation, and efficient multi-tenant scheduling. AMD’s Instinct™ MI300 series addresses this by offering dynamic partitioning capabilities. These allow a single physical device to be segmented into multiple isolated partitions, each tailored to the needs of specific workloads. This flexibility is particularly beneficial for AI inference tasks, where different models or instances may require distinct resource allocations. Maximizing the utilization of GPU resources while ensuring that each workload operates within its own isolated environment is crucial for performance and reliability.
ROCm Becomes a First-Class Platform in the vLLM Ecosystem
- 21 January 2026
As the generative AI ecosystem matures, vLLM is embracing a multivendor approach, and the quality of support across hardware platforms has become a defining priority: developers expect consistent, high-performance behavior no matter which GPU they choose. Today, we are proud to announce a major realization of that vision: AMD ROCm™ is now a first-class platform in the vLLM ecosystem.
Quickly Developing Powerful Flash Attention Using TileLang on AMD Instinct MI300X GPU
- 20 January 2026
Against the backdrop of the rapid development of the AMD ROCm™ software ecosystem, the high barrier to operator development has long been a bottleneck. The emergence of TileLang provides developers with an efficient solution. As an emerging AI operator development framework, TileLang encapsulates low-level GPU details with concise syntax, enabling developers to fully tap into the computing potential of AMD GPUs without requiring in-depth knowledge of low-level languages such as HIP. The AMD Instinct™ MI300X, a flagship GPU for AI workloads, boasts ultra-high bandwidth memory and powerful compute units, but it requires adaptive high-performance operators to unleash its capabilities. In this blog, we will take Flash Attention, a key kernel in both LLM training and inference, as an example to fully demonstrate the development process based on TileLang on the MI300X, highlighting the dual benefits of efficiency and performance that TileLang brings to AMD operator development.
Deep Dive into Primus: High-Performance Training for Large Language Models
- 15 January 2026
Primus is the AMD unified training framework designed to deliver high-performance, scalable large language model (LLM) training across multiple backends – including TorchTitan and Megatron-LM. It provides a consistent CLI interface, while each backend ships with carefully optimized configurations for popular open-source models. These backend-specific presets ensure the best out-of-the-box performance on AMD Instinct™ GPUs. In this deep dive, we walk through the best practices for achieving peak performance when training dense LLMs on Primus.
Applying Compute Partitioning for Workloads on MI300X GPUs
- 14 January 2026
This blog explains how to use AMD GPU compute partitioning to increase throughput and utilization, and reduce time-to-results, for two different types of workloads.
Reimagining GPU Allocation in Kubernetes: Introducing the AMD GPU DRA Driver
- 13 January 2026
In this blog, you’ll learn how Kubernetes’ new Dynamic Resource Allocation (DRA) framework and the AMD GPU DRA Driver turn GPUs into first-class, attribute-aware resources. We’ll walk through how to publish AMD Instinct GPUs via ResourceSlices, request specific models and partition profiles with declarative ResourceClaims, and observe allocations through Kubernetes-native lifecycle objects, so you can simplify cluster operations compared to traditional Device Plugin–based setups.
Installing AMD HIP-Enabled GROMACS on HPC Systems: A LUMI Supercomputer Case Study
- 12 January 2026
Running molecular dynamics (MD) simulations efficiently is critical for accelerating scientific discovery in many life science use cases, e.g., drug discovery. GROMACS is a widely used, GPU-accelerated molecular dynamics engine powering many life science workflows, but its performance can vary significantly depending on the installation method and hardware configuration. For broader context on GROMACS applications in drug design, see recent research on GROMACS in cloud environments for alchemical drug design.
Athena-PRM: Enhancing Multimodal Reasoning with Data-Efficient Process Reward Models
- 12 January 2026
This blog introduces Athena-PRM, a multimodal Process Reward Model (PRM) designed to evaluate the reward score for each step in solving complex reasoning problems. To efficiently generate high-quality process-labeled data, we leverage prediction consistency between weak and strong completers as a criterion for identifying reliable process labels. We also develop two effective strategies to improve the performance of PRMs: ORM initialization and up-sampling for negative data.
Using Gradient Boosting Libraries on MI300X for Financial Risk Prediction
- 08 January 2026
In the world of machine learning, the choice of hardware can significantly impact the performance and efficiency of model training and prediction. Gradient Boosting Machines (GBMs) benefit greatly from GPU parallelization in several key algorithmic steps involving independent, repetitive computations. The most substantial speedup comes from histogram construction and best split searching, as these can be executed in parallel across features and candidate splits using thousands of GPU cores, vastly accelerating tree building. Additionally, the calculation of gradients and Hessians for each data point is naturally parallelizable and well suited to GPU architectures. Other operations—such as leaf value updates, data preprocessing (like quantization and normalization), and batch predictions—can also be distributed efficiently across GPU threads. By exploiting parallelism in these stages, GPUs dramatically reduce training and prediction time for GBMs, making them ideal for large datasets or scenarios where quick model iteration is crucial.
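The histogram construction and split search described above can be sketched for a single feature: accumulate gradient statistics per bin, then scan bin boundaries for the best gain. This is a simplified illustration using a squared-gradient gain; real GBM libraries also track Hessians, apply regularization, and run this across all features and candidate splits in parallel on the GPU:

```python
def best_histogram_split(bin_ids, gradients, n_bins):
    """Find the bin boundary with the highest variance-reduction-style gain."""
    # 1. Histogram construction: gradient sum and sample count per bin.
    grad_sum = [0.0] * n_bins
    count = [0] * n_bins
    for b, g in zip(bin_ids, gradients):
        grad_sum[b] += g
        count[b] += 1

    total_g, total_n = sum(grad_sum), sum(count)
    best_gain, best_bin = 0.0, None
    left_g, left_n = 0.0, 0
    # 2. Split search: scan prefix sums over the bin boundaries.
    for b in range(n_bins - 1):
        left_g += grad_sum[b]
        left_n += count[b]
        right_g, right_n = total_g - left_g, total_n - left_n
        if left_n == 0 or right_n == 0:
            continue
        gain = left_g**2 / left_n + right_g**2 / right_n - total_g**2 / total_n
        if gain > best_gain:
            best_gain, best_bin = gain, b
    return best_bin, best_gain
```

Because each bin's accumulation and each boundary's gain are independent, both loops map naturally onto thousands of GPU threads, which is where the speedup comes from.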
Introducing the AMD Network Operator v1.0.0: Simplifying High-Performance Networking for AMD Platforms
- 08 January 2026
In this blog, you will learn how the AMD Network Operator simplifies high-performance networking for AMD GPU clusters, automates NIC discovery and configuration, supports RDMA/RoCE workloads, and provides real-time monitoring to keep your AI/ML and HPC jobs running efficiently.
Bridging the Last Mile: Deploying Hummingbird-XT for Efficient Video Generation on AMD Consumer-Grade Platforms
- 08 January 2026
The first three authors (Isobe, Cui, and Ge) contributed equally to this work.
High-Resolution Weather Forecasting with StormCast on AMD Instinct GPU Accelerators
- 07 January 2026
The traditional approach to numerical weather prediction is based on propagating a known atmospheric state forward in time in short steps using systems of partial differential equations directly obtained from physical considerations. A new approach is to use machine learning methods to directly proceed to a later state in one large step, typically moving forward multiple hours in a single jump.
Breaking the Accuracy-Speed Barrier: How MXFP4/6 Quantization Revolutionizes Image and Video Generation
- 07 January 2026
This blog introduces MXFP4 and MXFP6, the newly supported data types on AMD Instinct™ MI350 Series GPUs, and demonstrates their remarkable quality in image and video generation tasks. By reading this blog, you will discover how these low-bit formats can break the accuracy-speed tradeoff, boosting both efficiency and performance in generative AI workflows.
ROCm MaxText Testing — Decoupled (Offline) and Cloud-Integrated Modes
- 06 January 2026
In this blog, you will learn how to run MaxText unit tests on AMD ROCm GPUs in two complementary modes: offline (decoupled) and fully cloud-integrated. By the end, you will know when to use each mode, how to interpret the results, and how to fold them into your CI and debugging workflows.
ROCm Fork of MaxText: Structure and Strategy
- 06 January 2026
In this blog you will explore how the ROCm fork of MaxText is structured and how that structure supports ROCm and fully offline, decoupled workflows across platforms.
SparK: Query-Aware Unstructured Sparsity with Recoverable KV Cache Channel Pruning
- 02 January 2026
In this blog we will discuss SparK, a training-free, plug-and-play method for KV cache compression in large language models (LLMs). By addressing the overlooked redundancy in feature channels and employing a “prune-and-recover” strategy, SparK reduces KV cache storage by over 30% compared to traditional methods while maintaining model accuracy. It offers a robust solution for long-context inference, establishing a new perspective on unstructured sparsity.
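The prune-and-recover idea can be illustrated on a single cached vector: drop low-magnitude feature channels to save storage, but keep their indices so a full-dimensional vector can be reconstructed at attention time. This is a toy sketch of the general pattern, not the SparK algorithm itself (names and the keep ratio are assumptions):

```python
def prune_channels(vec, keep_ratio=0.7):
    """Keep the highest-magnitude channels; return (values, indices, dim)."""
    k = max(1, int(len(vec) * keep_ratio))
    # Rank channels by magnitude, keep the top k, restore index order.
    idx = sorted(range(len(vec)), key=lambda i: abs(vec[i]), reverse=True)[:k]
    idx.sort()
    return [vec[i] for i in idx], idx, len(vec)

def recover(values, indices, dim):
    """Reconstruct the full vector, filling pruned channels with zeros."""
    out = [0.0] * dim
    for v, i in zip(values, indices):
        out[i] = v
    return out
```

Storing only the surviving values plus their indices is what yields the cache savings, while the recovery step keeps downstream attention shapes unchanged.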
Accelerating Multimodal Inference in vLLM: The One-Line Optimization for Large Multimodal Models
- 02 January 2026
Deploying multimodal models like Qwen3-VL or InternVL at scale reveals a hidden bottleneck. While Tensor Parallelism (TP) is essential for massive language decoders, it is often overkill for vision encoders. These encoders are typically small, often just 1-5% of total model size, so there is limited compute benefit from sharding them. However, they still incur expensive all-reduce communication costs after every single layer.
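The tradeoff can be made concrete with a toy cost model (all numbers are hypothetical, chosen only to illustrate the shape of the argument): sharding a small encoder divides its already-small compute but adds an all-reduce after every layer, so full replication can come out ahead.

```python
def encoder_step_cost(compute_cost, n_layers, allreduce_cost, tp_size):
    """Per-step cost of an encoder under tensor parallelism.

    Sharding divides compute by tp_size but adds one all-reduce per
    layer; tp_size=1 models full replication with no communication.
    All costs are in the same arbitrary units.
    """
    comm = 0.0 if tp_size == 1 else n_layers * allreduce_cost
    return compute_cost / tp_size + comm

# A small encoder (cost 10) with 24 layers: replication wins once the
# per-layer all-reduce outweighs the compute saved by sharding.
replicated = encoder_step_cost(10.0, 24, 0.5, tp_size=1)   # 10.0
sharded    = encoder_step_cost(10.0, 24, 0.5, tp_size=8)   # 1.25 + 12.0
```

With these numbers the sharded encoder costs 13.25 units against 10.0 for replication, which is exactly why treating a 1-5% vision encoder like the language decoder can backfire.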