Posted in 2026
Deploy and Customize AMD Solution Blueprints
- 02 April 2026
AMD Solution Blueprints are ready-to-deploy, customizable reference applications built with AMD Inference Microservices (AIMs). They offer a microservice solution for a range of use cases, from standard chat interfaces to agentic frameworks, serving as both starting points for development and example implementations.
Reproducing the AMD MLPerf Inference v6.0 Submission Result
- 01 April 2026
MLPerf Inference v6.0 marked AMD’s fourth round of submissions to MLPerf Inference. This blog provides a step-by-step guide to reproducing AMD’s results on different vendor systems.
AMD Instinct™ GPUs MLPerf Inference v6.0 Submission
- 01 April 2026
The results for the MLPerf Inference v6.0 benchmark were released on April 1, 2026. In this round, AMD showcased the performance of the MI355X system, as well as the capability and versatility of the ROCm software stack.
Training a Robotic Arm Using MuJoCo and JAX on AMD Hardware with ROCm™
- 31 March 2026
Training a robotic arm to pick up an object and place it somewhere else may sound straightforward, but teaching a robot to do this reliably in the real world is one of the harder problems in robotics. Traditional approaches rely on hand-tuned motion planning and carefully scripted control logic, which is brittle and time-consuming to maintain as environments change.
Leveraging AMD AI Workbench to Scale LLM Inference for Optimal Resource Utilization
- 31 March 2026
Explore how autoscaling with AMD Inference Microservices (AIMs) and AMD AI Workbench can automatically scale your resources in response to shifting AI workload demand. AI inference can be computationally intensive, with resource requirements that vary depending on traffic, e.g., the number of inference requests your workload receives at any given time. Autoscaling addresses this by scaling resources up during peak traffic to maintain performance, and scaling them back down during quieter periods to reduce cost and resource consumption.
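The scale-up/scale-down decision described above boils down to a proportional replica count. A minimal sketch of such a rule follows (the function name, target rate, and bounds are illustrative assumptions, not the actual AIM or AI Workbench autoscaler logic):

```python
import math

def desired_replicas(requests_per_sec: float,
                     target_per_replica: float,
                     min_replicas: int = 1,
                     max_replicas: int = 8) -> int:
    """Scale replicas proportionally to load, clamped to [min, max]."""
    if target_per_replica <= 0:
        raise ValueError("target_per_replica must be positive")
    needed = math.ceil(requests_per_sec / target_per_replica)
    return max(min_replicas, min(max_replicas, needed))
```

For example, at 250 requests/s with a target of 100 requests/s per replica, the rule asks for 3 replicas; at zero traffic it falls back to the configured minimum.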
Programming Tensor Descriptors in Composable Kernel (CK)
- 25 March 2026
Writing efficient GPU kernels requires more than knowing the API—it demands a deep understanding of the underlying concepts, from GPU architecture to low-level programming patterns. This blog series demystifies GPU kernel programming on AMD GPUs by breaking down common kernels into their fundamental building blocks. Rather than treating GPU programming as a black box, each blog focuses on a specific concept, starting from first principles and building up to complete implementations with simple, insightful example code. In this blog, you will learn one of the most fundamental concepts in Composable Kernel (CK): the TensorDescriptor—a powerful abstraction for managing multi-dimensional data layouts and transformations. By the end of this series, you will be able to not only understand existing GPU kernels but also design and optimize your own.
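As a taste of the concept, the core job of a tensor descriptor, mapping a multi-dimensional index to a linear memory offset through lengths and strides, can be sketched in a few lines of Python (a conceptual illustration only; CK's actual TensorDescriptor is a C++ compile-time abstraction with a different API):

```python
class TensorDescriptor:
    """Maps an N-D index to a flat memory offset via per-dimension strides."""
    def __init__(self, lengths, strides):
        assert len(lengths) == len(strides)
        self.lengths = lengths
        self.strides = strides

    def offset(self, index):
        # Each coordinate must lie within its dimension's length.
        assert all(0 <= i < n for i, n in zip(index, self.lengths))
        return sum(i * s for i, s in zip(index, self.strides))

# A 4x6 row-major matrix: stride 6 along rows, 1 along columns.
row_major = TensorDescriptor([4, 6], [6, 1])
# The same memory viewed column-major: a layout transform, not a copy.
col_major = TensorDescriptor([6, 4], [1, 6])
```

Swapping the strides reinterprets the same buffer under a different layout, which is the essence of the descriptor transformations the series builds on.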
GROMACS on AMD Instinct GPUs: A Complete Build Guide
- 24 March 2026
Molecular dynamics simulations power breakthroughs in drug discovery, materials science, and computational biology. GROMACS stands as one of the most widely used molecular dynamics engines, and pairing it with AMD’s latest GPU accelerators unlocks exceptional simulation throughput. This guide walks you through installing a complete GROMACS stack with OpenMPI support on AMD MI300X and MI355X systems — whether you’re deploying on bare metal or in containers.
Engineering Qwen-VL for Production: Vision Module Architecture and Optimization Practices
- 24 March 2026
Vision–language models (VLMs) have rapidly evolved from research prototypes into foundational components of modern AI systems, enabling unified reasoning over images, videos, and text. As model scale and application complexity increase, the focus of VLM development has shifted from isolated benchmark performance toward architectural efficiency, multimodal alignment, and production readiness. Within this landscape, Qwen-VL stands out as a practical and extensible vision–language model that emphasizes modular visual encoding, flexible multimodal integration, and scalability in real-world deployments. Rather than treating vision as a peripheral add-on, Qwen-VL adopts a tightly integrated design that allows visual representations to participate deeply in language reasoning, making it particularly well suited for both large-scale inference and domain-specific customization.
Accelerating Kimi-K2.5 on AMD Instinct™ MI300X: Optimizing Fused MoE with FlyDSL
- 24 March 2026
With the recent surge in popularity of OpenClaw [1], its officially recommended model, Kimi-K2.5 [2], has taken the AI community by storm. As developers and researchers flock to this powerful Mixture-of-Experts (MoE) LLM, the need for high-performance inference on cutting-edge hardware has never been more critical.
Edge-to-Cloud Robotics with AMD ROCm: From Data Collection to Real-Time Inference
- 23 March 2026
This blog walks through a full edge-to-cloud robotics AI solution, built entirely on the AMD ecosystem and the Hugging Face LeRobot framework. In case you are not familiar, LeRobot is an open source platform from Hugging Face that provides pre-trained models, datasets, and tools for real-world robotics using PyTorch.
AMD Device Metrics Exporter v1.4.2: Enhanced Observability, Deeper RAS Insights, and Smarter GPU Telemetry for Modern HPC & AI Clusters
- 23 March 2026
Modern GPU‑accelerated systems—whether powering massive AI training workloads or tightly scheduled HPC environments—depend heavily on high‑quality telemetry. Understanding how each GPU behaves under load, how often it hits power or thermal boundaries, and how reliably the hardware performs is central to maintaining performance, diagnosing failures, and tuning systems at scale.
hipBLASLt Online GEMM Tuning
- 19 March 2026
This blog post introduces the integration of hipBLASLt Online GEMM Tuning into LLM frameworks, illustrated through an example implementation of RTP-LLM. Developed by the AMD Quark Team, hipBLASLt Online Tuning provides a user-friendly approach to improving GEMM performance by enabling runtime tuning without requiring additional offline tuning steps.
Utilizing AMD Instinct GPU Accelerators for Weather and Precipitation Forecasting with NeuralGCM
- 19 March 2026
In recent years, the landscape of weather forecasting has evolved tremendously, employing cutting-edge AI technologies to enhance prediction accuracy and speed. In previous blogs, we have demonstrated how to run several state-of-the-art AI weather forecasting models, such as Pangu-Weather, GenCast, and Aurora. Following that, this blog focuses on emerging trends in weather forecasting models, particularly the innovative NeuralGCM, which melds the strengths of General Circulation Models (GCMs) and Machine Learning (ML). We will briefly outline the design of NeuralGCM and its hybrid approach for weather and precipitation forecasting. We will then go through the required environments, installation steps, and the inference process for generating forecasts and creating plots to compare the outputs to the ground truth provided by ERA5 data.
Multi-Node Distributed Inference for Diffusion Models with xDiT
- 18 March 2026
The first two authors (Lehtiranta, Kemppi) contributed equally to this work.
GROMACS Performance on AMD Instinct MI355X
- 13 March 2026
Are you planning a hardware upgrade for your molecular dynamics workflows? In this blog, we benchmark GROMACS on AMD’s latest Instinct MI355X GPU and compare it head-to-head with the MI300X, demonstrating significant throughput improvements that accelerate time-to-results for life-science research. You will see exactly how much faster MI355X runs the standard ADH dodec benchmark across 1 to 8 GPUs. Use these results to make informed decisions about your next HPC deployment.
FP8 GEMM Optimization on AMD CDNA™4 Architecture
- 10 March 2026
This blog post continues our previous blog Matrix Core Programming on AMD CDNA™3 and CDNA™4 Architecture, which introduced Matrix Cores and demonstrated how to use them in HIP kernels.
Getting Started with ComfyUI on AMD Radeon™ RX 9000 Series GPUs
- 09 March 2026
ComfyUI has become a widely adopted and versatile node-based interface for Stable Diffusion and other generative AI models, gaining significant traction within the AI content creation community. Unlike traditional web-based interfaces, ComfyUI provides a node-based workflow system that gives users complete control over their image and video generation pipelines. Its modular architecture allows for complex workflows involving multiple models, LoRAs, ControlNets, and custom processing steps.
Agentic Diagnosis for LLM Training at Scale
- 09 March 2026
In MaxText-Slurm: Production-Grade LLM Training with Built-In Observability, we introduced MaxText-Slurm — an open-source launch system and observability stack for running MaxText LLM training on AMD Instinct GPU clusters. We showed how a unified Prometheus time-series database (TSDB) collects GPU, host, network, and training metrics into a single queryable store, persisted to disk so that no data is lost even if the job crashes.
HPC Coding Agent - Part 3: MCP Tool for Profiling
- 06 March 2026
In this blog, we build an AI agent specialized in profiling and optimizing GPU-accelerated applications within High-Performance Computing (HPC) environments. Using open-source tools, we create a state-of-the-art agent and enhance its profiling capabilities through a custom Model Context Protocol (MCP) server. This server provides the agent with tools to leverage AMD’s profiling utilities for analyzing application performance on AMD GPUs.
Fine-Tuning AI Surrogate Models for Physics Simulations with Walrus on AMD Instinct GPU Accelerators
- 06 March 2026
Physics simulations are used for studying complex systems and are essential where experiments are difficult, expensive, or impossible. In our context, a simulation means numerically solving mathematical equations that are believed to describe a physical system and evolving them forward in time on a computer. They enable controlled exploration of physical behavior for science and engineering, but at a high computational cost, which in most cases increases rapidly with scale. Our focus is on continuum dynamics, where the system is represented by fields such as density, velocity, or temperature, defined on a grid and evolving over time. High-resolution physics simulations are slow to run, sensitive to numerical error and impractical for large parameter spaces. Surrogate models address these limitations by learning to approximate simulation dynamics directly from data. Once trained, they can produce fast predictions at a fraction of the cost, giving researchers the ability to rapidly explore parameter space and generate long rollouts.
Ensemble High-Resolution Weather Forecasting on AMD Instinct GPU Accelerators
- 06 March 2026
Weather prediction is fraught with uncertainty, as is the inference of any real-world phenomenon dependent on physical observations. The consequence is that any estimated current state of the atmosphere, as well as any forecast, carries a level of uncertainty. As such, any weather forecasting model, whether AI or traditional, needs to produce reasonable outputs despite the inherent uncertainty of inputs, and, if possible, quantify the uncertainty of the outputs for the user in some practical fashion.
HPC Coding Agent - Part 2: An MCP Tool for Code Optimization with OpenEvolve
- 04 March 2026
Large language models (LLMs) and LLM-driven agents (AI agents) are trained on massive amounts of data, a considerable portion of which consists of code, and both models and agentic coding services are developed specifically for the purpose of coding. For users who want to optimize their code for certain purposes, for example runtime or memory efficiency, LLMs may produce plausible solutions, but these are often not optimal.
Streamlining Recommendation Model Training on AMD Instinct™ GPUs
- 02 March 2026
Recommendation model training and inference workloads represent a significant portion of computational requirements across industries including e-commerce, social media and content streaming platforms. Unlike LLMs, recommendation models result in complex and often imbalanced communication across GPUs, along with a higher load on the CPU-GPU interconnect. The ROCm training docker [1] now includes essential libraries for recommendation model training. This blog demonstrates the functionality and ease of training recommendation models using ROCm, along with suggestions for improved configuration of these workloads. We also highlight the inherent benefits of the large HBM size on AMD Instinct™ GPUs for recommendation workloads.
MaxText-Slurm: Production-Grade LLM Training with Built-In Observability
- 02 March 2026
Training large language models (LLMs) at scale on GPU clusters is not just a compute problem — it is an operations problem. Launching multi-node distributed training, keeping it running reliably, and diagnosing failures when they happen all require tooling that most training frameworks do not provide. MaxText-Slurm is an open-source launch system and observability stack that bridges this gap for MaxText on AMD Instinct GPU clusters managed by Slurm.
Exploring Use Cases for Scalable AI: Implementing Ray with ROCm 7 Support for Efficient ML Workflows
- 27 February 2026
This blog builds on insights from our previous blog post, which introduced Ray 2.48.0.post0 running on ROCm 6.2 and demonstrated Reinforcement Learning from Human Feedback (RLHF) with verl 0.3.0.post0 and vLLM 0.6.4 on AMD GPUs. In this follow‑up, we introduce Ray 2.51.1 with ROCm 7.0.0, verl 0.6.0, and vLLM 0.11.0.dev, highlighting the new performance benefits and capabilities for large‑scale RLHF workloads.
PyTorch Offline Tuning with TunableOp
- 24 February 2026
In an earlier blog post, we explored how PyTorch TunableOp can potentially accelerate models through online tuning, where PyTorch benchmarks and selects optimal BLAS kernels during model execution. While online tuning is effective, it introduces overhead due to the time needed to execute the ML model end to end. If this is done once, the overhead may be acceptable, but for repeated tuning it may be cost-prohibitive to keep re-running the model.
JAX-AITER: Bringing AMD’s Optimized AI Kernels to JAX on ROCm™
- 24 February 2026
If you’re building large models in JAX on AMD GPUs, you want fast, reliable kernels without spending weeks tuning them yourself. That’s exactly the need that led us to create JAX-AITER.
Getting Started with AMD Resource Manager: Efficient Sharing of AMD Instinct™ GPUs for R&D Teams and AI Practitioners
- 24 February 2026
In this blog, you will learn how to use AMD Resource Manager and its components for centralized AI infrastructure governance. It’s part of the AMD Enterprise AI Suite, a full-stack solution for developing, deploying and running AI workloads on a Kubernetes platform designed to support AMD compute. The AMD Resource Manager provides a user-friendly graphical user interface (GUI) and Command Line Interface (CLI) with a unified control plane that simplifies tasks such as managing compute clusters, user access, monitoring resource utilization, and allocating the right compute quotas to the right projects.
Primus-Pipeline: A More Flexible and Scalable Pipeline Parallelism Implementation
- 23 February 2026
FlyDSL: Expert GPU Kernel Development with the Ease of MLIR Python Native DSL on AMD GPUs
- 20 February 2026
The AMD ROCm™ software ecosystem continues to grow rapidly as developers build new kernels, compilers, and AI frameworks optimized for AMD GPUs. As workloads become more complex and the demand for both performance and agility increases, a clear need has emerged for a modern, flexible, and open GPU kernel authoring framework.
Introducing hipThreads: A C++ - Style Concurrency Library for AMD GPUs
- 19 February 2026
In this blog, you will learn how to accelerate C++ code developed for the CPU to run on AMD GPUs using hipThreads by incrementally porting familiar std::thread patterns to GPU-resident hip::thread code. We walk through a step-by-step SAXPY example, explain key concepts like persistent threads and fibers, and share real performance results to help you evaluate when this model fits your workload.
Unlocking Sparse Acceleration on AMD GPUs with hipSPARSELt
- 17 February 2026
Sparse computation is a cornerstone of modern AI acceleration. As models like LLaMA and DINOv2 ViT-L scale in size and complexity, the demand for efficient matrix operations becomes increasingly critical. To address this, semi-structured sparsity, also known as the 2:4 structured sparsity pattern, has emerged as a powerful optimization technique.
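The 2:4 pattern means that within every contiguous group of four elements along a row, at most two are nonzero, which is what lets the hardware skip half the multiplies. A quick validity check can be written as follows (an illustrative helper, not part of hipSPARSELt):

```python
def is_2_to_4_sparse(row, group=4, max_nonzero=2):
    """Check that every group of `group` elements has at most `max_nonzero` nonzeros."""
    if len(row) % group != 0:
        return False  # rows must tile evenly into groups of four
    return all(
        sum(1 for x in row[i:i + group] if x != 0) <= max_nonzero
        for i in range(0, len(row), group)
    )
```

A row like `[1, 0, 2, 0, 0, 0, 3, 4]` satisfies the constraint, while `[1, 2, 3, 0]` does not, since its single group holds three nonzeros.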
Advanced MXFP4 Quantization: Combining Fine-Tuned Rotations with SmoothQuant for Near-Lossless Compression
- 17 February 2026
As language models continue to grow in popularity, reducing the cost of inference and accelerating model serving have become key challenges. Quantization offers a powerful solution by reducing the model size and leveraging inexpensive math operations, for example, using low-bitwidth formats like OCP MXFP4 (4.25 bits) available in AMD Instinct MI350X and MI355X accelerators.
Adaptive Top-K Selection: Eliminating Performance Cliffs Across All K Values on AMD GPUs
- 17 February 2026
Top-K selection is critical for LLMs and RAG workloads, yet standard Radix Sort implementations often suffer from performance cliffs at small K values due to fixed initialization overheads. In our AITER library (introduced in our previous blog [1]), we originally utilized an 11-bit radix sort for Top-K selection. While this approach excels at scale, we identified a critical efficiency gap for the lightweight filtering often required during modern inference.
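The underlying tradeoff is algorithmic: heap-style selection does roughly n·log(k) work and wins at small k, while sort-based approaches amortize their fixed costs better as k grows. A CPU-side sketch of dispatching on k (the threshold value is a made-up illustration, not the AITER kernel's heuristic):

```python
import heapq

def adaptive_topk(values, k, small_k_threshold=64):
    """Return the k largest values, picking the algorithm based on k."""
    if k >= len(values):
        return sorted(values, reverse=True)
    if k <= small_k_threshold:
        # Heap-based selection: cheap for small k, avoids a full sort.
        return heapq.nlargest(k, values)
    # Large k: a single sort is simpler and amortizes well.
    return sorted(values, reverse=True)[:k]
```

The GPU version faces the same cliff in reverse: a fixed-cost radix sort is superb at scale but pays its full initialization overhead even when k is tiny, which is the gap the adaptive approach closes.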
Elevate Your LLM Inference: Autoscaling with Ray, ROCm 7.0.0, and SkyPilot
- 13 February 2026
This blog explores autoscaling of inference workloads in Ray Serve with a vLLM backend on AMD Instinct™ GPUs for large language models (LLMs). Furthermore, you will learn how to scale beyond a single cluster using SkyPilot, which enables multicloud scaling for Ray Serve. Combined with the AMD ROCm™ software platform, this creates a unified, cloud-agnostic platform that scales distributed LLM inference from single-GPU to multi-cluster deployments.
Reinforcement Learning from Human Feedback on AMD GPUs with verl and ROCm 7.0.0
- 12 February 2026
In our previous blog post, we introduced Volcano Engine Reinforcement Learning for LLMs (verl) 0.3.0.post0 with ROCm 6.2 and vLLM 0.6.4. In this blog post, we will provide you with an overview of verl 0.6.0 with ROCm 7.0.0 and vLLM 0.11.0.dev and its benefits for large-scale reinforcement learning from human feedback (RLHF). You will also learn about the modifications made to optimize verl performance on AMD Instinct™ MI300X GPUs. Next, you will walk through building the Docker image on your system, along with training scripts for single-node and multi-node setups. Lastly, we provide you with verl performance results, focusing on throughput and convergence accuracy achieved on AMD Instinct MI300X GPUs. Follow this guide to get started with verl on AMD Instinct GPUs and accelerate your RLHF training with ROCm-optimized performance.
Solution Blueprints: Accelerating AI Deployment with AMD Enterprise AI
- 11 February 2026
AMD Enterprise AI Suite standardizes the inference layer with AMD Inference Microservices (AIMs), a set of containers for optimized model serving on AMD Instinct™ GPUs with validated profiles and OpenAI-compatible APIs. However, production-grade agentic and generative AI applications need more than inference endpoints. You need document loaders, embedding pipelines, vector databases, RAG logic, agent orchestration, and user interfaces. These components need to be wired together with proper Kubernetes resource definitions, GPU allocation, service discovery, and configuration management. This blog walks through the technical implementation of Solution Blueprints: how they’re structured, how they use Helm application charts for code reuse, and the patterns they demonstrate for multi-container orchestration. While the Enterprise AI Suite Overview covers the platform and the AIMs blog covers inference, this post focuses on application architecture and deployment patterns.
Digital Twins on AMD: Building Robotic Simulations Using Edge AI PCs
- 09 February 2026
Digital twins are becoming a core tool in robotics, automation, and intelligent systems. They provide a virtual representation of a physical system, allowing developers to validate robot behaviors, test motion strategies, and generate datasets before deploying anything in the real world.
Building Robotics Applications with Ryzen AI and ROS 2
- 09 February 2026
This blog showcases how to deploy power-efficient Ryzen AI perception models with ROS 2 - the Robot Operating System. We utilize the Ryzen AI Max+ 395 (Strix-Halo) platform, which is equipped with an efficient Ryzen AI NPU and iGPU. The Ryzen AI CVML Library is used to deploy supported models efficiently on the Ryzen AI platform. All of the code is available on GitHub in the AMD Ryzers repository and was originally presented at ROSCon’25.
Resilient Large-Scale Training: Integrating TorchFT with TorchTitan on AMD GPUs
- 08 February 2026
Training large AI models on AMD GPUs demands unwavering stability and robust fault-tolerance capabilities at cluster scale. Yet today’s ROCm-based multi-node GPU deployments often rely on brittle checkpoint-and-restart mechanisms to recover from failures. This approach wastes precious compute cycles and slows down training as model sizes and cluster scales grow. To address these challenges, we integrated PyTorch’s native fault-tolerance framework—TorchFT—with the TorchTitan training framework on AMD’s Primus-SaFE Kubernetes platform, achieving resilient, checkpoint-less training at hundred-GPU scale. This blog builds upon our previous work on the Primus ecosystem—for background on the platform architecture, see our earlier posts on Primus-SaFE, the Primus training framework, and training large models with Primus.
Accelerating Graph Layout with AI and ROCm on AMD GPUs
- 06 February 2026
Learn how easy it is to implement established graph algorithms, and deploy them on AMD GPUs with immediate performance improvements, using AI as a coding partner!
Micro-World: First AMD Open-Source World Models for Interactive Video Generation
- 05 February 2026
World models aim to simulate aspects of the real world, enabling more effective training and exploration of AI agents and ultimately paving the way toward richer forms of digital life. Games can be viewed as another form of world simulation, and their data is relatively easy to collect and annotate, making them a natural playground for building and studying world models. GameNGen [1] has demonstrated the potential of this direction, while works such as GameFactory [2], Matrix-Game [3], and Hunyuan-GameCraft [4] further showcase strong performance in game-oriented world modeling. However, these projects are either fully closed-source or release only partial components (typically inference-only), which limits reproducibility and community-driven progress.
Foundations of Molecular Generation with GP-MoLFormer on AMD Instinct MI300X Accelerators
- 03 February 2026
Nearly every technological breakthrough we celebrate begins with a material that did not exist before someone imagined it. Modern computing rests on engineered semiconductors, energy storage depends on carefully designed electrolytes, and sustainable technologies increasingly rely on alternatives to scarce or environmentally costly rare earth elements. Designing such materials with specific properties at scale is one of the most challenging and consequential problems in science.
Debugging NaN Results in CK Tile GEMM: A rocgdb Detective Story
- 30 January 2026
When developing high-performance GPU kernels, subtle bugs can lead to catastrophic failures like NaN (Not-a-Number) outputs. This post chronicles our journey of debugging a tricky NaN issue in AMD’s Composable Kernel (CK) Tile GEMM implementation using rocgdb. What started as mysterious NaN outputs ended with discovering a single-character typo that corrupted the data distribution.
ROCm 7.2: Smarter, Faster, and More Scalable for Modern AI Workloads
- 22 January 2026
Modern AI workloads demand more than raw compute—they require a tightly integrated software stack that can extract maximum performance, scale efficiently across systems, and operate reliably in production environments. With the latest ROCm 7.2 release, we’re delivering a broad set of optimizations and software enhancements designed to improve developer productivity, runtime performance, and enterprise readiness.
Nitro-AR: A Compact AR Transformer for High-Quality Image Generation
- 22 January 2026
Recent years have witnessed remarkable progress in image generation, driven by two major modeling paradigms: diffusion-based models and autoregressive (AR) models. Building upon our previously released Nitro-E, a light-weight diffusion model for fast image synthesis, this blog explores a complementary direction, applying the architecture in an AR framework.
LLM Inference Optimization Using AMD GPU Partitioning
- 22 January 2026
As AI and HPC workloads grow in complexity and scale, there’s a rising need for precise GPU resource management, robust memory isolation, and efficient multi-tenant scheduling. AMD’s Instinct™ MI300 series addresses this by offering dynamic partitioning capabilities. These allow a single physical device to be segmented into multiple isolated partitions, each tailored to the needs of specific workloads. This flexibility is particularly beneficial for AI inference tasks, where different models or instances may require distinct resource allocations. Maximizing the utilization of GPU resources while ensuring that each workload operates within its own isolated environment is crucial for performance and reliability.
ROCm Becomes a First-Class Platform in the vLLM Ecosystem
- 21 January 2026
As the generative AI ecosystem matures, vLLM is embracing a multivendor approach, and the quality of support across hardware platforms has become a defining priority: developers expect consistent, high-performance behavior no matter which GPU they choose. Today, we are proud to announce a major realization of that vision: AMD ROCm™ is now a first-class platform in the vLLM ecosystem.
Quickly Developing Powerful Flash Attention Using TileLang on AMD Instinct MI300X GPU
- 20 January 2026
Against the backdrop of the rapid development of the AMD ROCm™ software ecosystem, the high barrier to operator development has long been a bottleneck. The emergence of TileLang provides developers with an efficient solution. As an emerging AI operator development framework, TileLang encapsulates low-level GPU details with concise syntax, enabling developers to fully tap into the computing potential of AMD GPUs without requiring in-depth knowledge of low-level languages such as HIP. The AMD Instinct™ MI300X, a flagship GPU for AI workloads, boasts ultra-high bandwidth memory and powerful compute units, but it requires adaptive high-performance operators to unleash its capabilities. In this blog, we will take Flash Attention, a key kernel in both LLM training and inference, as an example to fully demonstrate the development process based on TileLang on the MI300X, highlighting the dual benefits of efficiency and performance that TileLang brings to AMD operator development.
Deep Dive into Primus: High-Performance Training for Large Language Models
- 15 January 2026
Primus is the AMD unified training framework designed to deliver high-performance, scalable large language model (LLM) training across multiple backends – including TorchTitan and Megatron-LM. It provides a consistent CLI interface, while each backend ships with carefully optimized configurations for popular open-source models. These backend-specific presets ensure the best out-of-the-box performance on AMD Instinct™ GPUs. In this deep dive, we walk through the best practices for achieving peak performance when training dense LLMs on Primus.
Applying Compute Partitioning for Workloads on MI300X GPUs
- 14 January 2026
This blog explains how to use AMD GPU compute partitioning to increase throughput and utilization, and reduce time-to-results, for two different types of workloads.
Reimagining GPU Allocation in Kubernetes: Introducing the AMD GPU DRA Driver
- 13 January 2026
In this blog, you’ll learn how Kubernetes’ new Dynamic Resource Allocation (DRA) framework and the AMD GPU DRA Driver turn GPUs into first-class, attribute-aware resources. We’ll walk through how to publish AMD Instinct GPUs via ResourceSlices, request specific models and partition profiles with declarative ResourceClaims, and observe allocations through Kubernetes-native lifecycle objects, so you can simplify cluster operations compared to traditional Device Plugin–based setups.
Installing AMD HIP-Enabled GROMACS on HPC Systems: A LUMI Supercomputer Case Study
- 12 January 2026
Running molecular dynamics (MD) simulations efficiently is critical for accelerating scientific discovery in many life science use cases, e.g., drug discovery. GROMACS is a widely used, GPU-accelerated molecular dynamics engine powering many life science workflows, but its performance can vary significantly depending on the installation method and hardware configuration. For broader context on GROMACS applications in drug design, see recent research on GROMACS in cloud environments for alchemical drug design.
Athena-PRM: Enhancing Multimodal Reasoning with Data-Efficient Process Reward Models
- 12 January 2026
This blog introduces Athena-PRM, a multimodal Process Reward Model (PRM) designed to evaluate the reward score for each step in solving complex reasoning problems. To efficiently generate high-quality process-labeled data, we leverage prediction consistency between weak and strong completers as a criterion for identifying reliable process labels. We also develop two effective strategies to improve the performance of PRMs: ORM initialization and up-sampling for negative data.
Using Gradient Boosting Libraries on MI300X for Financial Risk Prediction
- 08 January 2026
In the world of machine learning, the choice of hardware can significantly impact the performance and efficiency of model training and prediction. Gradient Boosting Machines (GBMs) benefit greatly from GPU parallelization in several key algorithmic steps involving independent, repetitive computations. The most substantial speedup comes from histogram construction and best split searching, as these can be executed in parallel across features and candidate splits using thousands of GPU cores, vastly accelerating tree building. Additionally, the calculation of gradients and Hessians for each data point is naturally parallelizable and well suited to GPU architectures. Other operations—such as leaf value updates, data preprocessing (like quantization and normalization), and batch predictions—can also be distributed efficiently across GPU threads. By exploiting parallelism in these stages, GPUs dramatically reduce training and prediction time for GBMs, making them ideal for large datasets or scenarios where quick model iteration is crucial.
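The histogram construction and split search described above can be sketched for a single feature: accumulate gradient statistics per bin, then scan bin boundaries for the best gain. This is a simplified illustration using a squared-gradient gain; real GBM libraries also track Hessians, apply regularization, and run this across all features and candidate splits in parallel on the GPU:

```python
def best_histogram_split(bin_ids, gradients, n_bins):
    """Find the bin boundary with the highest variance-reduction-style gain."""
    # 1. Histogram construction: gradient sum and sample count per bin.
    grad_sum = [0.0] * n_bins
    count = [0] * n_bins
    for b, g in zip(bin_ids, gradients):
        grad_sum[b] += g
        count[b] += 1

    total_g, total_n = sum(grad_sum), sum(count)
    best_gain, best_bin = 0.0, None
    left_g, left_n = 0.0, 0
    # 2. Split search: scan prefix sums over the bin boundaries.
    for b in range(n_bins - 1):
        left_g += grad_sum[b]
        left_n += count[b]
        right_g, right_n = total_g - left_g, total_n - left_n
        if left_n == 0 or right_n == 0:
            continue
        gain = left_g**2 / left_n + right_g**2 / right_n - total_g**2 / total_n
        if gain > best_gain:
            best_gain, best_bin = gain, b
    return best_bin, best_gain
```

Because each bin's accumulation and each boundary's gain are independent, both loops map naturally onto thousands of GPU threads, which is where the speedup comes from.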
Introducing the AMD Network Operator v1.0.0: Simplifying High-Performance Networking for AMD Platforms
- 08 January 2026
In this blog, you will learn how the AMD Network Operator simplifies high-performance networking for AMD GPU clusters, automates NIC discovery and configuration, supports RDMA/RoCE workloads, and provides real-time monitoring to keep your AI/ML and HPC jobs running efficiently.
Bridging the Last Mile: Deploying Hummingbird-XT for Efficient Video Generation on AMD Consumer-Grade Platforms
- 08 January 2026
The first three authors (Isobe, Cui, and Ge) contributed equally to this work.
High-Resolution Weather Forecasting with StormCast on AMD Instinct GPU Accelerators
- 07 January 2026
The traditional approach to numerical weather prediction is based on propagating a known atmospheric state forward in time in short steps using systems of partial differential equations directly obtained from physical considerations. A new approach is to use machine learning methods to directly proceed to a later state in one large step, typically moving forward multiple hours in a single jump.
Breaking the Accuracy-Speed Barrier: How MXFP4/6 Quantization Revolutionizes Image and Video Generation
- 07 January 2026
This blog introduces MXFP4 and MXFP6, the newly supported data types on AMD Instinct™ MI350 Series GPUs, and demonstrates their remarkable quality in image and video generation tasks. By reading this blog, you will discover how these low-bit formats can break the accuracy-speed tradeoff, boosting both efficiency and performance in generative AI workflows.
ROCm MaxText Testing — Decoupled (Offline) and Cloud-Integrated Modes
- 06 January 2026
In this blog, you will learn how to run MaxText unit tests on AMD ROCm GPUs in two complementary modes: offline (decoupled) and fully cloud-integrated. By the end, you will know when to use each mode, how to interpret the results, and how to fold them into your CI and debugging workflows.
ROCm Fork of MaxText: Structure and Strategy
- 06 January 2026
In this blog you will explore how the ROCm fork of MaxText is structured and how that structure supports ROCm and fully offline, decoupled workflows across platforms.
SparK: Query-Aware Unstructured Sparsity with Recoverable KV Cache Channel Pruning
- 02 January 2026
In this blog we will discuss SparK, a training-free, plug-and-play method for KV cache compression in large language models (LLMs). By addressing the overlooked redundancy in feature channels and employing a “prune-and-recover” strategy, SparK reduces KV cache storage by over 30% compared to traditional methods while maintaining model accuracy. It offers a robust solution for long-context inference, establishing a new perspective on unstructured sparsity.
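The prune-and-recover idea can be illustrated on a single cached vector: drop low-magnitude feature channels to save storage, but keep their indices so a full-dimensional vector can be reconstructed at attention time. This is a toy sketch of the general pattern, not the SparK algorithm itself (names and the keep ratio are assumptions):

```python
def prune_channels(vec, keep_ratio=0.7):
    """Keep the highest-magnitude channels; return (values, indices, dim)."""
    k = max(1, int(len(vec) * keep_ratio))
    # Rank channels by magnitude, keep the top k, restore index order.
    idx = sorted(range(len(vec)), key=lambda i: abs(vec[i]), reverse=True)[:k]
    idx.sort()
    return [vec[i] for i in idx], idx, len(vec)

def recover(values, indices, dim):
    """Reconstruct the full vector, filling pruned channels with zeros."""
    out = [0.0] * dim
    for v, i in zip(values, indices):
        out[i] = v
    return out
```

Storing only the surviving values plus their indices is what yields the cache savings, while the recovery step keeps downstream attention shapes unchanged.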
Accelerating Multimodal Inference in vLLM: The One-Line Optimization for Large Multimodal Models
- 02 January 2026
Deploying multimodal models like Qwen3-VL or InternVL at scale reveals a hidden bottleneck. While Tensor Parallelism (TP) is essential for massive language decoders, it is often overkill for vision encoders. These encoders are typically small, often just 1-5% of total model size, so there is limited compute benefit from sharding them. However, they still incur expensive all-reduce communication costs after every single layer.
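The tradeoff can be made concrete with a toy cost model (all numbers are hypothetical, chosen only to illustrate the shape of the argument): sharding a small encoder divides its already-small compute but adds an all-reduce after every layer, so full replication can come out ahead.

```python
def encoder_step_cost(compute_cost, n_layers, allreduce_cost, tp_size):
    """Per-step cost of an encoder under tensor parallelism.

    Sharding divides compute by tp_size but adds one all-reduce per
    layer; tp_size=1 models full replication with no communication.
    All costs are in the same arbitrary units.
    """
    comm = 0.0 if tp_size == 1 else n_layers * allreduce_cost
    return compute_cost / tp_size + comm

# A small encoder (cost 10) with 24 layers: replication wins once the
# per-layer all-reduce outweighs the compute saved by sharding.
replicated = encoder_step_cost(10.0, 24, 0.5, tp_size=1)   # 10.0
sharded    = encoder_step_cost(10.0, 24, 0.5, tp_size=8)   # 1.25 + 12.0
```

With these numbers the sharded encoder costs 13.25 units against 10.0 for replication, which is exactly why treating a 1-5% vision encoder like the language decoder can backfire.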