All Posts

Running ComfyUI in Windows with ROCm on WSL

07 August 2025

If you have an AMD Radeon™ graphics card supported by AMD ROCm™ software, you can unlock the full potential of your Windows PC with ROCm using the Windows Subsystem for Linux (WSL). Whether you’re loading large models like Stable Diffusion for local use or exploring creative AI applications, this setup offers unprecedented accessibility and power right at your fingertips. In this blog, we provide a step-by-step guide for configuring a WSL-based ROCm environment to run ComfyUI, including driver installation, dependency management, and PyTorch integration optimized for AMD GPUs (see figure 1).

Read more ...

Day 0 Developer Guide: Running the Latest Open Models from OpenAI on AMD AI Hardware

05 August 2025

OpenAI has officially released its open models: gpt-oss-120b and gpt-oss-20b. AMD now provides out-of-the-box, day 0 support for the latest open models from OpenAI, enabling developers to easily fine-tune and deploy across cloud to client environments using AMD hardware, the AMD ROCm™ and AMD Ryzen™ AI software stack, and seamless open source integrations. At AMD, we’re excited to announce day 0 support across our AI hardware, including our flagship AMD Instinct™ MI355X and MI300X GPUs, AMD Radeon™ AI PRO R9700 GPUs, and AMD Ryzen™ AI processors.

Read more ...

AMD Hummingbird Image to Video: A Lightweight Feedback-Driven Model for Efficient Image-to-Video Generation

03 August 2025

In this blog, we present AMD Hummingbird-I2V, a lightweight and feedback-driven image-to-video generation model designed to deliver high-quality results efficiently on resource-constrained hardware. Image-to-video (I2V) generation has become a significant challenge in computer vision, driven by the increasing demand for automated content creation in areas such as digital media production, animation, and advertising. While recent advancements have improved video quality, deploying I2V models in practical scenarios remains challenging due to their large model sizes and high inference costs. For example, DynamiCrafter [1] employs a 1.4B-parameter U-Net and typically requires 50 denoising steps to synthesize a single video. Step-Video [2], a DiT-based model with 30B parameters, takes approximately 30 minutes to generate one video on an AMD Instinct™ MI250 GPU, making it impractical for latency-sensitive or resource-constrained environments, such as gaming-oriented desktop GPUs. In this work, we present AMD Hummingbird-I2V, a compact and efficient diffusion-based I2V model designed for high-quality video synthesis under limited computational budgets. Hummingbird-I2V adopts a lightweight U-Net architecture with 0.9B parameters and a novel two-stage training strategy guided by reward-based feedback, resulting in substantial improvements in inference speed, model efficiency, and visual quality. To further improve output resolution with minimal overhead, we introduce a super-resolution module at the end of the pipeline. Additionally, we leverage ReNeg [3], an AMD-proposed reward-guided framework for learning negative embeddings via gradient descent, to further boost visual quality. As a result, Hummingbird-I2V can generate high-quality 4K video in just 11 seconds with 16 inference steps on an AMD Radeon™ RX 7900 XTX GPU.
Quantitative results on the VBench-I2V [4] benchmark show that Hummingbird-I2V achieves state-of-the-art performance among U-Net-based diffusion models and competitive results compared to significantly larger DiT-based models. We provide a detailed analysis of the model architecture, training methodology, and benchmark performance.

Read more ...

GEAK: Introducing Triton Kernel AI Agent & Evaluation Benchmarks

01 August 2025

At AMD, we are pioneering ways to accelerate AI development using AI itself, by generating accurate and efficient GPU kernels. Specifically, we are starting with the automatic generation of kernels in Triton, an open-source Python-like language for writing parallel programming code for GPUs. Today, AMD is excited to announce (a) Generating Efficient AI-centric Kernels (GEAK) for AMD GPUs, and results on (b) two Triton kernel evaluation benchmarks, where we show how AI agents can perform inference-time scaling with frontier LLMs to generate accurate and efficient kernels for AMD Instinct™ GPUs like MI250X and MI300X.

Read more ...

Graph Neural Networks at Scale: DGL with ROCm on AMD Hardware

31 July 2025

This blog introduces the Deep Graph Library (DGL) and explores its significance on AMD hardware for enabling scalable, performant graph neural networks.

Read more ...

Accelerating Parallel Programming in Python with Taichi Lang on AMD GPUs

31 July 2025

Taichi Lang is an open-source, imperative, parallel programming language for high-performance numerical computation. It is embedded in Python and uses just-in-time (JIT) compiler frameworks (e.g. LLVM) to offload the compute-intensive Python code to the native GPU or CPU instructions. The language has broad applications spanning real-time physical simulation, numerical computation, augmented reality, artificial intelligence, vision and robotics, visual effects in films and games, general-purpose computing, and much more [1].

Read more ...

Avoiding LDS Bank Conflicts on AMD GPUs Using CK-Tile Framework

25 July 2025

LDS bank conflict is a common performance bottleneck in GPU kernel development. Composable Kernel (CK-Tile), a kernel development framework for AMD GPUs, provides a framework-level solution for LDS bank conflicts. Composable Kernel for ROCm is used to build portable high-performance kernels for accelerating computing, e.g. HPC, DL and LLMs for training and inference workloads. In this blog, we show you how to analyze, detect, and eliminate LDS bank conflicts using CK-Tile, AMD’s composable GPU kernel framework. A GEMM kernel serves as a classic example for analyzing how threads interact with LDS during both reads and writes. Starting with a naïve memory layout, we evaluate bank conflict behavior, explore mitigation techniques such as padding, and ultimately demonstrate how an XOR-based swizzle transformation achieves a bank conflict-free design.
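As a rough illustration of the three layouts mentioned above, the following Python sketch counts how many threads hit the same bank when 32 threads read one column of a 32×32 tile of 4-byte elements. It is a simplified model, not CK-Tile code: the bank count of 32, the tile shape, and the helper name `max_conflict` are assumptions chosen for illustration, and real hardware behavior depends on the GPU and the instruction used.

```python
from collections import Counter

NUM_BANKS = 32   # assumed number of LDS banks, each one 4-byte word wide
ROWS = COLS = 32 # assumed tile shape (in 4-byte elements)


def max_conflict(word_addresses):
    """Return the largest number of accesses that land in a single bank."""
    banks = Counter(addr % NUM_BANKS for addr in word_addresses)
    return max(banks.values())


col = 5  # every thread r reads element [r][col] of the tile

# Naive row-major layout: the row stride equals the bank count,
# so every row of a given column maps to the same bank.
naive = [r * COLS + col for r in range(ROWS)]

# Padding: one extra element per row shifts each row by one bank.
padded = [r * (COLS + 1) + col for r in range(ROWS)]

# XOR swizzle: element [r][c] is stored at column (c XOR r) within its row,
# which permutes the columns differently in every row without extra storage.
swizzled = [r * COLS + (col ^ r) for r in range(ROWS)]

conflicts = {
    "naive": max_conflict(naive),
    "padded": max_conflict(padded),
    "swizzled": max_conflict(swizzled),
}
print(conflicts)
```

Under this model, the naive layout places all 32 reads in one bank, while both padding and the XOR swizzle spread them across distinct banks; the swizzle does so without the extra storage that padding requires.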

Read more ...

Benchmarking Reasoning Models: From Tokens to Answers

24 July 2025

This blog shows you how to benchmark large language models’ reasoning tasks by distinguishing between mere token generation and genuine problem-solving. You will learn the importance of configuring models like Qwen3 with “thinking mode” enabled, how standard benchmarks can produce misleading results, why reasoning requires more than just generating tokens quickly, and how to build evaluations that reflect the model’s true problem-solving capabilities. Sounds interesting? Let’s dive right in!

Read more ...

Chain-of-Thought Guided Visual Reasoning Using Llama 3.2 on a Single AMD Instinct MI300X GPU

21 July 2025

In this post, we will show you how to fine-tune the Llama 3.2 Vision Instruct models, specifically the 11B and 90B parameter variants, on a synthetic multi-modal dataset using torchtune. This blog focuses on chain-of-thought (CoT) guided visual reasoning, a technique where the model is encouraged to articulate intermediate reasoning steps before arriving at a final answer. By incorporating the CoT approach, we aim to improve the model’s interpretability and accuracy in tasks that require multi-step understanding of visual inputs. By utilizing the high-bandwidth memory (HBM) of the AMD Instinct™ MI300X GPU, we aim to enhance the model’s vision understanding, particularly for interpreting charts, all on a single GPU provided by TensorWave. Our evaluation shows that we can train an 11B parameter model to perform with 2.3x better accuracy than a 90B parameter model. The blog will walk you through our dataset preparation, model configuration, training recipes, and evaluation—all optimized to run on a single GPU.

Read more ...

Introducing ROCm-LS: Accelerating Life Science Workloads with AMD Instinct™ GPUs

18 July 2025

AMD is thrilled to announce the early access release of ROCm-LS (ROCm Life Science), a new cutting-edge software toolkit designed to accelerate life science computational workloads on AMD Instinct™ GPUs. ROCm-LS joins ROCm-DS as a part of AMD’s new family of toolkits aimed at providing powerful solutions to real world problems. Similar to ROCm-DS, ROCm-LS is built upon the established ROCm software ecosystem, offering a collection of components and libraries that address the pressing needs of the life science community. The early access release of ROCm-LS enables you to experiment with accelerating your life science workloads, such as digital pathology, automated medical image analysis, and feature extraction and enhancement in large TIFF files on AMD Instinct GPUs. Join us in exploring this tantalizing glimpse into the future capabilities of ROCm-LS, setting the stage for the next evolution in life science computing.

Read more ...

Announcing hipCIM: A Cutting-Edge Solution for Accelerated Multidimensional Image Processing

18 July 2025

In the rapidly evolving landscape of data science and computational imaging, hipCIM 1.0.0 introduces a powerful, GPU-accelerated open-source library that redefines multidimensional image processing for life sciences, biomedical research, and computational imaging. This open-source, accelerated software library redefines how multidimensional datasets are processed, offering unparalleled capabilities across scientific fields such as biomedical imaging, geospatial analytics, material sciences, life sciences, and remote sensing to name a few. With the initial release of hipCIM 1.0.0, AMD enters the arena, ready to push the boundaries of life science research and stand at the forefront of a new era in multidimensional image processing.

Read more ...

Vibe Coding Pac-Man Inspired Game with DeepSeek-R1 and AMD Instinct MI300X

17 July 2025

AI systems have been constrained by their narrow capabilities and limited contextual understanding. Modern large language models (LLMs), such as GPT-4, Claude, DeepSeek, and CodeLlama, are different from previous approaches to AI. LLMs leverage vast datasets and incorporate natural language and code repositories. This enables them to understand natural language syntax, semantics, and programming logic in multiple programming languages (Python, JavaScript, C++, etc.).

Read more ...

Instella-T2I: Open-Source Text-to-Image with 1D Tokenizer and 32× Token Reduction on AMD GPUs

15 July 2025

In this blog, we introduce Instella-T2I, the text-to-image models in the AMD open-source Instella model family, built from scratch on AMD Instinct™ MI300X GPUs. We’ll walk through the model architecture, training pipeline, tokenizer innovations, and how the system scales efficiently across MI300X GPUs. Instella-T2I v0.1 sets a new baseline for scalable, high-resolution open-source text-to-image generation. You will also explore how AMD is helping advance this space and how you can get started with the model today. In Instella-T2I, we build upon the rapid advancements in large language models (LLMs) and investigate the use of decoder-only models as text encoders in T2I models as shown in Figure 1.

Read more ...

Fine-tuning Robotics Vision Language Action Models with AMD ROCm and LeRobot

14 July 2025

This blog showcases training and deploying robotics policy models on AMD Instinct™ GPUs using ROCm with Hugging Face’s LeRobot framework. Recent advancements in Vision Language Action Models (VLAs) represent a breakthrough in robotics AI, combining computer vision, language understanding, and robotic control into unified architectures that can process visual observations, understand task descriptions, and generate precise motor commands.

Read more ...

Accelerating Video Generation on ROCm with Unified Sequence Parallelism: A Practical Guide

11 July 2025

Video generation models like HunyuanVideo and Wan 2.1 are rapidly improving, producing high-fidelity text-to-video and image-to-video outputs. These models generate content with such realism that distinguishing synthetic videos from real ones is increasingly difficult. At the core of this progress lies diffusion-based generative modeling, which has evolved from traditional U-Net–style convolutional encoder-decoders to more powerful Diffusion Transformers (DiTs). This architectural shift enables better modeling of complex spatial-temporal dependencies across frames, addressing key limitations in earlier designs.

Read more ...

Nitro-T: Training a Text-to-Image Diffusion Model from Scratch in 1 Day

09 July 2025

AMD is excited to release Nitro-T, a family of text-to-image diffusion models focused on highly efficient training. Our models achieve competitive scores on image generation benchmarks compared to previous models focused on efficient training while requiring less than 1 day of training from scratch on 32 AMD Instinct MI300X GPUs.

Read more ...

vLLM V1 Meets AMD Instinct GPUs: A New Era for LLM Inference Performance

07 July 2025

vLLM has been a successful LLM inference and serving engine that excels at providing innovative features to users and developers. Earlier this year, the vLLM community introduced a major upgrade of its core engine and architecture to vLLM V1, which enhances the flexibility and scalability of the engine while retaining its core features. For simplicity, we’ll refer to vLLM V0 as “V0” and vLLM V1 as “V1” throughout this post. To align with the vLLM community’s continuous innovation, the AMD ROCm™ software team and open-source ROCm developers have enabled the fully optimized vLLM V1 engine on AMD GPUs.

Read more ...

Unlocking GPU-Accelerated Containers with the AMD Container Toolkit

03 July 2025

In the rapidly evolving fields of high-performance computing (HPC), artificial intelligence (AI), and machine learning (ML), containerization has become a cornerstone of modern application deployment. Containers provide a lightweight, portable, and scalable way to package applications and their dependencies. The integration of GPUs into these environments has become imperative. However, leveraging GPU acceleration within containers has historically been a complex and error-prone process, particularly when ensuring seamless access to GPU hardware resources.

Read more ...

Accelerated LLM Inference on AMD Instinct™ GPUs with vLLM 0.9.x and ROCm

28 June 2025

AMD is pleased to announce the release of vLLM 0.9.x, delivering significant advances in LLM inference performance through ROCm™ software and AITER integration. This release provides a variety of powerful optimizations and exciting new capabilities to the AMD ROCm software ecosystem as shown in Figure 1, below. Whether you are a developer or a researcher, this release is designed to help you unlock new levels of performance and explore wider model support on AMD Instinct™ GPUs.

Read more ...

Performance Profiling on AMD GPUs – Part 1: Foundations

26 June 2025

Read more ...

Enabling Real-Time Context for LLMs: Model Context Protocol (MCP) on AMD GPUs

20 June 2025

The Model Context Protocol (MCP) is an open protocol introduced by Anthropic that standardizes how applications provide context to large language models (LLMs). It enables AI models to interface with various data sources and tools. MCP enhances the integration of LLMs with data and tools by offering pre-built integrations, flexibility in switching between different LLM providers, and ensuring data security best practices.

Read more ...

Fine-Tuning LLMs with GRPO on AMD MI300X: Scalable RLHF with Hugging Face TRL and ROCm

18 June 2025

In this blog, you will learn how to implement GRPO-based RLHF on AMD MI300X using ROCm and Hugging Face TRL, streamlining alignment training while enhancing model reasoning and inference performance. Reinforcement Learning from Human Feedback (RLHF) constitutes a critical phase in the fine-tuning of large language models (LLMs) and multimodal architectures. Over time, RLHF methodologies have advanced beyond traditional techniques, progressing from Proximal Policy Optimization (PPO) to Direct Preference Optimization (DPO), and more recently, to Group Relative Policy Optimization (GRPO). RLHF aims to align LLM output more closely with human preferences. Reinforcement learning (RL) is an important step for enhancing an LLM’s reasoning capabilities and for improving inference-time (test-time) scaling. Beyond LLMs, DPO has also been applied to text-to-image generation.

Read more ...

Continued Pretraining: A Practical Playbook for Language-Specific LLM Adaptation

18 June 2025

What if you could make a state-of-the-art LLM fluent in a new language—without training from scratch? In this guide, we show how we did just that with Finnish.

Read more ...

Aligning Mixtral 8x7B with TRL on AMD GPUs

12 June 2025

Building a ChatGPT-like assistant is a multi-step process that starts with pre-training a large language model (LLM) on internet-scale data across clusters of thousands of GPUs, resulting in what is known as a “base model”. This base model is then refined through an instruction based supervised fine-tuning (SFT) process, which trains it to function as a useful digital assistant capable of understanding and responding accurately to a wide range of queries. Finally, human preference alignment is applied to enhance the model’s friendliness, helpfulness, and safety, ensuring that interactions are not only informative but also pleasant for users. This combination of techniques creates a sophisticated assistant that is both powerful and user-centric—exemplified by AMD’s new Instella-Long assistant.

Read more ...

Introducing Instella-Long: A Fully Open Language Model with Long-Context Capability

11 June 2025

AMD is excited to announce Instella-Long, a long-context language model continually trained from Instella-3B-Instruct on AMD Instinct™ MI300X GPUs. To our knowledge, Instella-Long makes the Instella series the first fully open language model trained from scratch that supports long context. Instella-Long supports a 128K context length and achieves competitive performance, outperforming open-weights models such as Phi-3.5-mini [1], Gemma-3-4B [2], and Qwen2.5-3B [3] on the long-context benchmark.

Read more ...

AMD ROCm: Powering the World’s Fastest Supercomputers

10 June 2025

From breaking the exaFLOP barrier with Frontier to setting new performance records with El Capitan, AMD is transforming what’s possible in high-performance computing (HPC). But the story goes beyond hardware. At the core of these world-class systems is ROCm, AMD’s open, high-performance software platform enabling new levels of scientific discovery and AI advancement.

Read more ...

LLM Quantization with Quark on AMD GPUs: Accuracy and Performance Evaluation

09 June 2025

As large language models (LLMs) grow in size and complexity, efficient inference becomes increasingly important. Quantization is a widely adopted technique to reduce memory usage and improve performance by representing weights and activations with lower-precision formats (e.g., FP16 to INT8 or FP8). This blog demonstrates how to use AMD’s Quark to quantize large language models (LLMs) on AMD GPUs, and evaluates the resulting accuracy and performance. Additionally, the runtime performance of the quantized model is benchmarked across two widely used inference frameworks: vLLM and SGLang.
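To make the idea concrete, here is a minimal sketch of symmetric per-tensor INT8 quantization in plain Python. It does not use Quark or reflect its API; the function names `quantize_int8` and `dequantize` and the sample weights are hypothetical and exist only to show how a single scale factor maps floating-point values to 8-bit integers and back.

```python
def quantize_int8(values):
    """Symmetric per-tensor quantization: map floats to integers in [-127, 127]."""
    max_abs = max(abs(v) for v in values)
    # One scale for the whole tensor; fall back to 1.0 for an all-zero input.
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate floating-point values from the integers."""
    return [q * scale for q in quantized]


weights = [0.5, -1.27, 0.0, 1.0]  # hypothetical sample weights
q_weights, scale = quantize_int8(weights)
restored = dequantize(q_weights, scale)

# Largest difference between an original weight and its reconstruction.
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(q_weights, scale, max_error)
```

Each stored integer needs one byte instead of the two bytes of an FP16 value, at the cost of a rounding error of at most half a quantization step (`scale / 2`) per value.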

Read more ...

The ROCm Revisited Series

06 June 2025

The ROCm Revisited series aims to revisit key concepts of the AMD ROCm software platform, tools, and optimizations, tailored for beginner and intermediate developers. This series shares our journey through the evolution of ROCm, highlighting the milestones, innovative technologies, and challenges we’ve overcome to establish leadership in the supercomputing space. Each post explores different aspects of ROCm’s development, focusing on how it has transformed industries, particularly in AI, machine learning, and high-performance computing (HPC). Through these blog posts, we’ll also discuss our commitment to open-source development and the future potential of distributed and energy-efficient computing. Below are the three blogs included in the series:

Read more ...

ROCm Revisited: Getting Started with HIP

06 June 2025

This blog is part of our ROCm Revisited series [1]. The purpose of this series is to share the story of ROCm and our journey through the changes and successes we’ve achieved over the past few years.

Read more ...

ROCm Revisited: Evolution of the High-Performance GPU Computing Ecosystem

06 June 2025

09 June 2025

This blog is part of our ROCm Revisited series [1]. The purpose of this series is to share the story of ROCm and our journey through the changes and successes we’ve achieved over the past few years. We’ll explore the key milestones in our development, the innovative technologies that have propelled us forward, and the challenges we’ve overcome to establish our leadership in the world of GPU computing.

Read more ...

Reproduce AMD’s MLPerf Training v5.0 Submission Result with Instinct™ GPUs

04 June 2025

In recent years, large language models (LLMs) have transformed the landscape of natural language processing, enabling breakthroughs in tasks ranging from code generation to answering complex questions. Among these, the Llama 2 model family developed by Meta has emerged as a powerful and versatile set of open weight transformer-based models, known for their competitive performance across diverse NLP benchmarks. With model sizes ranging from 7 billion to 70 billion parameters, Llama 2 has quickly become a popular choice for both research and industry after its release in 2023, striking a balance between scalability and efficiency.

Read more ...

AMD’s MLPerf Training Debut: Optimizing LLM Fine-Tuning with Instinct™ GPUs

04 June 2025

MLPerf Training is one of the most influential benchmarks in the AI community, playing a critical role in measuring and advancing the performance of machine learning training across diverse hardware and software platforms. Established to provide a fair, standardized way to evaluate training speed and efficiency on real-world workloads, MLPerf Training has become the chosen standard for researchers, engineers, and organizations striving to test the boundaries of AI capability. By fostering transparency and innovation, it focuses on progression in both academic research and industry applications, helping the community identify the most effective technologies to power the next generation of intelligent systems.

Read more ...

High-Throughput BERT-L Pre-Training on AMD Instinct™ GPUs: A Practical Guide

03 June 2025

This blog showcases an implementation of the BERT-L model on AMD Instinct™ GPUs using ROCm with advanced optimizations, including but not limited to mixed precision training, packed datasets, Flash Attention, and MLPerf-compliant techniques. BERT (Bidirectional Encoder Representations from Transformers) is a language representation model developed by researchers at Google in 2018. It is based on the Transformer architecture and processes text bidirectionally, which contrasts with traditional models that read text sequentially.

Read more ...

Scale LLM Inference with Multi-Node Infrastructure

30 May 2025

Horizontal scaling of compute resources has become a critical aspect of modern computing due to the ever-increasing growth in data and computational demands. Unlike vertical scaling, which focuses on enhancing an individual system’s resources, horizontal scaling enables the expansion of a system’s capabilities by adding more instances or nodes working in parallel. In this way, it ensures high availability and low latency of the service, making it essential to handle diverse workloads and ensure optimal user experience.

Read more ...

HIP 7.0 Is Coming: What You Need to Know to Stay Ahead

28 May 2025

20 June 2025

At AMD, we understand that code portability between AMD and NVIDIA GPU programming models is top of mind for our customers. We are committed to making GPU development more seamless and portable across vendors. With the upcoming HIP 7.0 release in the second half of 2025, we’re taking a bold step toward simplifying cross-platform programming by aligning HIP C++ even more closely with CUDA. AMD tightly integrates our automatic HIPIFY conversion tool with our HIP runtime and compiler. Users can quickly port CUDA code into HIP C++ with HIPIFY to target AMD GPUs. However, small differences between our implementation of the HIP C++ programming model and CUDA C++ often require manual intervention to adjust your code base. This causes additional work for software developers targeting GPU families from both providers. We understand this and are making changes to ROCm to reduce this friction based on customer requests. We also know adopting changes in our programming model requires early notification. We don’t take API-breaking changes lightly, and for your benefit we are making an early prototype available to assist in porting to the new HIP 7.0 API. The preview release is based on the ROCm 6.4.1 release for functionality but contains 7.0 API previews. It is a drop-in replacement for 6.4.1 intended for non-production use, enabling users to write code with the new API and adopt HIP 7.0 more smoothly. In this blog, you will learn how HIP 7.0 aligns more closely with CUDA, what API and behavior changes to expect, and how to prepare your codebase to ensure compatibility and portability across GPU platforms. Let’s delve into the details of the API changes.

Read more ...

ROCm Runfile Installer Is Here!

22 May 2025

Starting with ROCm 6.4, and after much user demand, we are introducing the ROCm Runfile Installer, a method intended primarily for network-secured environments, for users who wish to bypass a native Linux package management system, and for those who just want to download and run a single file to install ROCm.

Read more ...

From Theory to Kernel: Implement FlashAttention-v2 with CK-Tile

21 May 2025

In our previous blog, Hands on with CK Tile, we walked through how to build a basic GEMM kernel using CK-Tile. In this blog, we will further explore the implementation of a fused kernel, specifically introducing the FlashAttention (FA)-v2 forward kernel. Figure 1 provides an overview of the FlashAttention kernel executions and data movements that occur during the computation of a single thread block of the output matrix. Each of the subsequent sections explains in detail how to implement this using CK-Tile.

Read more ...

Introducing ROCm-DS: GPU-Accelerated Data Science for AMD Instinct™ GPUs

20 May 2025

AMD is excited to announce the early access release of ROCm-DS (ROCm Data Science), a new toolkit designed to accelerate data processing workloads on AMD Instinct™ GPUs. Built on the core ROCm toolkit, ROCm-DS promises to significantly enhance performance and scalability for data-intensive applications, catering to the pressing needs of today’s data-driven landscape. ROCm-DS is based on the open source libraries in the RAPIDS ecosystem. This collection of libraries enables a multitude of data processing operations, allowing new and existing workloads to tap into the computational advantages offered by AMD Instinct Datacenter GPUs. This early access release introduces two powerful new libraries: hipDF and hipGRAPH.

Read more ...

AMD Integrates llm-d on AMD Instinct MI300X Cluster For Distributed LLM Serving

20 May 2025

AMD has successfully deployed the open-source llm-d framework on AMD Kubernetes infrastructure as part of our efforts toward distributed large language model inference at scale. It leverages a Kubernetes-native toolkit to streamline LLM serving with features like KV-cache-aware routing, distributed scheduling, and integration with Inference Gateway (IGW). In this blog, we showcase an initial deployment on an AMD cluster with distributed prefill and decode stages on a Llama model.

Read more ...

Accelerate DeepSeek-R1 Inference: Integrate AITER into SGLang

16 May 2025

To achieve optimized LLM performance on GPUs, high-performance AI operators/kernels are critical. AMD recently announced AITER, a centralized repository designed to accelerate AI workloads by providing a unified collection of high-performance AI operators. It serves as a comprehensive hub for customer-level operator requests, supporting diverse needs across private, public, or custom frameworks. With both C++ and Python APIs, AITER enables developers to focus on operator development while offering flexible backend kernel implementations using Triton, CK, or assembly. AITER supports inference, training kernels, GEMM, and communication kernels, allowing flexibility across different kernel-framework pairings and architectural limitations. In this blog, we provide a comprehensive, step-by-step, hands-on guide to integrating AITER operators into SGLang for DeepSeek-R1. SGLang is a fast serving framework for large language and vision language models. For DeepSeek-R1, SGLang incorporates MLA (Multi-Head Latent Attention) optimizations and supports FP8 precision (specifically W8A8 format). These enhancements enable the identification of target modules that can be replaced with AITER-optimized solutions, improving overall efficiency and performance. AITER integration delivers significant performance improvements across the entire inference pipeline while maintaining full functional equivalence with the original architecture.

Read more ...

Step-Video-T2V Inference with xDiT on AMD Instinct MI300X GPUs

15 May 2025

The Stepfun Step-Video-T2V is a 30B parameter state-of-the-art text-to-video (T2V) model capable of generating high-quality videos of up to 204 frames. As video generation advances toward Artificial General Intelligence (AGI), such models play a key role in automating and democratizing video creation. In this blog, we introduce Step-Video-T2V with xDiT running efficiently out-of-the-box on multi-GPU systems powered by AMD Instinct™ MI300X, leveraging high-bandwidth memory and ROCm™ for fast, scalable video generation.

Read more ...

Accelerated JPEG decoding on AMD Instinct™ GPUs with rocJPEG

12 May 2025

With the increased growth in dataset sizes, the improvement of image capturing technology, the capacity to extract more information from visual data, and the move towards large language models including image data as input, efficient image processing and preparation has become a necessity to run these workloads in a timely manner. Although much attention is often focused on the computational aspects of these workloads, the fundamental tasks of data loading and preparation have become significant bottlenecks, limiting the throughput of the entire pipeline. Accelerated JPEG decoding is an essential step in optimizing workloads that rely on image data. Dive into this blog post to learn how to install and benchmark rocJPEG, as well as how the ROCm™ platform and AMD Instinct GPUs can help you achieve up to 50x faster decoding performance in 4k¹.

Read more ...

DataFrame Acceleration: hipDF and hipDF.pandas on AMD GPUs

07 May 2025

In our previous blog, CuPy and hipDF on AMD: The Basics and Beyond, we explored the fundamentals of hipDF and demonstrated the significant speedup it provides compared to Pandas for data manipulation tasks, particularly when AMD GPUs are used.

Read more ...

Unleash Full GPU Potential: Overlap Communication and Computation with Triton-Distributed

06 May 2025

In distributed computing, AI workloads demand both massive parallelism and efficient data movement. A primary challenge lies in effectively overlapping computation with communication to maximize performance. GPUs are excellent at crunching numbers. However, their full potential often remains untapped due to relatively long inter-GPU communication. This results in their computing units staying idle for large amounts of time while waiting for data transfer from other nodes. In this blog, we will show how you can use the Triton-Distributed framework to generate kernels that overlap communication and computation, resulting in performance that can rival highly optimized libraries.

Read more ...

CuPy and hipDF on AMD: The Basics and Beyond

06 May 2025

This blog introduces you to CuPy and hipDF, two GPU-oriented high-performance computing Python libraries. It shows you how to deploy CuPy and hipDF on AMD GPUs using ROCm, and demonstrates their advantages over their traditional CPU-oriented counterparts, NumPy and Pandas.

Read more ...

Optimizing DeepseekV3 Inference on SGLang Using ROCm Profiling Tools

01 May 2025

As LLMs grow in size and complexity, ensuring proper utilization of compute resources becomes critically important. Performance profiling and kernel-level analysis are essential techniques for diagnosing runtime bottlenecks such as GPU time, memory-bound operations, and inefficient device-host memory transfers. By using profiling tools like RocmProfileData (RPD) and TorchProfiler (PyTorch Profiler), developers gain granular insight into kernel execution timelines, data movement patterns, and computational hotspots. In this blog, we delve into how profiling and kernel diagnostics can expose inefficiencies in components like attention mechanisms and Mixture-of-Experts (MoE) layers, and guide targeted optimizations at the kernel level.

Read more ...

Power Up Qwen 3 with AMD Instinct: A Developer’s Day 0 Quickstart

28 April 2025

AMD is excited to announce Day 0 support for Alibaba’s latest Large Language Models, Qwen3-235B, Qwen3-32B, and Qwen3-30B, on AMD Instinct™ MI300X GPU accelerators using vLLM and SGLang. In this blog, we show you how to accelerate Alibaba’s cutting-edge Qwen 3 language models, featuring advanced reasoning, multilingual capabilities, and agent functionality, using AMD Instinct™ MI300X GPUs. You will learn to deploy dense and Mixture-of-Experts models with full support for vLLM and SGLang, leveraging AMD’s advanced GPU architecture for high-throughput, low-latency inference.

Read more ...

Boosting Llama 4 Inference Performance with AMD Instinct MI300X GPUs

28 April 2025

In our previous blog post, we explored how to deploy Llama 4 using AMD Instinct™ MI300X GPUs with vLLM. We also highlighted that MI300X and MI325X GPUs are capable of running the full 400B-parameter Llama 4 Maverick model in BF16 precision on a single node, significantly reducing infrastructure complexity. Their substantial HBM memory capacity further supports extended context lengths, enabling high throughput and efficient model execution.
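The single-node claim above can be sanity-checked with simple arithmetic. The sketch below is illustrative only: it assumes an 8-GPU node (the node size is not stated in the teaser), uses the published per-GPU HBM capacities of 192 GB (MI300X) and 256 GB (MI325X), and counts weights only, ignoring activations and the KV cache.

```python
# Back-of-the-envelope check: do the BF16 weights of a 400B-parameter model fit in one node?
BYTES_PER_BF16 = 2  # BF16 stores each value in 16 bits
HBM_GB_PER_GPU = {"MI300X": 192, "MI325X": 256}  # published per-GPU HBM capacities
GPUS_PER_NODE = 8  # assumption: an 8-GPU node


def weights_gb(num_params: int, bytes_per_param: int) -> float:
    """Memory needed for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9


def node_hbm_gb(gpu: str, gpus_per_node: int = GPUS_PER_NODE) -> int:
    """Aggregate HBM capacity of one node, in GB."""
    return HBM_GB_PER_GPU[gpu] * gpus_per_node


needed = weights_gb(400_000_000_000, BYTES_PER_BF16)
for gpu in HBM_GB_PER_GPU:
    total = node_hbm_gb(gpu)
    print(f"{gpu}: weights need {needed:.0f} GB of {total} GB HBM, "
          f"{total - needed:.0f} GB left over")
```

Whatever is left after the weights is what remains for activations and the KV cache, which is why a larger HBM capacity translates into longer supported context lengths.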

Read more ...

Beyond Text: Accelerating Multimodal AI Inference with Speculative Decoding on AMD Instinct™ MI300X GPUs

28 April 2025

In the rapidly evolving landscape of artificial intelligence, multimodal models have emerged as powerful tools capable of processing and generating content across different modalities—text, images, audio, and more. Meta’s recent release of the multimodal Llama 4 models, including Llama 4 Scout and Llama 4 Maverick, exemplifies this advancement. Despite their impressive functionalities, such models face significant computational challenges, particularly in generation speed and resource efficiency due to a much larger context length compared to text-only models. Enter speculative decoding: a promising technique that has revolutionized text generation in large language models and is now finding exciting applications in multimodal contexts. Speculative decoding allows AI models to generate outputs faster by speculating several steps ahead and confirming predictions in fewer passes. In this blog you will learn, step-by-step, how speculative decoding can help you unlock significant inference speedups for multimodal systems while maintaining output quality using ROCm on AMD Instinct MI300X GPUs.
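As a rough illustration of "speculating several steps ahead and confirming predictions in fewer passes", here is a minimal greedy sketch in plain Python. It is not the implementation from the blog or from any framework: `toy_target` and `toy_draft` are hypothetical stand-ins for a large model and a cheap draft model, and the verification loop calls the target once per position where a real system would score all drafted positions in one batched forward pass.

```python
from typing import Callable, List, Tuple

NextToken = Callable[[List[int]], int]


def speculative_decode(target: NextToken, draft: NextToken, prompt: List[int],
                       max_new: int, k: int = 4) -> Tuple[List[int], int]:
    """Greedy speculative decoding. Returns (tokens, number of verification passes)."""
    tokens = list(prompt)
    passes = 0
    while len(tokens) - len(prompt) < max_new:
        # 1. The cheap draft model speculates k tokens ahead.
        proposal: List[int] = []
        for _ in range(k):
            proposal.append(draft(tokens + proposal))
        # 2. The target model verifies the proposal; this counts as one pass.
        passes += 1
        accepted: List[int] = []
        for guess in proposal:
            truth = target(tokens + accepted)
            accepted.append(truth)  # always keep the target's own token
            if truth != guess:      # first mismatch: discard the rest of the draft
                break
        tokens.extend(accepted)
    return tokens[:len(prompt) + max_new], passes


def toy_target(context: List[int]) -> int:
    """Hypothetical 'large' model: counts upward modulo 10."""
    return (context[-1] + 1) % 10


def toy_draft(context: List[int]) -> int:
    """Hypothetical draft model: agrees with the target except right after a 4."""
    return 0 if context[-1] == 4 else (context[-1] + 1) % 10


output, passes = speculative_decode(toy_target, toy_draft, [0], max_new=8, k=4)
print(output, passes)
```

Because every kept token comes from the target model, the output matches plain greedy decoding with the target alone; the saving comes from needing fewer verification passes than generated tokens whenever the draft guesses well.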

Read more ...

Reinforcement Learning from Human Feedback on AMD GPUs with verl and ROCm Integration

24 April 2025

In this blog post, we provide an overview of Volcano Engine Reinforcement Learning for LLMs (verl) and discuss its benefits in large-scale reinforcement learning from human feedback (RLHF). We also detail the modifications made to the codebase to optimize verl’s performance on AMD Instinct GPUs. Next, we walk through the process of building the Docker image using a Dockerfile on the user side, along with training scripts tailored for both single-node and multi-node setups. Lastly, we present verl’s performance results, focusing on throughput and convergence accuracy achieved on AMD Instinct™ MI300X GPUs. Follow this guide to get started with verl on AMD Instinct GPUs and accelerate your RLHF training with ROCm-optimized performance.

Read more ...

A Step-by-Step Guide On How To Deploy Llama Stack on AMD Instinct™ GPU

22 April 2025

As a leader in high-performance computing, AMD empowers AI innovation by providing open-source tools and hardware acceleration for scalable model deployment. In this blog we will show you how this foundation can be leveraged to deploy Meta’s LLMs efficiently on AMD Instinct™ GPUs. Meta’s Llama series has democratized access to large language models, empowering developers worldwide. The Llama Stack—Meta’s all-in-one deployment framework—extends this vision by enabling seamless transitions from research to production through built-in tools for optimization, API integration, and scalability. This unified platform is ideal for teams requiring robust support to deploy Meta’s models at scale across diverse applications.

Read more ...

Hands-On with CK-Tile: Develop and Run Optimized GEMM on AMD GPUs

15 April 2025

Composable Kernel (CK-Tile) for ROCm is used to build portable high-performance kernels for accelerated computing, such as HPC, DL, and LLM training and inference workloads. The CK-Tile APIs include vendor-optimized kernels such as GEMM, BatchGemm, fused-MHA, fused-MoE, SmoothQuant, element-wise kernels, and many others. This blog focuses on creating the most commonly used GEMM kernel, incorporating a vendor-optimized kernel pipeline and policies, and covers key CK-Tile concepts for quick learning.

Read more ...

Installing ROCm from source with Spack

14 April 2025

In this guide you will learn how Spack makes building ROCm components from source easier and more flexible than other methods. This blog will walk you through installing ROCm from source using the Spack package manager. We will also discuss Spack’s place among other ROCm installation methods, the landscape of ROCm components, and show you how ROCm, as an open-source software platform, allows developers to streamline software stacks for their applications.

Read more ...

ROCm Gets Modular: Meet the Instinct Datacenter GPU Driver

11 April 2025

Today ROCm is synonymous with software for AMD’s Instinct GPUs. ROCm describes everything from the driver to the runtime to the libraries that enable AI and HPC software stacks. Starting in ROCm 6.4, we expand our software family to include the Instinct Datacenter GPU driver. The Instinct driver bifurcates from the current ROCm driver with a separate release process including an independent version number scheme, a new documentation site, and a laser focus on enabling applications on our datacenter GPU products. This change is depicted in the figure below.

Read more ...

ROCm 6.4: Breaking Barriers in AI, HPC, and Modular GPU Software

11 April 2025

In the rapidly evolving landscape of high-performance computing and artificial intelligence, innovation is the currency of progress. AMD’s ROCm 6.4 isn’t just another software update—it’s a leap forward that redefines the boundaries of what is possible for AI, developers, researchers, and enterprise innovators.

Read more ...

Unlock Peak Performance on AMD GPUs with Triton Kernel Optimizations

10 April 2025

Triton is a domain-specific programming language designed to simplify GPU programming for high-performance tasks, particularly in AI applications. It provides an open-source environment that enables users to write high-level Triton code with greater productivity compared to Nvidia CUDA or AMD HIP. The Triton compiler translates Triton code into optimized GPU instructions, effectively compiling tensor operations into low-level GPU code. It achieves high efficiency through multiple optimization passes and leverages the underlying architecture of the GPU. To optimize GPU performance, it is important to have a solid understanding of the Triton compiler and the role it plays in kernel performance. In this blog, we will deep dive into the AMD Triton compiler, introduce Triton kernel compilation, and provide insights on how to create efficient Triton kernel code.

Read more ...

Shrink LLMs, Boost Inference: INT4 Quantization on AMD GPUs with GPTQModel

09 April 2025

GPTQ (Generalized Post Training Quantization) is a technique for compressing Large Language Models (LLMs) after they have been fully trained by reducing their numerical precision. The objective of compressing the model is to reduce its memory footprint and computational requirements, making it easier to deploy it on hardware with limited resources.
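To make "reducing numerical precision" concrete, the sketch below shows plain symmetric round-to-nearest INT4 quantization and the resulting weight-memory arithmetic. This is deliberately not GPTQ itself, which goes further than simple rounding to limit the error it introduces; the function names are hypothetical and chosen only for illustration.

```python
from typing import List, Tuple


def quantize_int4(weights: List[float]) -> Tuple[List[int], float]:
    """Symmetric round-to-nearest quantization to the signed 4-bit range [-8, 7]."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 7 if max_abs > 0 else 1.0  # one shared scale for the group
    codes = [max(-8, min(7, round(w / scale))) for w in weights]
    return codes, scale


def dequantize(codes: List[int], scale: float) -> List[float]:
    """Map 4-bit integer codes back to approximate floating-point weights."""
    return [c * scale for c in codes]


def weight_bytes(num_params: int, bits_per_weight: int) -> float:
    """Storage for the weights alone (ignores scales and other metadata)."""
    return num_params * bits_per_weight / 8


weights = [1.75, -0.5, 0.3, 0.0]
codes, scale = quantize_int4(weights)
restored = dequantize(codes, scale)
print(codes, scale, restored)
print(weight_bytes(7_000_000_000, 16) / weight_bytes(7_000_000_000, 4))
```

Going from 16-bit to 4-bit weights cuts weight storage by a factor of four, at the cost of a rounding error of at most half a quantization step per weight in this simple scheme.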

Read more ...

Power Up Llama 4 with AMD Instinct: A Developer’s Day 0 Quickstart

06 April 2025

08 April 2025

AMD is excited to announce Day 0 support for Meta’s latest leading multimodal models, Llama 4 Maverick and Llama 4 Scout, on our AMD Instinct™ MI300X and MI325X GPU accelerators using vLLM. In this blog we will walk you through a step-by-step guide on deploying Meta’s Llama 4 models using vLLM, covering Docker setup, dependencies, and inference testing.

Read more ...

Reproducing the AMD Instinct™ GPUs MLPerf Inference v5.0 Submission

02 April 2025

Building upon the success of our MLPerf Inference v4.1 submission, AMD has submitted results for two popular models – Llama 2 70B and Stable Diffusion XL (SDXL) – in the MLPerf Inference v5.0 round. This blog post provides a comprehensive, step-by-step guide on reproducing the results of AMD’s MLPerf submission using ROCm and the AMD Instinct™ MI325X GPUs. Please follow along to independently verify these results and gain hands-on experience with the benchmarking process. If you are interested in learning more about the advanced optimization strategies behind our Llama 2 70B and SDXL inference, from quantization and General Matrix Multiplication (GEMM) tuning to cutting-edge vLLM scheduling and platform enhancements, check out our blog on MLPerf Inference v5.0 optimization strategies.

Read more ...

AMD Instinct™ MI325X GPUs Produce Strong Performance in MLPerf Inference v5.0

02 April 2025

The AI transformation, with the ever-increasing demands of GenAI, LLMs, reasoning models, and new advances in inference and training, emphasizes the need for innovative GPU architectures and products designed and delivered at an accelerated pace. Understanding the performance of AI models on these GPUs is critical for continuous advances in AI deployments and adoption. However, benchmarking AI models is challenging due to their inherent complexity and the variety of possible deployments and tasks. Approaching this problem from a cross-industry perspective is preferable, because it yields a benchmark that is comparable across different platforms and vendors. MLPerf is such a benchmark, created by the cross-industry MLCommons consortium, of which AMD is a founding member.

Read more ...

What’s New in the AMD GPU Operator v1.2.0 Release

28 March 2025

The GPU Operator v1.2.0 release introduces significant new features, including GPU health monitoring, automated component and driver upgrades, and a new device test runner component for enhanced validation and troubleshooting. These improvements aim to increase reliability, streamline upgrades, and provide enhanced visibility into GPU health.

Read more ...

Bring FLUX to Life on MI300X: Run and Optimize with Hugging Face Diffusers

28 March 2025

AI-based text-to-image generation is pushing the boundaries of creative and visual storytelling, enabling the masses to draw like an artist. Stability AI introduced the Stable Diffusion models, which were a breakthrough in text-to-image generation. However, FLUX, a new state-of-the-art open-source model released by Black Forest Labs, is gaining popularity for its flexibility and controllability.

Read more ...

Accelerating LLM Inference: Up to 3x Speedup on MI300X with Speculative Decoding

27 March 2025

In this blog you will learn how speculative decoding boosts LLM inference, providing out-of-the-box speedups in LLM token generation on the AMD Instinct™ MI300X GPU. We start the blog by providing you with a brief overview of Speculative Decoding. We then demonstrate, through extensive benchmarking on a number of LLMs and datasets, as well as on different frameworks viz. vLLM and native PyTorch (gpt-fast), speedups in the range of 1.3x - 3x in the LLM generation throughput (tokens/second) through speculative decoding as compared to running a vanilla LLM for batch size 1. We show you how these speedups vary for batch sizes greater than 1 in vLLM. Finally, we will share a detailed profiling-based case study to identify some high-level differences between these two frameworks, i.e. the type of kernels that are launched and their overall latencies, which are critical differentiators between the performance of these frameworks. Let’s get started!

Read more ...

Introducing ROCprofiler SDK - The Latest Toolkit for Performance Profiling

25 March 2025

Profiling is the backbone of performance optimization in AI and HPC workloads, enabling developers to extract maximum efficiency from AMD Instinct™ GPUs. With ROCm’s rapid evolution, the need for a unified, scalable, and extensible profiling framework has never been more critical. The new ROCprofiler-SDK framework represents a significant step forward in profiling capabilities, offering enhanced features, streamlined integration, and a better user experience while also solving past limitations with former profiler interface versions. This guide aims to help users seamlessly transition from legacy profiling tools to the ROCprofiler-SDK infrastructure. We will explore new features, highlight key differences from previous tools, and provide actionable steps for a smooth migration.

Read more ...

Speculative Decoding - Deep Dive

24 March 2025

LLM serving has become an increasingly popular service in the technology industry, with thousands of requests sent to LLM servers and responses generated and sent back to clients all over the world. The performance of online serving, one of the key metrics for evaluating user experience and service quality, has attracted attention from both industry and academia.

Read more ...

Efficient MoE training on AMD ROCm: How-to use Megablocks on AMD GPUs

23 March 2025

Training massive deep-learning models requires a balance of efficiency and scalability. In the context of the Transformer architecture, Mixture of Experts (MoE) models are massive machine learning architectures characterized by dividing tasks among multiple specialized sub-networks or “experts”. A gating network determines the expert to which a given input should be routed, enabling the model to handle complex tasks more efficiently by using the specialized capabilities of each expert. This dynamic routing mechanism allows MoE models to scale efficiently, activating only a subset of the network for each input, thereby reducing computational load while maintaining high model capacity.
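The routing idea can be sketched in a few lines of plain Python. This is a toy top-k gate, not Megablocks or any production MoE layer: the experts are hypothetical scalar functions, and the gate logits are given directly rather than computed by a learned gating network.

```python
import math
from typing import Callable, List, Sequence, Tuple


def softmax(xs: Sequence[float]) -> List[float]:
    """Numerically stable softmax over the gate logits."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def route_top_k(gate_logits: Sequence[float], k: int = 2) -> List[Tuple[int, float]]:
    """Return (expert index, renormalized weight) for the k highest-scoring experts."""
    probs = softmax(gate_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return [(i, probs[i] / norm) for i in top]


def moe_forward(x: float, gate_logits: Sequence[float],
                experts: Sequence[Callable[[float], float]], k: int = 2) -> float:
    """Run only the selected experts and mix their outputs by gate weight."""
    return sum(w * experts[i](x) for i, w in route_top_k(gate_logits, k))


def make_expert(scale: float) -> Callable[[float], float]:
    """Hypothetical expert: multiplies its input by a fixed scale."""
    return lambda x: scale * x


experts = [make_expert(s) for s in (1.0, 2.0, 3.0, 4.0)]
logits = [0.1, 2.0, -1.0, 1.5]
routed = route_top_k(logits, k=2)
y = moe_forward(1.0, logits, experts, k=2)
print(routed, y)
```

With four experts and k = 2, only half of the expert sub-networks run for this input, which is the source of the compute savings described above.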

Read more ...

Supercharge DeepSeek-R1 Inference on AMD Instinct MI300X

21 March 2025

Our previous blog post on this topic discussed how DeepSeek-R1 achieves competitive performance on AMD Instinct™ MI300X GPUs. We also included performance comparisons against Nvidia H200 GPUs and a short demo application illustrating real-world usage. In this blog we will delve into how using the SGLang framework, critical kernel optimizations like AI Tensor Engine for ROCm™, and hyperparameter tuning helps to achieve performance boosts.

Read more ...

AITER: AI Tensor Engine For ROCm

21 March 2025

24 March 2025

Performance optimization is critical when working with GPUs, especially for tasks involving artificial intelligence, which can be extremely demanding. To fully leverage the capabilities of advanced hardware, it’s essential to master optimization strategies and ensure every available resource is utilized efficiently. In this blog we will provide an overview of AMD’s AI Tensor Engine for ROCm (AITER) and show you how easy it is to integrate AITER kernels in basic LLM training and inference workloads. AITER helps developers focus on creating operators while allowing customers to seamlessly integrate this operator collection into their own private, public, or custom frameworks.

Read more ...

Deploying Google’s Gemma 3 Model with vLLM on AMD Instinct™ MI300X GPUs: A Step-by-Step Guide

14 March 2025

AMD is excited to announce the integration of Google’s Gemma 3 models with AMD Instinct MI300X GPUs, optimized for high-performance inference using the vLLM framework. This collaboration empowers developers to harness advanced AMD AI hardware for scalable, efficient deployment of state-of-the-art language models. In this blog we will walk you through a step-by-step guide on deploying Google’s Gemma 3 model using vLLM on AMD Instinct GPUs, covering Docker setup, dependencies, authentication, and inference testing. Remember, the Gemma 3 model is gated—ensure you request access before beginning deployment.

Read more ...

Analyzing the Impact of Tensor Parallelism Configurations on LLM Inference Performance

14 March 2025

As AI models continue to scale in size and complexity, deploying them efficiently requires strategic resource allocation. Tensor parallelism (TP) is a valuable technique for distributing workloads across multiple GPUs, reducing memory constraints, and enabling inference for large-scale models. However, the choice of TP configuration isn’t one-size-fits-all—it directly impacts performance, networking overhead, and cost efficiency.
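One common way tensor parallelism distributes a layer is to split a weight matrix by columns, let each GPU compute its slice of the output, and then gather the slices. The sketch below simulates that in plain Python with lists standing in for per-GPU shards; it is a conceptual illustration under that assumption, not how vLLM or any specific framework implements TP.

```python
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Reference dense matrix multiply."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def split_columns(w: Matrix, parts: int) -> List[Matrix]:
    """Split w into `parts` column shards (assumes the column count divides evenly)."""
    step = len(w[0]) // parts
    return [[row[p * step:(p + 1) * step] for row in w] for p in range(parts)]


def column_parallel_matmul(x: Matrix, w: Matrix, tp_size: int) -> Matrix:
    """Each simulated GPU multiplies x by its own shard; results are concatenated."""
    partials = [matmul(x, shard) for shard in split_columns(w, tp_size)]
    gathered: Matrix = []
    for i in range(len(x)):
        row: List[float] = []
        for partial in partials:  # stands in for an all-gather across GPUs
            row.extend(partial[i])
        gathered.append(row)
    return gathered


x = [[1, 2], [3, 4]]
w = [[1, 0, 2, 1], [0, 1, 1, 3]]
result = column_parallel_matmul(x, w, tp_size=2)
print(result)
```

Each shard holds only 1/tp_size of the weights, which eases per-GPU memory pressure, while the gather step is the communication cost that grows with the TP degree. That trade-off is why the choice of TP configuration affects both performance and networking overhead.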

Read more ...

Optimized ROCm Docker for Distributed AI Training

13 March 2025

This blog will introduce you to the updated AMD Docker image, specifically built and optimized for distributed training. As you will see, the optimized AMD ROCm Docker image makes training large AI models faster and more efficient. It includes updates such as better fine-tuning tools, improved performance for multi-GPU setups, and support for FP8 precision, which helps speed up training while using less memory, and can provide you with an overall smoother and more efficient training experience on popular models such as Flux and Llama 3.1 running on AMD GPUs.

Read more ...

AI Inference Orchestration with Kubernetes on Instinct MI300X, Part 3

13 March 2025

Welcome back to the final part of our series! So far, we’ve successfully set up a Kubernetes cluster and installed the AMD GPU Operator to seamlessly integrate AMD hardware with Kubernetes in Part 1. We’ve deployed vLLM on AMD Instinct MI300X GPUs, exposed it using MetalLB, and scaled it efficiently in Part 2.

Read more ...

AMD Advances Enterprise AI Through OPEA Integration

12 March 2025

AMD is excited to support the Open Platform for Enterprise AI (OPEA) to simplify and accelerate enterprise AI adoption. With the enablement of the OPEA GenAI framework on the AMD ROCm™ software stack, businesses and developers can now create scalable, efficient GenAI applications on AMD data center GPUs. Enterprises today face significant challenges when deploying AI at scale, including the complexity of integrating GenAI models, managing GPU resources, ensuring security, and maintaining workflow flexibility. AMD and OPEA aim to address these challenges and streamline AI adoption. This blog will explore the significance of this collaboration, AMD’s contribution to the OPEA project, and demonstrate how to deploy a code translation OPEA GenAI use case on the AMD Instinct™ MI300X GPU.

Read more ...

Instella-VL-1B: First AMD Vision Language Model

07 March 2025

As part of AMD’s newly released Instella family we are thrilled to introduce Instella-VL-1B, the first AMD vision language model for image understanding trained on AMD Instinct™ MI300X GPUs. Our journey with Instella-VL builds upon our previous 1-billion-parameter language models, AMD OLMo SFT. We further extend the language model’s visual understanding abilities by connecting it with a vision encoder (which is initialized from CLIP ViT-L/14-336). During training, we jointly finetune vision encoder and language model with vision-language data in three stages: Alignment, Pretraining and Supervised-Finetuning (SFT).

Read more ...

Introducing Instella: New State-of-the-art Fully Open 3B Language Models

05 March 2025

AMD is excited to announce Instella, a family of fully open state-of-the-art 3-billion-parameter language models (LMs) trained from scratch on AMD Instinct™ MI300X GPUs. Instella models outperform existing fully open models of similar sizes and achieve competitive performance compared to state-of-the-art open-weight models such as Llama-3.2-3B, Gemma-2-2B, and Qwen-2.5-3B, including their instruction-tuned counterparts.

Read more ...

Understanding RCCL Bandwidth and xGMI Performance on AMD Instinct™ MI300X

02 March 2025

Efficient inter-GPU communication is the backbone of high-performance AI and HPC workloads, where technologies like RCCL and xGMI play pivotal roles. However, some limitations in achieving theoretical peak bandwidth have raised questions about performance bottlenecks. In this blog we explain what prevents multi-GPU clusters from reaching the theoretical maximum bandwidth, and teach you a set of diagnostics and performance-tuning strategies that will help you optimize RCCL and xGMI bandwidth on AMD MI300X systems. We will first introduce you to xGMI and its performance constraints, then to RCCL and its bandwidth limitations, and finally cover several practical benchmarks and best practices for maximizing RCCL efficiency.

Read more ...

Measuring Max-Achievable FLOPs – Part 2

28 February 2025

In our previous blog post, we explored the conceptual differences between Peak FLOPs and Max-Achievable FLOPs (MAF), explaining why the gap between these metrics has widened with modern ML-optimized hardware. This second installment provides a detailed methodology for measuring MAF on AMD GPUs, including the specific environmental conditions, matrix size optimization techniques, and tools required for accurate measurement. We present the actual MAF results for AMD Instinct MI300X and MI325X GPUs across different precision formats (FP16, BF16, and FP8) along with their corresponding median frequencies. We also explain how software efficiency and frequency management impact MAF, and demonstrate why boost clock capabilities remain important for latency-sensitive workloads such as LLM inference with small batch sizes.

Read more ...

Deploying Serverless AI Inference on AMD GPU Clusters

25 February 2025

Deploying Large Language Models (LLMs) in enterprise environments presents a multitude of challenges that organizations must navigate to harness their full potential. As enterprises expand their AI and HPC workloads, scaling the underlying compute and GPU infrastructure presents numerous challenges, including deployment complexities, resource optimization, and effective management of the compute resource fleet. In this blog, we will walk you through how to spin up a production-grade serverless AI inference service on Kubernetes clusters by leveraging the open source Knative and KServe technologies.

Read more ...

Unlock DeepSeek-R1 Inference Performance on AMD Instinct™ MI300X GPU

21 February 2025

In this blog, we explore how DeepSeek-R1 achieves competitive performance on AMD Instinct™ MI300X GPUs, along with performance comparisons to H200 and a short demo application showcasing real-world usage. By leveraging MI300X, users can deploy DeepSeek-R1 and V3 models on a single node with impressive efficiency. In just two weeks, optimizations using SGLang have unlocked up to a 4X boost in inference speed, ensuring efficient scaling, lower latency, and optimized throughput. The MI300X’s high-bandwidth memory (HBM) and compute power enable execution of complex AI workloads, handling longer sequences and demanding reasoning tasks. With AMD and the SGLang community driving ongoing optimizations—including fused MoE kernels, MLA kernel fusion, and speculative decoding—MI300X is set to deliver an even more powerful AI inference experience.

Read more ...

How to Build a vLLM Container for Inference and Benchmarking

21 February 2025

Welcome back! If you’ve been following along with this series, you’ve already learned about the basics of ROCm containers. Today, we’ll build on that foundation by creating a container for large language model inference with vLLM.

Read more ...

Fine-tuning Phi-3.5-mini LLM at scale: Harnessing Accelerate and Slurm for multinode training

19 February 2025

In this blog you will learn the process of fine-tuning the Phi-3.5-mini-instruct Large Language Model (LLM) from Microsoft, using PyTorch in a multinode environment. The setup leverages the Hugging Face Accelerate library to handle the complexities of multi-GPU and multinode synchronization. Slurm is used to schedule and coordinate the job as a workload manager for high-performance computing environments. A custom Slurm Bash script launches the Docker containers on each node, ensuring the training environment is consistent across all machines. Inside the containers, PyTorch and the Accelerate library split the training data, synchronize the model updates, and optimize performance across the multinode setup. This approach lets you efficiently fine-tune large-scale models and reduce training time while maximizing hardware utilization across the entire cluster.

Read more ...

Understanding Peak, Max-Achievable & Delivered FLOPs, Part 1

14 February 2025

The purpose of this blog post is to explain the differences between Peak FLOPs and Max-Achievable FLOPs. After reading, users will know how AMD measures maximum delivered performance, and how AMD recommends that measured device performance be used.

Read more ...

AI Inference Orchestration with Kubernetes on Instinct MI300X, Part 2

14 February 2025

Welcome to Part 2 of our series on utilizing Kubernetes with the AMD Instinct platform! If you’re just joining us, we recommend checking out Part 1 where we covered setting up your Kubernetes cluster and enabling AMD GPU support.

Read more ...

Navigating vLLM Inference with ROCm and Kubernetes

13 February 2025

Kubernetes (often abbreviated as K8s) is an open-source platform designed for automating the deployment, scaling, and management of containerized applications. Developed by Google and now maintained by the Cloud Native Computing Foundation, Kubernetes enables developers to build, run, and manage applications across any infrastructure.

Read more ...

PyTorch Fully Sharded Data Parallel (FSDP) on AMD GPUs with ROCm

09 February 2025

PyTorch Fully Sharded Data Parallel (FSDP) is a data parallelism technique that enables the training of large-scale models in a memory-efficient manner. FSDP achieves this memory efficiency by sharding model parameters, optimizer states, and/or gradients across GPUs, reducing the memory footprint required by each GPU. This enables the training of large-scale models with lower total GPU memory than DDP (Distributed Data Parallel), in which the model weights and optimizer states are replicated across all processes. To learn more about DDP, refer to Distributed Data Parallel (DDP) training on AMD GPU with ROCm.
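The memory argument can be put into rough numbers. The sketch below compares per-GPU memory for model state under full replication (as in DDP) and full sharding (as in FSDP). The 2 + 2 + 12 bytes per parameter split is an assumption for mixed-precision training with an Adam-style optimizer, not a figure from the blog, and activations are ignored.

```python
def per_gpu_state_bytes(num_params: int, num_gpus: int, sharded: bool,
                        bytes_param: int = 2,    # assumption: 16-bit parameters
                        bytes_grad: int = 2,     # assumption: 16-bit gradients
                        bytes_optim: int = 12    # assumption: FP32 master copy + two Adam moments
                        ) -> float:
    """Per-GPU memory for parameters, gradients, and optimizer state (no activations)."""
    total = num_params * (bytes_param + bytes_grad + bytes_optim)
    # DDP keeps a full replica on every GPU; full sharding splits the state evenly.
    return total / num_gpus if sharded else float(total)


params = 1_000_000_000  # hypothetical 1B-parameter model
ddp = per_gpu_state_bytes(params, num_gpus=8, sharded=False)
fsdp = per_gpu_state_bytes(params, num_gpus=8, sharded=True)
print(f"replicated: {ddp / 1e9:.0f} GB per GPU, fully sharded: {fsdp / 1e9:.0f} GB per GPU")
```

In practice the sharded figure is a lower bound: FSDP temporarily gathers the full parameters of the layers it is currently computing, so peak usage sits somewhat above the evenly divided number.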

Read more ...

MI300A - Exploring the APU advantage

09 February 2025

This blog post will introduce you to the advantages of the AMD Instinct™ MI300A accelerated processing unit (APU), discussing the hardware architecture and how to leverage its GPU programming capabilities.

Read more ...

Deep dive into the MI300 compute and memory partition modes

09 February 2025

This blog introduces the inner compute and memory architecture of the AMD Instinct™ MI300, showing you how to use the MI300 GPU’s different partition modes to supercharge performance-critical applications. In this blog, you will first get a brief introduction to the MI300 architecture, explaining how the MI300 compute and memory partitions can be used to your advantage. You will then learn in detail the compute partitioning modes and the memory partitioning modes. Further, two case studies demonstrate and benchmark the performance of the different modes. For convenience, this blog uses the MI300X as a case-in-point example.

Read more ...

AI Inference Orchestration with Kubernetes on Instinct MI300X, Part 1

07 February 2025

As organizations scale their AI inference workloads, they face the challenge of efficiently deploying and managing large language models across GPU infrastructure. This three-part blog series provides a production-ready foundation for orchestrating AI inference workloads on the AMD Instinct platform with Kubernetes.

Read more ...

GEMM Kernel Optimization For AMD GPUs

06 February 2025

Matrix multiplication underlies critical computational pathways in AI, with General Matrix Multiplication (GEMM) operations serving as performance-critical kernels in neural network architectures. From fully connected layers to convolutions and transformer attention mechanisms, GEMMs consume substantial computational and memory resources in large language models (LLMs). This blog explores GEMM optimization techniques for AMD GPUs, demonstrating methodologies to significantly enhance computational efficiency and performance scaling.

Read more ...

Enhancing AI Training with AMD ROCm Software

31 January 2025

ROCm™ has emerged as a premier open software stack designed to address the evolving needs of AI and machine learning workloads. Built for inference and training, ROCm delivers leadership performance, empowering developers and organizations to optimize their workloads for efficiency, scalability, and cost-effectiveness.

Read more ...

Best practices for competitive inference optimization on AMD Instinct™ MI300X GPUs

29 January 2025

Optimizing LLM performance on GPUs is challenging due to diverse model needs, memory constraints, and the need to balance latency and throughput. This document examines how hardware utilization, memory and communication bandwidth, and scaling contribute to inference performance, detailing optimal configurations for AMD Instinct™ MI300X GPUs.

Read more ...

Announcing the AMD GPU Operator and Metrics Exporter

29 January 2025

As AI workloads continue to grow in complexity and scale, we’ve consistently heard one thing from our customers: “Managing GPU infrastructure shouldn’t be the hard part”. For many, this is where Kubernetes comes into play. Kubernetes allows customers to easily manage and deploy their AI workloads at scale by providing a robust platform for automating deployment, scaling, and operations of application containers across clusters of hosts. It ensures that your applications run consistently and reliably, regardless of the underlying infrastructure. A pod is the smallest and simplest Kubernetes object. It represents a single instance of a running process in your cluster and can contain one or more containers. Pods are used to host your application workloads and are managed by Kubernetes to ensure they run as expected. Having pods be able to leverage GPUs on your cluster, however, is not something that is trivial.

Read more ...

Distributed fine-tuning of MPT-30B using Composer on AMD GPUs

28 January 2025

Composer, developed by MosaicML, is an open-source deep learning training library built on top of PyTorch, designed to simplify and optimize distributed training workflows. It supports scalable training on multiple nodes and efficiently handles datasets of various sizes. Composer integrates advanced techniques such as PyTorch Fully Sharded Data Parallelism (FSDP), elastic sharded checkpointing, training callbacks, and speed-up algorithms to enhance training performance and flexibility. It closely resembles PyTorch’s torchrun and has demonstrated exceptional efficiency when scaling to hundreds of GPUs.

Read more ...

Vision Mamba on AMD GPU with ROCm

24 January 2025

State Space Models (SSMs), such as Mamba, have emerged as a potential alternative to Transformer models. Vision backbones using only SSMs have yielded promising results. For more information about SSMs and Mamba’s performance on AMD hardware, see Mamba on AMD GPUs with ROCm. This blog explores Vision Mamba (Vim), an innovative and efficient backbone for vision tasks, and evaluates its performance on AMD GPUs with ROCm. We’ll start with a brief introduction to Vision Mamba, followed by a step-by-step guide on training and running inference with Vision Mamba on AMD GPUs using ROCm.

Read more ...

Getting started with AMD ROCm containers: from base images to custom solutions

16 January 2025

Having worked in technology for over two decades, I’ve witnessed firsthand how containerization has transformed the way we develop and deploy applications. Containers package applications with their dependencies into standardized units, making software portable and consistent across different environments. When we combine this containerization power with AMD Instinct™ Accelerators, we get a powerful solution for quickly deploying AI and machine learning workloads. In this blog, the first in a series exploring ROCm containerization, I want to share my insights about AMD ROCm™ containers and show you how to build and customize your own GPU-accelerated workloads. You’ll learn how to select appropriate base images, modify containers for your specific needs, and implement best practices for GPU-enabled containerization - all with hands-on examples you can try yourself.

Read more ...

Boosting Computational Fluid Dynamics Performance with AMD Instinct™ MI300X

14 January 2025

This blog will guide you, step-by-step, through the process of installing and running benchmarks with Ansys Fluent and AMD MI300X. We start with an overview of the Ansys Fluent CFD application and then show you how to set up an AMD MI300X system to run benchmarks. The benchmark results demonstrate the dramatic impact the MI300X has on speeding up simulations, improving design efficiency, and reducing costs in the automotive, aerospace, and environmental engineering industries.

Read more ...

Triton Inference Server with vLLM on AMD GPUs

08 January 2025

Triton Inference Server is an open-source platform designed to streamline AI inferencing. It supports the deployment, scaling, and inference of trained AI models from various machine learning and deep learning frameworks including TensorFlow, PyTorch, and vLLM, making it adaptable for diverse AI workloads. It is designed to work across multiple environments, including cloud, data centers, and edge devices.

Read more ...

Training Transformers and Hybrid models on AMD Instinct MI300X Accelerators

10 December 2024

This blog is contributed by Zyphra: a Palo Alto-based AI research lab and AMD Instinct Partner.

Read more ...

Transformer based Encoder-Decoder models for image-captioning on AMD GPUs

03 December 2024

Image captioning, or the GenAI-based automatic generation of concise textual descriptions of images, has immensely important real-world applications. For example, image captioning can provide visually impaired users with textual descriptions of images for improved accessibility, add textual descriptions to products in e-commerce applications, and help children map images to their textual descriptions in early childhood educational apps. It can also automatically describe objects and events in security camera footage in surveillance applications and enable robots to auto-generate textual captions for the objects and events they encounter in human-robot interaction (HRI) applications, among many other uses. Image captioning is a sequence-to-sequence (seq2seq) machine learning task: a model converts a sequence from one domain (in this case, the image) to another (its textual description). In image captioning the image is partitioned into a sequence of patches. This sequence of image patches is then converted by the model to a corresponding sequence of text tokens.

Read more ...

SGLang: Fast Serving Framework for Large Language and Vision-Language Models on AMD Instinct GPUs

13 November 2024

In the rapidly evolving landscape of artificial intelligence, the ability to deploy large language models (LLMs) and vision-language models (VLMs) efficiently is crucial for real-time applications. SGLang is an open-source framework designed to meet these demands by delivering a fast backend runtime, a flexible frontend language, and extensive model support for a variety of LLMs and VLMs.

Read more ...

Quantized 8-bit LLM training and inference using bitsandbytes on AMD GPUs

13 November 2024

In this blog post we will cover the bitsandbytes 8-bit representations. As you will see, the bitsandbytes 8-bit representations significantly help reduce the memory needed for fine-tuning and inferencing LLMs. There are many quantization techniques used in the field to decrease a model size, but bitsandbytes offers quantization to decrease the size of optimizer states as well. This post will help you understand the basic principles underlying the bitsandbytes 8-bit representations, explain the bitsandbytes 8-bit optimizer and LLM.int8 techniques, and show you how to implement these on AMD GPUs using ROCm.

Read more ...

Introducing AMD’s Next-Gen Fortran Compiler

13 November 2024

We are excited to share a brief preview of AMD’s Next-Gen Fortran Compiler, our new open source Fortran compiler supporting OpenMP offloading. AMD’s Next-Gen Fortran Compiler is a downstream flavor of LLVM Flang, optimized for AMD GPUs. Our Next-Gen Fortran Compiler enables OpenMP offloading and offers a direct interface to ROCm and HIP. In this blog post you will:

Read more ...

Distributed Data Parallel Training on AMD GPU with ROCm

01 November 2024

With the increase in complexity and size of machine learning models, the demand for computational resources grows. Training on a single GPU can become a bottleneck for deep learning applications, especially with large datasets and models that are slow to train on a single GPU. Parallelized training addresses this challenge. Out of the various forms of parallelized training, this blog focuses on Distributed Data Parallel (DDP), a key feature in PyTorch that accelerates training across multiple GPUs and nodes.

Read more ...

Torchtune on AMD GPUs How-To Guide: Fine-tuning and Scaling LLMs with Multi-GPU Power

24 October 2024

This blog provides a thorough how-to guide on using Torchtune to fine-tune and scale large language models (LLMs) with AMD GPUs. Torchtune is a PyTorch library designed to let you easily fine-tune and experiment with LLMs. Using Torchtune’s flexibility and scalability, we show you how to fine-tune the Llama-3.1-8B model for summarization tasks using the EdinburghNLP/xsum dataset. Using LoRA (Low-Rank Adaptation), a parameter-efficient fine-tuning technique, Torchtune enables efficient training while maintaining performance across different numbers of GPUs (2, 4, 6, and 8). This post also highlights how Torchtune’s distributed training capabilities allow users to scale up LLM fine-tuning on multiple GPUs to reduce training time while maintaining the quality of the trained model, demonstrating its potential and usage on modern AMD hardware using ROCm.

Read more ...

CTranslate2: Efficient Inference with Transformer Models on AMD GPUs

24 October 2024

Transformer models have revolutionized natural language processing (NLP) by delivering high-performance results in tasks like machine translation, text summarization, text generation, and speech recognition. However, deploying these models in production can be challenging due to their high computational and memory requirements. CTranslate2 addresses these challenges by providing a custom runtime that implements various optimization techniques to accelerate Transformer models during inference.

Read more ...

Inference with Llama 3.2 Vision LLMs on AMD GPUs Using ROCm

23 October 2024

Meta’s Llama models now support multimodal capabilities, expanding their functionality beyond traditional text-only applications. The Llama 3.2 models are available in a range of sizes, including medium-sized 11B and 90B multimodal models for vision-text reasoning tasks, and lightweight 1B and 3B text-only models designed for edge and mobile devices.

Read more ...

Speed Up Text Generation with Speculative Sampling on AMD GPUs

15 October 2024

24 February 2025

As the size of transformer models grows, so does the cost of conducting inference, impacting latency and throughput. Compression methods such as quantization and distillation, as well as hardware-aware optimizations such as Flash Attention and Triton, have been proposed to cut down the computation cost at different levels. However, these methods either compromise on accuracy or require major changes to the model implementation.

Read more ...

Multinode Fine-Tuning of Stable Diffusion XL on AMD GPUs with Hugging Face Accelerate and OCI’s Kubernetes Engine (OKE)

15 October 2024

As the scale and complexity of generative AI and deep learning models grow, multinode training, which divides a training job across several processors, has become an essential strategy to speed up training and fine-tuning of large generative AI models like SDXL. By distributing the training workload across multiple GPUs on multiple nodes, multinode setups can significantly accelerate the training process. In this blog post we will show you, step by step, how to set up and fine-tune a Stable Diffusion XL (SDXL) model on a multinode Oracle Cloud Infrastructure (OCI) Kubernetes Engine (OKE) cluster with AMD GPUs using ROCm.

Read more ...

Supercharging JAX with Triton Kernels on AMD GPUs

09 October 2024

Ready to supercharge your deep learning applications on AMD GPUs? In this blog, we’ll show you how to develop a custom fused dropout activation kernel for matrices in Triton, seamlessly call it from JAX, and benchmark its performance with ROCm. This powerful combination will take your model’s performance to the next level.

Read more ...

Leaner LLM Inference with INT8 Quantization on AMD GPUs using PyTorch

03 October 2024

With the scale of large language models (LLMs) reaching hundreds of billions of parameters, the ways we represent data within these enormous models dramatically impact the resources required to train and run them (e.g. the number of GPUs needed for inference). In our previous blogs (JAX mixed precision training; PyTorch AMP), we already demonstrated how mixed precision training can accelerate the LLM training process. In this blog post we will push things further and show you how quantization into even lower-precision data formats can speed up inference, saving time and memory, without sacrificing the overall performance of the model. Quantization is a technique where the precision of a model’s parameters is reduced from 32-bit floating point (FP32) or 16-bit floating point (FP16) to an 8-bit integer (INT8). Standard models typically use FP32 precision. However, this higher precision is not always necessary for inference tasks. By converting model weights and activations to lower-precision formats like INT8, we can achieve faster computations and lower memory usage, effectively reducing the model size by three-fourths (from 32-bit) or half (from 16-bit) with only a slight accuracy reduction, which is often outweighed by the speed gains.
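
To make the idea concrete, here is a minimal sketch of symmetric per-tensor INT8 quantization in plain Python. It is not the PyTorch workflow the post covers; the helper names `quantize_int8` and `dequantize` and the sample weights are assumptions made for illustration only.

```python
def quantize_int8(values):
    """Symmetric per-tensor quantization: map floats to integers in [-127, 127]."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    # Round to the nearest integer step and clamp to the INT8 range used here.
    q = [max(-127, min(127, round(v / scale))) for v in values]
    return q, scale


def dequantize(q, scale):
    """Recover approximate float values from the quantized integers."""
    return [qi * scale for qi in q]


weights = [0.02, -1.27, 0.64, 0.0]  # hypothetical FP32 weights
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_error = max(abs(a - b) for a, b in zip(weights, restored))

# Storage arithmetic from the excerpt: 8 bits per value instead of 32 or 16.
saving_vs_fp32 = 1 - 8 / 32  # three-fourths smaller
saving_vs_fp16 = 1 - 8 / 16  # half the size
```

The rounding error per value is bounded by half a quantization step (`scale / 2`), which is the source of the slight accuracy reduction mentioned above.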

Read more ...

Fine-tuning Llama 3 with Axolotl using ROCm on AMD GPUs

23 September 2024

Large Language Models (LLMs) have revolutionized the field of natural language processing, enabling machines to understand and generate human-like language. However, these models are often trained on vast amounts of general-purpose data, which can make them less effective for specific tasks or domains. Fine-tuning involves training a pre-trained LLM on a specialized dataset to enhance its performance on specific tasks. As Andrej Karpathy analogized, this process is akin to allowing someone to practice a particular skill. Just as a person might need to practice a skill in a specific context to become proficient, an LLM needs to be fine-tuned on a specific dataset to become proficient in a particular task. For instance, an LLM can be fine-tuned for tasks such as financial forecasting, technical support, legal advising, medical diagnosis, or even instruction following. By fine-tuning an LLM, organizations can achieve better results and improve information security by limiting the exposure of sensitive data.

Read more ...

Inferencing and serving with vLLM on AMD GPUs

19 September 2024

09 June 2025

In the rapidly evolving field of artificial intelligence, Large Language Models (LLMs) have emerged as powerful tools for understanding and generating human-like text. However, deploying these models efficiently at scale presents significant challenges. This is where vLLM comes into play. vLLM is an innovative open-source library designed to optimize the serving of LLMs using advanced techniques. Central to vLLM is PagedAttention, a novel algorithm that enhances the efficiency of the model’s attention mechanism by managing it as virtual memory. This approach optimizes GPU memory utilization, facilitating the processing of longer sequences and enabling more efficient handling of large models within existing hardware constraints. Additionally, vLLM incorporates continuous batching to maximize throughput and minimize latency. By leveraging these cutting-edge techniques, vLLM significantly improves the performance and scalability of LLM deployment, allowing organizations to harness the power of state-of-the-art AI models more effectively and economically.

Read more ...

Enhancing vLLM Inference on AMD GPUs

19 September 2024

09 June 2025

In this blog, we’ll demonstrate the latest performance enhancements in vLLM inference on AMD Instinct accelerators using ROCm 6.2. In a nutshell, vLLM optimizes GPU memory utilization, allowing more efficient handling of large language models (LLMs) within existing hardware constraints, maximizing throughput and minimizing latency. We start the blog by briefly explaining how causal language models like Llama 3 and ChatGPT generate text, motivating the need to enhance throughput and reduce latency. If you’re new to vLLM, we also recommend reading our introduction to Inferencing and serving with vLLM on AMD GPUs. ROCm 6.2 introduces support for the following vLLM features which we will use in this blog post.

Read more ...

Getting to Know Your GPU: A Deep Dive into AMD SMI

17 September 2024

For system administrators and power users working with AMD hardware, performance optimization and efficient monitoring of resources is paramount. The AMD System Management Interface command-line tool, amd-smi, addresses these needs.

Read more ...

Introducing the AMD ROCm™ Offline Installer Creator: Simplifying Deployment for AI and HPC

10 September 2024

Read more ...

Optimize GPT Training: Enabling Mixed Precision Training in JAX using ROCm on AMD GPUs

06 September 2024

This blog builds on the nanoGPT model we discussed in A Guide to Implementing and Training Generative Pre-trained Transformers (GPT) in JAX on AMD GPUs. Here we show you how to incorporate mixed precision training into that JAX-implemented nanoGPT model.

Read more ...

Image Classification with BEiT, MobileNet, and EfficientNet using ROCm on AMD GPUs

03 September 2024

Image classification is a key task in computer vision aiming at “understanding” an entire image. The outcome of an image classifier is a label or a category for the image as a whole, unlike object recognition where the task is to detect and classify multiple objects within an image.

Read more ...

Seismic stencil codes - part 3

29 August 2024

12 Aug, 2024

Read more ...

Seismic stencil codes - part 2

29 August 2024

12 Aug, 2024

Read more ...

Seismic stencil codes - part 1

29 August 2024

12 Aug, 2024

Read more ...

Benchmarking Machine Learning using ROCm and AMD GPUs: Reproducing Our MLPerf Inference Submission

28 August 2024

Measuring the performance of new technologies is as old as human history, and often as intriguing (consider, for example, that we still compare the performance of new electric vehicle motors using horsepower). In the rapidly advancing field of machine learning (ML), MLPerf was established by MLCommons on May 2nd 2018 and quickly became the gold standard for measuring the accuracy, speed, and efficiency of AI. MLPerf provides benchmarks on training, HPC, and inference performance. Companies across the industry use MLPerf submissions to evaluate the performance of various GPUs and software platforms, and make their technology adoption decisions based on these results.

Read more ...

Performing natural language processing tasks with LLMs on ROCm running on AMD GPUs

21 August 2024

In this blog you will learn how to use ROCm, running on AMD’s Instinct GPUs, for a range of popular and useful natural language processing (NLP) tasks, using different large language models (LLMs). The blog includes a simple-to-follow, hands-on guide that shows you how to implement LLMs for core NLP applications ranging from text generation and sentiment analysis to extractive question answering (QA) and solving a math problem.

Read more ...

Using AMD GPUs for Enhanced Time Series Forecasting with Transformers

19 August 2024

Time series forecasting (TSF) is a key concept in fields such as signal processing, data science, and machine learning (ML). TSF involves predicting future behavior of a system by analyzing its past temporal patterns, using historical data to forecast future data points. Classical approaches to TSF relied on a variety of statistical methods. Recently, machine learning techniques have been increasingly used for TSF, generating discussions within the community about whether these modern approaches outperform the classical statistical ones (see: Are Transformers Effective for Time Series Forecasting? and Yes, Transformers are Effective for Time Series Forecasting (+ Autoformer)).

Read more ...

Inferencing with Grok-1 on AMD GPUs

09 August 2024

We demonstrate that the massive Grok-1 model from xAI can run seamlessly on the AMD MI300X GPU accelerator by leveraging the ROCm software platform.

Read more ...

Optimizing RoBERTa: Fine-Tuning with Mixed Precision on AMD

29 July 2024

In this blog we explore how to fine-tune the Robustly Optimized BERT Pretraining Approach (RoBERTa) large language model, with emphasis on PyTorch’s mixed precision capabilities. Specifically, we explore using AMD GPUs for mixed precision fine-tuning to achieve faster model training without any major impacts on accuracy.

Read more ...

Graph analytics on AMD GPUs using Gunrock

29 July 2024

Graphs and graph analytics are related concepts that can help us understand complex data and relationships. In this context, a graph is a mathematical model that represents entities (called nodes or vertices) and their connections (called edges or links). Graph analytics is a form of data analysis that uses graph structures and algorithms to reveal insights from the data.
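
As a small illustration of these terms, the sketch below builds a graph from an edge list and runs breadth-first search, one of the most common graph-analytics primitives. It uses plain Python rather than Gunrock, and the edge list and the function name `bfs_distances` are hypothetical.

```python
from collections import deque

# Undirected graph given as a list of edges between named nodes.
edges = [("a", "b"), ("a", "c"), ("b", "d"), ("c", "d"), ("d", "e")]

# Adjacency list: each node maps to the nodes it is connected to.
graph = {}
for u, v in edges:
    graph.setdefault(u, []).append(v)
    graph.setdefault(v, []).append(u)


def bfs_distances(graph, source):
    """Return the number of edges on a shortest path from source to each reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbor in graph[node]:
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)
    return dist


distances = bfs_distances(graph, "a")
```

Each BFS level expands a whole frontier of nodes at once, which is the kind of data parallelism that GPU graph libraries exploit.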

Read more ...

Using statistical methods to reliably compare algorithm performance in large generative AI models with JAX Profiler on AMD GPUs

22 July 2024

This blog provides a comprehensive guide on measuring and comparing the performance of various algorithms in a JAX-implemented generative AI model. Leveraging the JAX Profiler and statistical analysis, this blog demonstrates how to reliably evaluate key steps and compare algorithm performance on AMD GPUs.

Read more ...

DBRX Instruct on AMD GPUs

11 July 2024

In this blog, we showcase DBRX Instruct, a mixture-of-experts large language model developed by Databricks, on a ROCm-capable system with AMD GPUs.

Read more ...

Accelerate PyTorch Models using torch.compile on AMD GPUs with ROCm

11 July 2024

PyTorch 2.0 introduces torch.compile(), a tool to vastly accelerate PyTorch code and models. By converting PyTorch code into highly optimized kernels, torch.compile delivers substantial performance improvements with minimal changes to the existing codebase. This feature allows for precise optimization of individual functions, entire modules, and complex training loops, providing a versatile and powerful tool for enhancing computational efficiency.

Read more ...

Accelerating models on ROCm using PyTorch TunableOp

03 July 2024

In this blog, we will show how to leverage PyTorch TunableOp to accelerate models using ROCm on AMD GPUs. We will discuss the basics of General Matrix Multiplications (GEMMs), show an example of tuning a single GEMM, and finally, demonstrate real-world performance gains on an LLM (gemma) using TunableOp.

Read more ...

A Guide to Implementing and Training Generative Pre-trained Transformers (GPT) in JAX on AMD GPUs

02 July 2024

Read more ...

Mamba on AMD GPUs with ROCm

28 June 2024

Read more ...

Deep Learning Recommendation Models on AMD GPUs

28 June 2024

Read more ...

Fine-tuning and Testing Cutting-Edge Speech Models using ROCm on AMD GPUs

27 June 2024

AI Voice agents, or voice bots, are designed to communicate with people using a spoken language. Voice bots are commonly deployed in customer service and personal assistant applications, and have the potential to enter and revolutionize almost every aspect of people’s interaction with technology that can benefit from the use of voice. Automatic Speech Recognition (ASR), the technology that processes human speech into text, is essential for the creation of AI Voice agents. In this blog post we will provide you with a hands-on introduction to the deployment of three machine learning ASR models, using ROCm on AMD GPUs.

Read more ...

TensorFlow Profiler in practice: Optimizing TensorFlow models on AMD GPUs

18 June 2024

TensorFlow Profiler consists of a set of tools designed to measure resource utilization and performance during the execution of TensorFlow models. It offers insights into how a model interacts with hardware resources, including execution time and memory usage. TensorFlow Profiler helps in pinpointing performance bottlenecks, allowing us to fine-tune the execution of models for improved efficiency and faster outcomes which can be crucial in scenarios where near-real-time predictions are required.

Read more ...

Stone Ridge Expands Reservoir Simulation Options with AMD Instinct™ Accelerators

10 June 2024

Stone Ridge Technology (SRT) pioneered the use of GPUs for high performance reservoir simulation (HPC) nearly a decade ago with ECHELON, its flagship software product. ECHELON, the first of its kind, engineered from the outset to harness the full potential of massively parallel GPUs, stands apart in the industry for its power, efficiency, and accuracy. Now, ECHELON has added support for AMD Instinct accelerators into its simulation engine, offering new flexibility and optionality to its clients.

Read more ...

Segment Anything with AMD GPUs

04 June 2024

Read more ...

SmoothQuant model inference on AMD Instinct MI300X using Composable Kernel

31 May 2024

The AMD ROCm™ Composable Kernel (CK) library provides a programming model for writing performance-critical kernels for machine learning workloads. It generates a general-purpose kernel during the compilation phase through a C++ template, enabling developers to achieve operation fusions on different data precisions.

Read more ...

Unveiling performance insights with PyTorch Profiler on an AMD GPU

29 May 2024

Read more ...

Panoptic segmentation and instance segmentation with Detectron2 on AMD GPUs

23 May 2024

Read more ...

Siemens taps AMD Instinct™ GPUs to expand high-performance hardware options for Simcenter STAR-CCM+

16 May 2024

Siemens recently announced that its Simcenter STAR-CCM+ multi-physics computational fluid dynamics (CFD) software now supports AMD Instinct™ GPUs for GPU-native computation. This move addresses its users’ needs for computational efficiency, reduced simulation costs and energy usage, and greater hardware choice.

Read more ...

AMD Collaboration with the University of Michigan offers High Performance Open-Source Solutions to the Bioinformatics Community

16 May 2024

Long read DNA sequencing technology is revolutionizing genetic diagnostics and precision medicine by helping us discover structural variants and assemble whole genomes. It also helps us study evolutionary relationships. Lower sequencing costs and high-throughput portable long read sequencers are revolutionizing precision medicine today. Long read sequencers from the top manufacturers including Oxford Nanopore (ONT) and PacBio, can produce reads that are much longer than previous generations of sequencers. However, long reads vary in length and are significantly more error prone than short reads. Sequence alignment (on CPUs) is one of the main bottlenecks in long read processing workflows.

Read more ...

Accelerating Large Language Models with Flash Attention on AMD GPUs

15 May 2024

Read more ...

Reading AMD GPU ISA

13 May 2024

For an application developer it is often helpful to read the Instruction Set Architecture (ISA) for the GPU architecture that is used to perform its computations. Understanding the instructions of the pertinent code regions of interest can help in debugging and achieving performance optimization of the application.

Read more ...

AMD in Action: Unveiling the Power of Application Tracing and Profiling

07 May 2024

Rocprof is a robust tool designed to analyze and optimize the performance of HIP programs on AMD ROCm platforms, helping developers pinpoint and resolve performance bottlenecks. Rocprof provides a variety of profiling data, including performance counters, hardware traces, and runtime API/activity traces.

Read more ...

Step-by-Step Guide to Use OpenLLM on AMD GPUs

01 May 2024

OpenLLM is an open-source platform designed to facilitate the deployment and utilization of large language models (LLMs), supporting a wide range of models for diverse applications, whether in cloud environments or on-premises. In this tutorial, we will guide you through the process of starting an LLM server using OpenLLM, enabling interaction with the server from your local machine, with special emphasis on leveraging the capabilities of AMD GPUs.

Read more ...

Inferencing with Mixtral 8x22B on AMD GPUs

01 May 2024

Read more ...

Training a Neural Collaborative Filtering (NCF) Recommender on an AMD GPU

30 April 2024

Read more ...

Table Question-Answering with TaPas

26 April 2024

Read more ...

Multimodal (Visual and Language) understanding with LLaVA-NeXT

26 April 2024

Read more ...

Application portability with HIP

26 April 2024

Many scientific applications run on AMD-equipped computing platforms and supercomputers, including Frontier, the first Exascale system in the world. These applications, coming from a myriad of science domains, were ported to run on AMD GPUs using the Heterogeneous-compute Interface for Portability (HIP) abstraction layer. HIP enables these High-Performance Computing (HPC) facilities to transition their CUDA codes to run and take advantage of the latest AMD GPUs. The effort involved in porting these scientific applications varies from a few hours to a few weeks and largely depends on the complexity of the original source code. Figure 1 shows several examples of applications that have been ported and the corresponding porting effort.

Read more ...

Unlocking Vision-Text Dual-Encoding: Multi-GPU Training of a CLIP-Like Model

24 April 2024

Read more ...

Transforming Words into Motion: A Guide to Video Generation with AMD GPU

24 April 2024

Read more ...

C++17 parallel algorithms and HIPSTDPAR

18 April 2024

The C++17 standard added the concept of parallel algorithms to the pre-existing C++ Standard Library. The parallel versions of algorithms like std::transform maintain the same signature as the regular serial versions, except for the addition of an extra parameter specifying the execution policy to use. This flexibility allows users who are already using the C++ Standard Library algorithms to take advantage of multi-core architectures by introducing only minimal changes to their code.

Read more ...

Inferencing with AI2’s OLMo model on AMD GPU

17 April 2024

In this blog, we will show you how to generate text using AI2’s OLMo model on an AMD GPU.

Read more ...

Text Summarization with FLAN-T5

16 April 2024

In this blog, we showcase the language model FLAN-T5 and how to fine-tune it on a summarization task with HuggingFace on a system with AMD GPUs and ROCm.

Read more ...

Speech-to-Text on an AMD GPU with Whisper

16 April 2024

Read more ...

PyTorch C++ Extension on AMD GPU

16 April 2024

Read more ...

Programming AMD GPUs with Julia

16 April 2024

Julia is a high-level, general-purpose dynamic programming language that automatically compiles to efficient native code via LLVM, and supports multiple platforms. With LLVM comes support for programming GPUs, including AMD GPUs.

Read more ...

Program Synthesis with CodeGen

16 April 2024

CodeGen is a family of standard transformer-based auto-regressive language models for program synthesis, which the authors define as a method for generating computer programs that solve specified problems, using input-output examples or natural language descriptions.

Read more ...

Interacting with Contrastive Language-Image Pre-Training (CLIP) model on AMD GPU

16 April 2024

Read more ...

Instruction fine-tuning of StarCoder with PEFT on multiple AMD GPUs

16 April 2024

Read more ...

Affinity part 2 - System topology and controlling affinity

16 April 2024

In Part 1 of the Affinity blog series, we looked at the importance of setting affinity for High Performance Computing (HPC) workloads. In this blog post, our goals are the following:

Read more ...

Affinity part 1 - Affinity, placement, and order

16 April 2024

Modern hardware architectures are increasingly complex with multiple sockets, many cores in each Central Processing Unit (CPU), Graphical Processing Units (GPUs), memory controllers, Network Interface Cards (NICs), etc. Peripherals such as GPUs or memory controllers will often be local to a CPU socket. Such designs present interesting challenges in optimizing memory access times, data transfer times, etc. Depending on how the system is built, hardware components are connected, and the workload being run, it may be advantageous to use the resources of the system in a specific way. In this article, we will discuss the role of affinity, placement, and order in improving performance for High Performance Computing (HPC) workloads. A short case study is also presented to familiarize you with performance considerations on a node in the Frontier supercomputer. In a follow-up article, we also aim to equip you with the tools you need to understand your system’s hardware topology and set up affinity for your application accordingly.
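
As a minimal sketch of what setting affinity means in practice, the snippet below inspects and restricts the set of CPU cores the current process may run on. It uses Python's `os.sched_getaffinity` and `os.sched_setaffinity`, which are available on Linux only; the helper names `allowed_cpus` and `pin_to` are hypothetical, and this is not the tooling used in the post.

```python
import os


def allowed_cpus():
    """Return the set of CPU ids this process may run on."""
    if hasattr(os, "sched_getaffinity"):
        return set(os.sched_getaffinity(0))
    # Fallback for platforms without the Linux affinity API.
    return set(range(os.cpu_count() or 1))


def pin_to(cpus):
    """Restrict the current process to the given CPU ids; return True if applied."""
    if not hasattr(os, "sched_setaffinity"):
        return False
    try:
        os.sched_setaffinity(0, cpus)
    except OSError:
        return False
    return True


before = allowed_cpus()
target = {min(before)}   # pin to the lowest-numbered allowed core
pinned = pin_to(target)
after = allowed_cpus()
if pinned:
    pin_to(before)       # restore the original affinity mask
```

On a multi-socket node, the cores chosen for `target` would normally be those closest to the GPU or NIC the process uses, which is the placement question the article explores.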

Read more ...

Enhancing LLM Accessibility: A Deep Dive into QLoRA Through Fine-tuning Llama Model on a single AMD GPU

15 April 2024

Read more ...

Enhancing LLM Accessibility: A Deep Dive into QLoRA Through Fine-tuning Llama 2 on a single AMD GPU

15 April 2024

Read more ...

Developing Triton Kernels on AMD GPUs

15 April 2024

19 March 2025

OpenAI has developed a powerful GPU-focused programming language and compiler called Triton that works seamlessly with AMD GPUs. The goal of Triton is to enable AI engineers and scientists to write high-performance GPU code with minimal expertise. Triton kernels are performant because of their blocked program representation, which allows them to be compiled into highly optimized binary code. Triton also leverages Python for kernel development, making it both familiar and accessible. Kernels are compiled simply by placing the triton.jit Python decorator before the kernel definition.

Read more ...

GPU Unleashed: Training Reinforcement Learning Agents with Stable Baselines3 on an AMD GPU in Gymnasium Environment

11 April 2024

Read more ...

ResNet for image classification using AMD GPUs

09 April 2024

Read more ...

Small language models with Phi-2

08 April 2024

Like many other LLMs, Phi-2 is a transformer-based model with a next-word prediction objective that is trained on billions of tokens. At 2.7 billion parameters, Phi-2 is a relatively small language model, but it achieves outstanding performance on a variety of tasks, including common sense reasoning, language understanding, math, and coding. For reference, GPT 3.5 has 175 billion parameters and the smallest version of LLaMA-2 has 7 billion parameters. According to Microsoft, Phi-2 is capable of matching or outperforming models up to 25 times larger due to more carefully curated training data and model scaling.

Read more ...

Using the ChatGLM-6B bilingual language model with AMD GPUs

04 April 2024

ChatGLM-6B is an open bilingual (Chinese-English) language model with 6.2 billion parameters. It’s optimized for Chinese conversation and based on the General Language Model (GLM) architecture. GLM is a pretraining framework that seeks to combine the strengths of autoencoder models (like BERT) and autoregressive models (like GPT). The GLM framework randomly blanks out continuous spans of tokens from the input text (autoencoding methodology) and trains the model to sequentially reconstruct the spans (autoregressive pretraining methodology).

Read more ...

Total body segmentation using MONAI Deploy on an AMD GPU

04 April 2024

Read more ...

Retrieval Augmented Generation (RAG) using LlamaIndex

04 April 2024

Read more ...

Image classification using Vision Transformer with AMD GPUs

04 April 2024

Read more ...

Building semantic search with SentenceTransformers on AMD

04 April 2024

Read more ...

Scale AI applications with Ray

01 April 2024

1 Apr, 2024 by Logan Grado, Eliot Li.

Read more ...

Automatic mixed precision in PyTorch using AMD GPUs

29 March 2024

As models increase in size, the time and memory needed to train them, and consequently the cost, also increase. Therefore, any measures we take to reduce training time and memory usage can be highly beneficial. This is where Automatic Mixed Precision (AMP) comes in.

Read more ...

Large language model inference optimizations on AMD GPUs

15 March 2024

Read more ...

Building a decoder transformer model on AMD GPU(s)

12 March 2024

Read more ...

Question-answering Chatbot with LangChain on an AMD GPU

11 March 2024

Read more ...

Music Generation With MusicGen on an AMD GPU

08 March 2024

MusicGen is an autoregressive, transformer-based model that predicts the next segment of a piece of music based on previous segments. This is a similar approach to language models predicting the next token.

Read more ...

Efficient image generation with Stable Diffusion models and ONNX Runtime using AMD GPUs

23 February 2024

Read more ...

Simplifying deep learning: A guide to PyTorch Lightning

08 February 2024

Read more ...

Two-dimensional images to three-dimensional scene mapping using NeRF on an AMD GPU

07 February 2024

Read more ...

Using LoRA for efficient fine-tuning: Fundamental principles

05 February 2024

Read more ...

Fine-tune Llama model with LoRA: Customizing a large language model for question-answering

01 February 2024

Read more ...

Fine-tune Llama 2 with LoRA: Customizing a large language model for question-answering

01 February 2024

Read more ...

Pre-training BERT using Hugging Face & TensorFlow on an AMD GPU

29 January 2024

Read more ...

Pre-training BERT using Hugging Face & PyTorch on an AMD GPU

26 January 2024

Read more ...

Accelerating XGBoost with Dask using multiple AMD GPUs

26 January 2024

Read more ...

LLM distributed supervised fine-tuning with JAX

25 January 2024

Read more ...

Pre-training a large language model with Megatron-DeepSpeed on multiple AMD GPUs

24 January 2024

Read more ...

Efficient image generation with Stable Diffusion models and AITemplate using AMD GPUs

24 January 2024

Read more ...

Efficient deployment of large language models with Text Generation Inference on AMD GPUs

24 January 2024

Read more ...

Sparse matrix vector multiplication - part 1

03 November 2023

Read more ...

Jacobi Solver with HIP and OpenMP offloading

15 September 2023

Read more ...

Creating a PyTorch/TensorFlow code environment on AMD GPUs

11 September 2023

Goal: The machine learning ecosystem is growing rapidly, and we aim to make porting to AMD GPUs simple with this series of machine learning blog posts.

Read more ...

Finite difference method - Laplacian part 4

18 July 2023

Read more ...

GPU-aware MPI with ROCm

08 June 2023

MPI is the de facto standard for inter-process communication in High-Performance Computing. MPI processes compute on their local data while extensively communicating with each other. This enables MPI programs to be executed on systems with a distributed memory space, e.g., clusters. MPI supports different types of communication, including point-to-point and collective communications. Point-to-point communication is the basic communication mechanism in which both the sending process and the receiving process take part in the communication. The sender has a buffer that holds the message and an envelope containing information that will be used by the receiver side (e.g., message tag, the sender rank number, etc.). The receiver uses the information in the envelope to select the specified message and stores it in its receive buffer. In collective communication, messages can be exchanged among a group of processes rather than just two of them. Collective communication provides opportunities for processes to perform one-to-many and many-to-many communications in a convenient, portable and optimized way. Some examples of collective communications include broadcast, allgather, alltoall, and allreduce.

Read more ...

Register pressure in AMD CDNA™2 GPUs

17 May 2023

Register pressure in GPU kernels has a tremendous impact on the overall performance of your HPC application. Understanding and controlling register usage allows developers to carefully design codes capable of maximizing hardware resources. The following blog post is focused on a practical demo showing how to apply the recommendations explained in this OLCF training talk presented on August 23rd 2022. Here is the training archive where you can also find the slides. We focus solely on the AMD CDNA™2 architecture (MI200 series GPUs) using ROCm 5.4.

Read more ...

Finite difference method - Laplacian part 3

11 May 2023

Read more ...

Introduction to profiling tools for AMD hardware

12 April 2023

Getting a code to be functionally correct is not always enough. In many industries, it is also required that applications and their complex software stack run as efficiently as possible to meet operational demands. This is particularly challenging as hardware continues to evolve over time, and as a result codes may require further tuning. In practice, many application developers construct benchmarks, which are carefully designed to measure the performance, such as execution time, of a particular code within an operational-like setting. In other words: a good benchmark should be representative of the real work that needs to be done. These benchmarks are useful in that they provide insight into the characteristics of the application and enable one to discover potential bottlenecks that could result in performance degradation during operational settings.

Read more ...

AMD Instinct™ MI200 GPU memory space overview

09 March 2023

The HIP API supports a wide variety of allocation methods for host and device memory on accelerated systems. In this post, we will:

Read more ...

AMD ROCm™ installation

26 January 2023

AMD ROCm™ is the first open-source software development platform for HPC/Hyperscale-class GPU computing. AMD ROCm™ brings the UNIX philosophy of choice, minimalism and modular software development to GPU computing. Please see the AMD Open Software Platform for GPU Compute and ROCm Informational Portal pages for more information.

Read more ...

Finite difference method - Laplacian part 2

04 January 2023

Read more ...

Finite difference method - Laplacian part 1

14 November 2022

Read more ...

AMD matrix cores

14 November 2022

Matrix multiplication is a fundamental aspect of linear algebra and a ubiquitous computation within High Performance Computing (HPC) applications. Since the introduction of AMD’s CDNA Architecture, Generalized Matrix Multiplication (GEMM) computations are now hardware-accelerated through Matrix Core Processing Units. Matrix Core accelerated GEMM kernels lie at the heart of BLAS libraries like rocBLAS, but they can also be programmed directly by developers. Applications that are throughput bound by GEMM computation can achieve additional speedups by utilizing Matrix Cores.

Read more ...