Posted in 2025

Optimized ROCm Docker for Distributed AI Training

This blog will introduce you to the updated AMD Docker image, built and optimized specifically for distributed training. As you will see, the optimized AMD ROCm Docker image makes training large AI models faster and more efficient. It includes updates such as better fine-tuning tools, improved performance for multi-GPU setups, and support for FP8 precision, which speeds up training while using less memory. Together, these updates provide a smoother and more efficient training experience on popular models such as Flux and Llama 3.1 running on AMD GPUs.

Read more ...


AI Inference Orchestration with Kubernetes on Instinct MI300X, Part 3

Welcome back to the final part of our series! So far, we’ve successfully set up a Kubernetes cluster and installed the AMD GPU Operator to seamlessly integrate AMD hardware with Kubernetes in Part 1. We’ve deployed vLLM on AMD Instinct MI300X GPUs, exposed it using MetalLB, and scaled it efficiently in Part 2.

Read more ...


AMD Advances Enterprise AI Through OPEA Integration

AMD is excited to support Open Platform for Enterprise AI (OPEA) to simplify and accelerate enterprise AI adoption. With the enablement of the OPEA GenAI framework on the AMD ROCm™ software stack, businesses and developers can now create scalable, efficient GenAI applications on AMD data center GPUs. Enterprises today face significant challenges when deploying AI at scale, including the complexity of integrating GenAI models, managing GPU resources, ensuring security, and maintaining workflow flexibility. AMD and OPEA aim to address these challenges and streamline AI adoption. This blog will explore the significance of this collaboration and AMD’s contribution to the OPEA project, and demonstrate how to deploy a code translation OPEA GenAI use case on the AMD Instinct™ MI300X GPU.

Read more ...


Instella-VL-1B: First AMD Vision Language Model

As part of AMD’s newly released Instella family, we are thrilled to introduce Instella-VL-1B, the first AMD vision language model for image understanding trained on AMD Instinct™ MI300X GPUs. Our journey with Instella-VL builds upon our previous 1-billion-parameter language model, AMD OLMo SFT. We extend the language model with visual understanding abilities by connecting it to a vision encoder (initialized from CLIP ViT-L/14-336). During training, we jointly finetune the vision encoder and the language model with vision-language data in three stages: Alignment, Pretraining, and Supervised Fine-tuning (SFT).
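
To make the encoder-to-LM connection concrete, below is a minimal, hypothetical sketch of a projector module that maps vision-encoder patch features into a language model's embedding space. The class name, hidden sizes, and MLP layout are illustrative assumptions, not the actual Instella-VL-1B implementation; CLIP ViT-L/14 at 336px yields 24×24 = 576 patch tokens with a 1024-dimensional hidden size.

```python
import torch
import torch.nn as nn

class VisionLanguageConnector(nn.Module):
    """Illustrative projector: maps vision-encoder patch features into the
    language model's embedding space (dimensions are assumptions)."""
    def __init__(self, vision_dim: int = 1024, text_dim: int = 2048):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, text_dim),
            nn.GELU(),
            nn.Linear(text_dim, text_dim),
        )

    def forward(self, patch_features: torch.Tensor) -> torch.Tensor:
        # patch_features: [batch, num_patches, vision_dim] from a CLIP ViT-L/14-336 encoder
        return self.proj(patch_features)  # [batch, num_patches, text_dim]

# 336px images with 14px patches -> 576 patch tokens per image
connector = VisionLanguageConnector()
dummy_patches = torch.randn(2, 576, 1024)
print(connector(dummy_patches).shape)  # torch.Size([2, 576, 2048])
```

The projected patch tokens are then concatenated with the text token embeddings before being fed to the language model, which is what the joint finetuning stages above optimize end to end.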

Read more ...


Introducing Instella: New State-of-the-art Fully Open 3B Language Models

AMD is excited to announce Instella, a family of fully open state-of-the-art 3-billion-parameter language models (LMs) trained from scratch on AMD Instinct™ MI300X GPUs. Instella models outperform existing fully open models of similar sizes and achieve competitive performance compared to state-of-the-art open-weight models such as Llama-3.2-3B, Gemma-2-2B, and Qwen-2.5-3B, including their instruction-tuned counterparts.

Read more ...


Understanding RCCL Bandwidth and xGMI Performance on AMD Instinct™ MI300X

Efficient inter-GPU communication is the backbone of high-performance AI and HPC workloads, where technologies like RCCL and xGMI play pivotal roles. However, limitations in achieving theoretical peak bandwidth have raised questions about performance bottlenecks. In this blog we explain why the theoretical maximum bandwidth is hard to achieve in multi-GPU clusters, and walk you through a set of diagnostics and performance-tuning strategies that will help you optimize RCCL and xGMI bandwidth on AMD MI300X systems. We first introduce xGMI and its performance constraints, then RCCL and its bandwidth limitations, and finally cover several practical benchmarks and best practices for maximizing RCCL efficiency.
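
As a taste of the kind of measurement the post walks through, here is a minimal sketch that times an all-reduce over RCCL (PyTorch's NCCL backend maps to RCCL on ROCm builds) and converts the latency into the standard bus-bandwidth figure. The tensor size, iteration counts, and launch command are illustrative choices, not the post's exact benchmark.

```python
# Launch with, e.g.: torchrun --nproc_per_node=8 allreduce_bw.py
import time
import torch
import torch.distributed as dist

dist.init_process_group(backend="nccl")  # maps to RCCL on ROCm builds of PyTorch
rank = dist.get_rank()
world_size = dist.get_world_size()
torch.cuda.set_device(rank)

numel = 256 * 1024 * 1024  # 1 GiB of float32
x = torch.ones(numel, dtype=torch.float32, device="cuda")

for _ in range(5):          # warm-up
    dist.all_reduce(x)
torch.cuda.synchronize()

iters = 20
start = time.time()
for _ in range(iters):
    dist.all_reduce(x)
torch.cuda.synchronize()
elapsed = (time.time() - start) / iters

# Standard ring all-reduce bus-bandwidth formula: bytes * 2*(n-1)/n / time
size_bytes = numel * 4
bus_bw = size_bytes * 2 * (world_size - 1) / world_size / elapsed / 1e9
if rank == 0:
    print(f"avg latency {elapsed * 1e3:.2f} ms, bus bandwidth {bus_bw:.1f} GB/s")
```

Comparing the reported bus bandwidth against the xGMI link limits discussed in the post is a quick way to spot misconfigured topologies.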

Read more ...


Measuring Max-Achievable FLOPs – Part 2

In our previous blog post, we explored the conceptual differences between Peak FLOPs and Max-Achievable FLOPs (MAF), explaining why the gap between these metrics has widened with modern ML-optimized hardware. This second installment provides a detailed methodology for measuring MAF on AMD GPUs, including the specific environmental conditions, matrix size optimization techniques, and tools required for accurate measurement. We present the actual MAF results for AMD Instinct MI300X and MI325X GPUs across different precision formats (FP16, BF16, and FP8) along with their corresponding median frequencies. We also explain how software efficiency and frequency management impact MAF, and demonstrate why boost clock capabilities remain important for latency-sensitive workloads such as LLM inference with small batch sizes.
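
For intuition on how such a measurement works, here is a minimal sketch that times a large bf16 matmul on a single GPU and converts it to achieved TFLOPS. The matrix shape and iteration counts are illustrative assumptions; the post's actual methodology (thermal preconditioning, matrix-size sweeps, and median-frequency reporting) is more involved.

```python
import time
import torch

M = N = K = 8192
a = torch.randn(M, K, dtype=torch.bfloat16, device="cuda")
b = torch.randn(K, N, dtype=torch.bfloat16, device="cuda")

for _ in range(10):            # warm-up
    torch.matmul(a, b)
torch.cuda.synchronize()

iters = 100
start = time.time()
for _ in range(iters):
    torch.matmul(a, b)
torch.cuda.synchronize()
elapsed = (time.time() - start) / iters

tflops = 2 * M * N * K / elapsed / 1e12   # 2*M*N*K FLOPs per GEMM
print(f"Achieved ~{tflops:.0f} TFLOPS at bf16 for {M}x{N}x{K}")
```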

Read more ...


Deploying Serverless AI Inference on AMD GPU Clusters

Deploying Large Language Models (LLMs) in enterprise environments presents a multitude of challenges that organizations must navigate to harness their full potential. As enterprises expand their AI and HPC workloads, scaling the underlying compute and GPU infrastructure introduces further challenges, including deployment complexity, resource optimization, and effective management of the compute fleet. In this blog, we will walk you through how to spin up a production-grade serverless AI inference service on Kubernetes clusters by leveraging the open-source Knative/KServe technologies.
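
To illustrate what the end result looks like from a client's point of view, here is a minimal sketch of calling a deployed KServe InferenceService over its v1 REST protocol. The hostname, service name, and payload shape are assumptions for illustration only; the exact request format depends on the serving runtime deployed in the blog.

```python
import requests

host = "llm-inference.default.example.com"   # hypothetical Knative route
url = f"http://{host}/v1/models/llm-inference:predict"
payload = {"instances": [{"prompt": "What is serverless inference?"}]}

resp = requests.post(url, json=payload, headers={"Host": host}, timeout=60)
resp.raise_for_status()
print(resp.json())
```

Because Knative scales the backing pods to zero when idle, the first request after a quiet period may take noticeably longer while a pod is cold-started.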

Read more ...


Unlock DeepSeek-R1 Inference Performance on AMD Instinct™ MI300X GPU

In this blog, we explore how DeepSeek-R1 achieves competitive performance on AMD Instinct™ MI300X GPUs, along with performance comparisons to H200 and a short demo application showcasing real-world usage. By leveraging MI300X, users can deploy DeepSeek-R1 and V3 models on a single node with impressive efficiency. In just two weeks, optimizations using SGLang have unlocked up to a 4X boost in inference speed, ensuring efficient scaling, lower latency, and optimized throughput. The MI300X’s high-bandwidth memory (HBM) and compute power enable execution of complex AI workloads, handling longer sequences and demanding reasoning tasks. With AMD and the SGLang community driving ongoing optimizations—including fused MoE kernels, MLA kernel fusion, and speculative decoding—MI300X is set to deliver an even more powerful AI inference experience.
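
Once an SGLang server is serving DeepSeek-R1 on MI300X, it exposes an OpenAI-compatible API, so querying it takes only a few lines. The port and model identifier below are assumptions for illustration; adjust them to match your deployment.

```python
from openai import OpenAI

# Point the OpenAI client at a locally running SGLang server (port is an assumption)
client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")
response = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-R1",
    messages=[{"role": "user", "content": "Explain speculative decoding in one paragraph."}],
    max_tokens=256,
)
print(response.choices[0].message.content)
```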

Read more ...


How to Build a vLLM Container for Inference and Benchmarking

Welcome back! If you’ve been following along with this series, you’ve already learned about the basics of ROCm containers. Today, we’ll build on that foundation by creating a container for large language model inference with vLLM.
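
As a quick sanity check you can run inside the finished container, the short sketch below uses vLLM's offline API to confirm the GPU is visible and text generation works. The model choice is an illustrative assumption, not a requirement of the container build.

```python
from vllm import LLM, SamplingParams

# Load a model and generate a short completion to verify the container works end to end
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")
params = SamplingParams(temperature=0.8, max_tokens=64)
outputs = llm.generate(["The key benefit of ROCm containers is"], params)
print(outputs[0].outputs[0].text)
```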

Read more ...


Fine-tuning Phi-3.5-mini LLM at scale: Harnessing Accelerate and Slurm for multinode training

In this blog you will learn how to fine-tune the Phi-3.5-mini-instruct Large Language Model (LLM) from Microsoft using PyTorch in a multinode environment. The setup leverages the Hugging Face Accelerate library to handle the complexities of multi-GPU and multinode synchronization. Slurm is used to schedule and coordinate the job as a workload manager for high-performance computing environments. A custom Slurm Bash script launches the Docker containers on each node, ensuring the training environment is consistent across all machines. Inside the containers, PyTorch and the Accelerate library split the training data, synchronize the model updates, and optimize performance across the multinode setup. This approach lets you efficiently fine-tune large-scale models and reduce training time while maximizing hardware utilization across the entire cluster.
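
The skeleton below shows the shape of an Accelerate-driven training step like the one described above. The dataset, batch contents, and hyperparameters are placeholders rather than the post's exact script; Accelerate's `prepare()` call is what transparently wraps the model, optimizer, and dataloader for multi-GPU and multinode execution.

```python
import torch
from accelerate import Accelerator
from torch.utils.data import DataLoader
from transformers import AutoModelForCausalLM, AutoTokenizer

accelerator = Accelerator()
tokenizer = AutoTokenizer.from_pretrained("microsoft/Phi-3.5-mini-instruct")
model = AutoModelForCausalLM.from_pretrained("microsoft/Phi-3.5-mini-instruct")
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

# Hypothetical tokenized dataset; batches should contain input_ids, attention_mask, labels
train_loader = DataLoader(my_tokenized_dataset, batch_size=2, shuffle=True)

# prepare() handles device placement, sharding, and gradient synchronization
model, optimizer, train_loader = accelerator.prepare(model, optimizer, train_loader)

model.train()
for batch in train_loader:
    outputs = model(**batch)
    accelerator.backward(outputs.loss)
    optimizer.step()
    optimizer.zero_grad()
```

When launched via `accelerate launch` (or through the Slurm script described in the post), the same code runs unchanged on one GPU, one node, or many nodes.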

Read more ...


Understanding Peak, Max-Achievable & Delivered FLOPs, Part 1

The purpose of this blog post is to explain the differences between Peak FLOPs and Max-Achievable FLOPs. After reading, you will know how AMD measures maximum delivered performance and how AMD recommends using measured device performance.

Read more ...


AI Inference Orchestration with Kubernetes on Instinct MI300X, Part 2

Welcome to Part 2 of our series on utilizing Kubernetes with the AMD Instinct platform! If you’re just joining us, we recommend checking out Part 1 where we covered setting up your Kubernetes cluster and enabling AMD GPU support.

Read more ...


Navigating vLLM Inference with ROCm and Kubernetes

Kubernetes (often abbreviated as K8s) is an open-source platform designed for automating the deployment, scaling, and management of containerized applications. Developed by Google and now maintained by the Cloud Native Computing Foundation, Kubernetes enables developers to build, run, and manage applications across any infrastructure.

Read more ...


PyTorch Fully Sharded Data Parallel (FSDP) on AMD GPUs with ROCm

PyTorch Fully Sharded Data Parallel (FSDP) is a data parallelism technique that enables the training of large-scale models in a memory-efficient manner. FSDP achieves this memory efficiency by sharding model parameters, optimizer states, and/or gradients across GPUs, reducing the memory footprint required by each GPU. This enables the training of large-scale models with lower total GPU memory than DDP (Distributed Data Parallel), in which the model weights and optimizer states are replicated across all processes. To learn more about DDP, refer to Distributed Data Parallel (DDP) training on AMD GPU with ROCm.
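
For a concrete picture of the API, here is a minimal sketch of wrapping a toy model in FSDP; the model itself is illustrative and not from the linked blog. Launched with `torchrun` across several GPUs, parameters, gradients, and optimizer state are sharded across the ranks instead of being replicated as in DDP.

```python
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

dist.init_process_group(backend="nccl")   # RCCL on ROCm builds of PyTorch
torch.cuda.set_device(dist.get_rank())

model = torch.nn.Sequential(
    torch.nn.Linear(4096, 4096),
    torch.nn.ReLU(),
    torch.nn.Linear(4096, 4096),
).cuda()

# Wrapping in FSDP shards parameters, gradients, and optimizer state across ranks
fsdp_model = FSDP(model)
optimizer = torch.optim.AdamW(fsdp_model.parameters(), lr=1e-4)

x = torch.randn(8, 4096, device="cuda")
loss = fsdp_model(x).sum()
loss.backward()
optimizer.step()
```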

Read more ...


MI300A - Exploring the APU advantage

This blog post will introduce you to the advantages of the AMD Instinct™ MI300A accelerated processing unit (APU), discussing its hardware architecture and how to leverage its GPU programming capabilities.

Read more ...


Deep dive into the MI300 compute and memory partition modes

This blog introduces the inner compute and memory architecture of the AMD Instinct™ MI300, showing you how to use the MI300 GPU’s different partition modes to supercharge performance-critical applications. You will first get a brief introduction to the MI300 architecture, explaining how the MI300 compute and memory partitions can be used to your advantage. You will then learn in detail about the compute partitioning modes and the memory partitioning modes. Two case studies then demonstrate and benchmark the performance of the different modes. For convenience, this blog uses the MI300X as a case-in-point example.

Read more ...


AI Inference Orchestration with Kubernetes on Instinct MI300X, Part 1

As organizations scale their AI inference workloads, they face the challenge of efficiently deploying and managing large language models across GPU infrastructure. This three-part blog series provides a production-ready foundation for orchestrating AI inference workloads on the AMD Instinct platform with Kubernetes.

Read more ...


GEMM Kernel Optimization For AMD GPUs

Matrix multiplication underlies critical computational pathways in AI, with General Matrix Multiplication (GEMM) operations serving as performance-critical kernels in neural network architectures. From fully connected layers to convolutions and transformer attention mechanisms, GEMMs consume substantial computational and memory resources in large language models (LLMs). This blog explores GEMM optimization techniques for AMD GPUs, demonstrating methodologies to significantly enhance computational efficiency and performance scaling.
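
Before diving into kernel-level tuning, it helps to keep the basic GEMM accounting in mind: a GEMM computing C[M,N] = A[M,K] @ B[K,N] performs 2·M·N·K floating-point operations. The short sketch below works that out, together with the arithmetic intensity, for an illustrative transformer-style projection; the shapes are assumptions, not figures from the post.

```python
# Back-of-the-envelope GEMM accounting (shapes are illustrative)
M, N, K = 4096, 4096, 4096          # C[M,N] = A[M,K] @ B[K,N]
bytes_per_elem = 2                   # bf16/fp16

flops = 2 * M * N * K                                        # multiply-accumulate count
bytes_moved = bytes_per_elem * (M * K + K * N + M * N)       # read A, B; write C
intensity = flops / bytes_moved                              # FLOPs per byte

print(f"{flops / 1e9:.1f} GFLOPs, {bytes_moved / 1e6:.1f} MB moved, "
      f"arithmetic intensity ~ {intensity:.0f} FLOP/byte")
```

Shapes with low arithmetic intensity are memory-bandwidth bound, which is why the tiling, data-layout, and precision choices discussed in the post matter so much.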

Read more ...


Enhancing AI Training with AMD ROCm Software

ROCm™ has emerged as a premier open software stack designed to address the evolving needs of AI and machine learning workloads. Built for inference and training, ROCm delivers leadership performance, empowering developers and organizations to optimize their workloads for efficiency, scalability, and cost-effectiveness.

Read more ...


Best practices for competitive inference optimization on AMD Instinct™ MI300X GPUs

Optimizing LLM performance on GPUs is challenging due to diverse model needs, memory constraints, and the need to balance latency and throughput. This document examines how hardware utilization, memory and communication bandwidth, and scaling contribute to inference performance, detailing optimal configurations for AMD Instinct™ MI300X GPUs.

Read more ...


Announcing the AMD GPU Operator and Metrics Exporter

As AI workloads continue to grow in complexity and scale, we’ve consistently heard one thing from our customers: “Managing GPU infrastructure shouldn’t be the hard part”. For many, this is where Kubernetes comes into play. Kubernetes allows customers to easily manage and deploy their AI workloads at scale by providing a robust platform for automating deployment, scaling, and operations of application containers across clusters of hosts. It ensures that your applications run consistently and reliably, regardless of the underlying infrastructure. A pod is the smallest and simplest Kubernetes object. It represents a single instance of a running process in your cluster and can contain one or more containers. Pods are used to host your application workloads and are managed by Kubernetes to ensure they run as expected. Enabling pods to leverage GPUs on your cluster, however, is not trivial.
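
Once the GPU Operator is installed, nodes advertise an `amd.com/gpu` resource that pods can request like CPU or memory. As a minimal sketch, the snippet below uses the Kubernetes Python client to list how many AMD GPUs each node exposes; the kubeconfig context is assumed to point at your cluster.

```python
from kubernetes import client, config

config.load_kube_config()          # uses your current kubeconfig context
v1 = client.CoreV1Api()
for node in v1.list_node().items:
    gpus = node.status.allocatable.get("amd.com/gpu", "0")
    print(f"{node.metadata.name}: {gpus} allocatable AMD GPU(s)")
```

A pod then simply requests `amd.com/gpu` in its resource limits, and the scheduler places it on a node with free GPUs.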

Read more ...


Distributed fine-tuning of MPT-30B using Composer on AMD GPUs

Composer, developed by MosaicML, is an open-source deep learning training library built on top of PyTorch, designed to simplify and optimize distributed training workflows. It supports scalable training on multiple nodes and efficiently handles datasets of various sizes. Composer integrates advanced techniques such as PyTorch Fully Sharded Data Parallelism (FSDP), elastic sharded checkpointing, training callbacks, and speed-up algorithms to enhance training performance and flexibility. It closely resembles PyTorch’s torchrun and has demonstrated exceptional efficiency when scaling to hundreds of GPUs.
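
To show the general shape of a Composer entry point, here is a hedged sketch using Composer's `Trainer` with a Hugging Face model wrapper. The dataloader is a placeholder, and actually fine-tuning MPT-30B requires the multi-node FSDP configuration described in the post; this is a minimal single-process illustration, not the post's full script.

```python
from composer import Trainer
from composer.models import HuggingFaceModel
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("mosaicml/mpt-30b")
hf_model = AutoModelForCausalLM.from_pretrained("mosaicml/mpt-30b", trust_remote_code=True)
model = HuggingFaceModel(hf_model, tokenizer=tokenizer)

trainer = Trainer(
    model=model,
    train_dataloader=train_dataloader,   # hypothetical tokenized dataloader
    max_duration="1ep",                  # train for one epoch
    device="gpu",
)
trainer.fit()
```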

Read more ...


Vision Mamba on AMD GPU with ROCm

State Space Models (SSMs), such as Mamba, have emerged as a potential alternative to Transformer models. Vision backbones using only SSMs have yielded promising results. For more information about SSMs and Mamba’s performance on AMD hardware, see Mamba on AMD GPUs with ROCm. This blog explores Vision Mamba (Vim), an innovative and efficient backbone for vision tasks, and evaluates its performance on AMD GPUs with ROCm. We’ll start with a brief introduction to Vision Mamba, followed by a step-by-step guide on training and running inference with Vision Mamba on AMD GPUs using ROCm.

Read more ...


Getting started with AMD ROCm containers: from base images to custom solutions

Having worked in technology for over two decades, I’ve witnessed firsthand how containerization has transformed the way we develop and deploy applications. Containers package applications with their dependencies into standardized units, making software portable and consistent across different environments. When we combine this containerization power with AMD Instinct™ Accelerators, we get a powerful solution for quickly deploying AI and machine learning workloads. In this blog, the first in a series exploring ROCm containerization, I want to share my insights about AMD ROCm™ containers and show you how to build and customize your own GPU-accelerated workloads. You’ll learn how to select appropriate base images, modify containers for your specific needs, and implement best practices for GPU-enabled containerization - all with hands-on examples you can try yourself.

Read more ...


Boosting Computational Fluid Dynamics Performance with AMD Instinct™ MI300X

This blog will guide you, step by step, through the process of installing and running benchmarks with Ansys Fluent on the AMD MI300X. We start with an overview of the Ansys Fluent CFD application and then show you how to set up an AMD MI300X system to run benchmarks. The benchmark results demonstrate the dramatic impact the MI300X has on speeding up simulations, improving design efficiency, and reducing costs in the automotive, aerospace, and environmental engineering industries.

Read more ...


Triton Inference Server with vLLM on AMD GPUs

Triton Inference Server is an open-source platform designed to streamline AI inferencing. It supports the deployment, scaling, and inference of trained AI models from various machine learning and deep learning frameworks, including TensorFlow, PyTorch, and vLLM, making it adaptable for diverse AI workloads. It is designed to work across multiple environments, including cloud, data center, and edge devices.
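
Once a Triton server with the vLLM backend is running, the Triton HTTP client can confirm that the server and model are ready before you send inference requests. The model name "vllm_model" below is an illustrative assumption; use whatever name your model repository defines.

```python
import tritonclient.http as httpclient

# Connect to Triton's default HTTP endpoint and check server/model readiness
client = httpclient.InferenceServerClient(url="localhost:8000")
print("server live:", client.is_server_live())
print("model ready:", client.is_model_ready("vllm_model"))
```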

Read more ...