This blog introduces you to the updated AMD ROCm Docker image, built and optimized specifically for distributed training. As you will see, the optimized image makes training large AI models faster and more efficient. It includes better fine-tuning tools, improved performance for multi-GPU setups, and support for FP8 precision, which speeds up training while using less memory, providing a smoother and more efficient training experience on popular models such as Flux and Llama 3.1 running on AMD GPUs.
Efficient inter-GPU communication is the backbone of high-performance AI
and HPC workloads, where technologies like RCCL and xGMI play pivotal
roles. However, some limitations in achieving theoretical peak bandwidth
have raised questions about performance bottlenecks. In this blog we explain why
multi-GPU clusters rarely reach the theoretical maximum bandwidth, and walk you through
the diagnostics and performance-tuning strategies that will help you optimize RCCL and
xGMI bandwidth on AMD MI300X systems. We first introduce xGMI and its performance
constraints, then RCCL and its bandwidth limitations, and finally cover several practical benchmarks and best practices for maximizing RCCL efficiency.
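To give a taste of the kind of measurement involved, here is a minimal sketch that times an all-reduce with PyTorch's `torch.distributed` (which uses RCCL on ROCm builds) and converts the timing into an approximate bus bandwidth. The payload size, iteration counts, and the `torchrun` launch are illustrative assumptions; the rccl-tests suite remains the reference benchmark.

```python
# bandwidth_probe.py -- illustrative only; rccl-tests is the reference benchmark.
# Launch with: torchrun --nproc_per_node=8 bandwidth_probe.py
import os
import time
import torch
import torch.distributed as dist

def main():
    dist.init_process_group(backend="nccl")  # RCCL backs "nccl" on ROCm builds of PyTorch
    torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))

    nbytes = 1 << 30  # 1 GiB payload per rank (illustrative)
    x = torch.ones(nbytes // 4, dtype=torch.float32, device="cuda")

    # Warm up so RCCL channels and xGMI links are established before timing.
    for _ in range(5):
        dist.all_reduce(x)
    torch.cuda.synchronize()

    iters = 20
    t0 = time.perf_counter()
    for _ in range(iters):
        dist.all_reduce(x)
    torch.cuda.synchronize()
    elapsed = (time.perf_counter() - t0) / iters

    # Ring all-reduce moves roughly 2*(n-1)/n of the payload per GPU ("bus bandwidth").
    n = dist.get_world_size()
    bus_bw = (nbytes * 2 * (n - 1) / n) / elapsed / 1e9
    if dist.get_rank() == 0:
        print(f"approx. all_reduce bus bandwidth: {bus_bw:.1f} GB/s")

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```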
In our previous blog post,
we explored the conceptual differences between Peak FLOPs and Max-Achievable FLOPs (MAF), explaining why the gap between these metrics has
widened with modern ML-optimized hardware. This second installment provides a detailed methodology for measuring MAF on AMD GPUs,
including the specific environmental conditions, matrix size optimization techniques, and tools required for accurate measurement.
We present the actual MAF results for AMD Instinct MI300X and MI325X GPUs across different precision formats (FP16, BF16, and FP8)
along with their corresponding median frequencies. We also explain how software efficiency and frequency management impact MAF,
and demonstrate why boost clock capabilities remain important for latency-sensitive workloads such as LLM inference with small batch sizes.
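To make the idea of "achievable" FLOPs concrete, here is a minimal sketch that times a single large GEMM with PyTorch and converts the timing into sustained TFLOPs. The matrix shape, dtype, warm-up, and iteration counts are illustrative placeholders, not the tuned settings behind the published MAF numbers.

```python
# Rough achieved-FLOPs probe for a single GEMM shape -- illustrative only.
import time
import torch

def measure_tflops(m=8192, n=8192, k=8192, dtype=torch.bfloat16, iters=50):
    a = torch.randn(m, k, dtype=dtype, device="cuda")
    b = torch.randn(k, n, dtype=dtype, device="cuda")

    # Warm up so clocks settle and a GEMM kernel is selected before timing.
    for _ in range(10):
        torch.matmul(a, b)
    torch.cuda.synchronize()

    t0 = time.perf_counter()
    for _ in range(iters):
        torch.matmul(a, b)
    torch.cuda.synchronize()
    seconds = (time.perf_counter() - t0) / iters

    flops = 2 * m * n * k          # each multiply-accumulate counts as two FLOPs
    return flops / seconds / 1e12  # TFLOPs actually sustained for this shape

if __name__ == "__main__":
    print(f"achieved ~{measure_tflops():.0f} TFLOPs (BF16, 8192^3 GEMM)")
```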
With the scale of large language models (LLMs) reaching hundreds of billions of parameters, the way we represent data within these enormous models dramatically impacts the resources required to train and serve them (e.g., the number of GPUs needed for inference).
In our previous blogs (JAX mixed precision training; PyTorch AMP), we demonstrated how mixed precision training can accelerate the LLM training process. In this blog post we push things further and show you how quantization to even lower-precision data formats can speed up inference, saving time and memory, without sacrificing the overall performance of the model.
Quantization is a technique that reduces the precision of a model's parameters from 32-bit floating point (FP32) or 16-bit floating point (FP16) to 8-bit integer (INT8). Standard models typically use FP32 precision, but this level of precision is not always necessary for inference. By converting model weights and activations to lower-precision formats like INT8, we can achieve faster computation and lower memory usage, effectively reducing the model size by three-fourths (from 32-bit) or by half (from 16-bit), with only a slight accuracy reduction that is often outweighed by the speed gains.
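As a quick, self-contained illustration of the FP32-to-INT8 idea (separate from the workflow used later in the blog), the sketch below applies PyTorch's post-training dynamic quantization, which runs on CPU, to a toy model and compares the serialized sizes.

```python
# Post-training dynamic quantization of a toy model's linear layers to INT8.
import io
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(1024, 1024), nn.ReLU(), nn.Linear(1024, 10)).eval()

# Weights are stored in INT8; activations are quantized on the fly at inference time.
quantized = torch.ao.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

x = torch.randn(1, 1024)
with torch.inference_mode():
    print(quantized(x).shape)  # same outputs shape, lower-precision weights

def size_mb(m):
    # Serialize the state dict to measure on-disk size.
    buf = io.BytesIO()
    torch.save(m.state_dict(), buf)
    return buf.getbuffer().nbytes / 1e6

print(f"FP32: {size_mb(model):.1f} MB, INT8: {size_mb(quantized):.1f} MB")
```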
For system administrators and power users working with AMD hardware, performance optimization and efficient resource monitoring are paramount. The AMD System Management Interface command-line tool, amd-smi, addresses these needs.
In this blog we explore how to fine-tune the Robustly Optimized BERT Pretraining Approach (RoBERTa) large language model, with emphasis on PyTorch’s mixed precision capabilities. Specifically, we explore using AMD GPUs for mixed precision fine-tuning to achieve faster model training without any major impacts on accuracy.
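For a flavor of what mixed precision looks like in PyTorch, here is a minimal training-step sketch using torch.autocast and GradScaler with a stand-in linear classifier; the actual fine-tuning in the blog uses the full RoBERTa model and dataset.

```python
# Minimal mixed-precision training step with torch.autocast and GradScaler.
import torch
import torch.nn as nn

device = "cuda"  # ROCm builds of PyTorch expose AMD GPUs through the "cuda" device
model = nn.Linear(768, 2).to(device)          # stand-in for a RoBERTa classification head
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
scaler = torch.cuda.amp.GradScaler()
loss_fn = nn.CrossEntropyLoss()

features = torch.randn(8, 768, device=device)   # stand-in for pooled encoder outputs
labels = torch.randint(0, 2, (8,), device=device)

for step in range(10):
    optimizer.zero_grad(set_to_none=True)
    with torch.autocast(device_type="cuda", dtype=torch.float16):
        loss = loss_fn(model(features), labels)   # forward pass runs in FP16 where safe
    scaler.scale(loss).backward()                 # scale the loss to avoid FP16 underflow
    scaler.step(optimizer)
    scaler.update()
```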
This blog provides a comprehensive guide on measuring and comparing the performance of various algorithms in a JAX-implemented generative AI model. Leveraging the JAX Profiler and statistical analysis, this blog demonstrates how to reliably evaluate key steps and compare algorithm performance on AMD GPUs.
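As a minimal sketch of the profiling workflow, the snippet below wraps a jitted function in jax.profiler.trace and blocks on results so device time is attributed correctly; the function, shapes, and trace directory are illustrative, and the statistical comparison across algorithms is what the blog itself covers.

```python
# Profile a jitted function with the JAX Profiler; view the trace in TensorBoard.
import jax
import jax.numpy as jnp

@jax.jit
def step(x):
    # Stand-in for one step of the generative model under study.
    return jnp.sin(x) @ x.T

x = jnp.ones((2048, 2048))
step(x).block_until_ready()   # compile outside the profiled region

with jax.profiler.trace("/tmp/jax-trace"):   # writes a trace under this directory
    for _ in range(10):
        step(x).block_until_ready()          # block so device time lands in the trace
```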
In this blog, we will show how to leverage PyTorch TunableOp to accelerate models using ROCm on AMD GPUs.
We will discuss the basics of General Matrix Multiplications (GEMMs), show an example of tuning a single GEMM, and finally demonstrate real-world performance gains on an LLM (Gemma) using TunableOp.
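As a preview, the sketch below enables TunableOp through its environment variables before running an FP16 GEMM, so the first occurrence of each GEMM shape is tuned and the result is written to a CSV for reuse. The variable names follow PyTorch's TunableOp documentation; the shapes and file name are illustrative.

```python
# Enable PyTorch TunableOp so GEMMs are tuned once and the results are reused.
# The environment variables must be set before any GEMM is executed.
import os
os.environ["PYTORCH_TUNABLEOP_ENABLED"] = "1"   # turn TunableOp on
os.environ["PYTORCH_TUNABLEOP_TUNING"] = "1"    # allow tuning (0 = replay saved results only)
os.environ["PYTORCH_TUNABLEOP_FILENAME"] = "tunableop_results.csv"

import torch

a = torch.randn(4096, 4096, dtype=torch.float16, device="cuda")
b = torch.randn(4096, 4096, dtype=torch.float16, device="cuda")

# The first call with a new GEMM shape triggers tuning; later calls use the best-found kernel.
c = a @ b
torch.cuda.synchronize()
```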
Many scientific applications run on AMD-equipped computing platforms and supercomputers,
including Frontier, the first Exascale system in
the world. These applications, coming from a myriad of science domains, were ported to
run on AMD GPUs using the Heterogeneous-compute Interface for Portability (HIP)
abstraction layer. HIP enables these High-Performance Computing (HPC) facilities to
transition their CUDA codes to run on, and take advantage of, the latest AMD GPUs.
The effort involved in porting these scientific applications varies from a few hours
to a few weeks and largely depends on the complexity of the original source code.
Figure 1 shows several examples of applications that have been ported and the
corresponding porting effort.
Register pressure in GPU kernels has a tremendous impact on the overall performance
of your HPC application. Understanding and controlling register usage allows developers
to carefully design codes capable of maximizing hardware resources. This blog post
focuses on a practical demo showing how to apply the recommendations explained in this
OLCF training talk, presented on August 23, 2022. Here is the
training archive where you
can also find the slides. We focus solely on the AMD CDNA™2 architecture (MI200 series GPUs)
using ROCm 5.4.
Matrix multiplication is a fundamental aspect of linear algebra and a
ubiquitous computation within High Performance Computing (HPC) applications.
Since the introduction of AMD's CDNA architecture, Generalized Matrix Multiplication
(GEMM) computations have been hardware-accelerated through Matrix Core Processing
Units. Matrix Core accelerated GEMM kernels lie at the heart of BLAS libraries
like rocBLAS, but they can also be programmed directly by developers. Applications
that are throughput-bound by GEMM computation can achieve additional speedups by utilizing Matrix Cores.
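For context, GEMM computes D = α·A·B + β·C. The sketch below expresses exactly that through torch.addmm, which on ROCm dispatches to rocBLAS GEMM kernels that use the Matrix Cores for FP16/BF16 inputs; the shapes and scalars are illustrative.

```python
# GEMM, D = alpha * A @ B + beta * C, expressed through the BLAS path in PyTorch.
import torch

m, n, k = 4096, 4096, 4096
a = torch.randn(m, k, dtype=torch.float16, device="cuda")
b = torch.randn(k, n, dtype=torch.float16, device="cuda")
c = torch.randn(m, n, dtype=torch.float16, device="cuda")

# torch.addmm maps onto a rocBLAS GEMM call; for FP16/BF16 inputs the selected
# kernels are built around the Matrix Core (MFMA) instructions on CDNA GPUs.
d = torch.addmm(c, a, b, beta=0.5, alpha=2.0)
torch.cuda.synchronize()
print(d.shape)
```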