Profiling is the backbone of performance optimization in AI and HPC
workloads, enabling developers to extract maximum efficiency from AMD
Instinct™ GPUs. With ROCm’s rapid evolution, the need for a unified,
scalable, and extensible profiling framework has never been more
critical. The new ROCprofiler-SDK framework represents a significant
step forward in profiling capabilities, offering enhanced features,
streamlined integration, and a better user experience while also solving
past limitations with former profiler interface versions. This guide
aims to help users seamlessly transition from legacy profiling tools
to the ROCprofiler-SDK infrastructure. We will explore new features,
highlight key differences from previous tools, and provide actionable
steps for a smooth migration.
In our previous blog post,
we explored the conceptual differences between Peak FLOPs and Max-Achievable FLOPs (MAF), explaining why the gap between these metrics has
widened with modern ML-optimized hardware. This second installment provides a detailed methodology for measuring MAF on AMD GPUs,
including the specific environmental conditions, matrix size optimization techniques, and tools required for accurate measurement.
We present the actual MAF results for AMD Instinct MI300X and MI325X GPUs across different precision formats (FP16, BF16, and FP8)
along with their corresponding median frequencies. We also explain how software efficiency and frequency management impact MAF,
and demonstrate why boost clock capabilities remain important for latency-sensitive workloads such as LLM inference with small batch sizes.
This blog post will introduce you to the advantages of AMD Instinct™ MI300A accelerated processing unit (APU),
discussing the hardware architecture and how to leverage its GPU programming capabilities.
This blog introduces the inner compute and memory architecture of the
AMD Instinct™ MI300,
showing you how to use the MI300 GPU’s different partition modes to supercharge
performance-critical applications. In this blog, you will first get a brief
introduction to the MI300 architecture, explaining how the MI300 compute and
memory partitions can be used to your advantage. You will then learn in detail
about the compute partitioning modes and the memory partitioning modes. Further, two
case studies demonstrate and benchmark the performance of the different modes.
For convenience this blog uses the MI300X as a case-in-point example.
This blog will guide you, step-by-step, through the process of installing and running benchmarks with Ansys Fluent and AMD MI300X. We start with an overview of the Ansys Fluent CFD application and then show you how to set up an AMD MI300X system to run benchmarks. The blog’s benchmark results demonstrate the dramatic impact the MI300X has on speeding up simulations, improving design efficiency, and reducing costs in the automotive, aerospace, and environmental engineering industries.
We are excited to share a brief preview of AMD’s
Next-Gen Fortran Compiler,
our new open source Fortran compiler supporting OpenMP offloading. AMD’s
Next-Gen Fortran Compiler
is a downstream flavor of LLVM Flang, optimized for AMD GPUs.
Our Next-Gen Fortran Compiler
enables OpenMP offloading and offers a direct interface to ROCm and HIP.
In this blog post you will:
For system administrators and power users working with AMD hardware, performance optimization and efficient monitoring of resources are paramount. The AMD System Management Interface command-line tool, amd-smi, addresses these needs.
Graphs and graph analytics are related concepts that can help us understand complex
data and relationships. In this context, a graph is a mathematical model that represents entities
(called nodes or vertices) and their connections (called edges or links). And graph analytics
is a form of data analysis that uses graph structures and algorithms to reveal insights
from the data.
TensorFlow Profiler consists of a set of tools designed to measure resource utilization and performance during the execution of TensorFlow models. It offers insights into how a model interacts with hardware resources, including execution time and memory usage. TensorFlow Profiler helps in pinpointing performance bottlenecks, allowing us to fine-tune the execution of models for improved efficiency and faster outcomes which can be crucial in scenarios where near-real-time predictions are required.
For an application developer, it is often helpful to read the Instruction
Set Architecture (ISA) for the GPU architecture used to perform the
application’s computations. Understanding the instructions of the pertinent code
regions of interest can help in debugging and achieving performance
optimization of the application.
Rocprof is a robust tool designed to analyze and optimize the performance of HIP programs on AMD ROCm platforms, helping developers pinpoint and resolve performance bottlenecks. Rocprof provides a variety of profiling data, including performance counters, hardware traces, and runtime API/activity traces.
Many scientific applications run on AMD-equipped computing platforms and supercomputers,
including Frontier, the first Exascale system in
the world. These applications, coming from a myriad of science domains, were ported to
run on AMD GPUs using the Heterogeneous-compute Interface for Portability (HIP)
abstraction layer. HIP enables these High-Performance Computing (HPC) facilities to
transition their CUDA codes to run and take advantage of the latest AMD GPUs.
The effort involved in porting these scientific applications varies from a few hours
to a few weeks and largely depends on the complexity of the original source code.
Figure 1 shows several examples of applications that have been ported and the
corresponding porting effort.
The C++17 standard added the concept of parallel algorithms to the
pre-existing C++ Standard Library. The parallel versions of algorithms like
std::transform maintain the same signature as the regular serial versions,
except for the addition of an extra parameter specifying the
execution policy to use. This flexibility allows users who are already
using the C++ Standard Library algorithms to take advantage of multi-core
architectures by just introducing minimal changes to their code.
Julia is a high-level, general-purpose
dynamic programming language that automatically compiles to efficient
native code via LLVM, and supports multiple platforms.
With LLVM comes support for programming GPUs, including AMD GPUs.
In Part 1 of the Affinity blog series, we looked at the
importance of setting affinity for High Performance Computing (HPC) workloads. In this
blog post, our goals are the following:
Modern hardware architectures are increasingly complex with multiple sockets,
many cores in each Central Processing Unit (CPU), Graphical Processing Units
(GPUs), memory controllers, Network Interface Cards (NICs), etc. Peripherals such as
GPUs or memory controllers will often be local to a CPU socket. Such designs present
interesting challenges in optimizing memory access times, data transfer times, etc.
Depending on how the system is built, hardware components are connected,
and the workload being run, it may be advantageous
to use the resources of the system in a specific way. In this article,
we will discuss the role of affinity, placement, and order in improving performance for
High Performance Computing (HPC) workloads. A short case study is also presented to
familiarize you with performance considerations on a node in the
Frontier supercomputer. In a
follow-up article, we also aim to equip you with the tools you
need to understand your system’s hardware topology and set up affinity for your
application accordingly.
MPI is the de facto standard for inter-process communication in High-Performance
Computing. MPI processes compute on their local data while extensively communicating
with each other. This enables MPI programs to be executed on systems with a distributed
memory space, e.g., clusters. There are different types of communication supported
in MPI including point-to-point and collective communications. Point-to-point
communication is the basic communication mechanism in which both the sending
process and the receiving process take part in the communication. The sender
has a buffer that holds the message and an envelope containing information
that will be used by the receiver side (e.g., message tag, the sender rank number,
etc.). The receiver uses the information in the envelope to select the specified
message and stores it in its receiver buffer. In collective communication,
messages can be exchanged among a group of processes rather than just two of
them. Collective communication provides opportunities for processes to perform
one-to-many and many-to-many communications in a convenient, portable and
optimized way. Some examples of collective communications include broadcast,
allgather, alltoall, and allreduce.
Register pressure in GPU kernels has a tremendous impact on the overall performance
of your HPC application. Understanding and controlling register usage allows developers
to carefully design codes capable of maximizing hardware resources. The following blog
post is focused on a practical demo showing how to apply the recommendations explained in this
OLCF training talk presented on August 23rd 2022. Here is the
training archive where you
can also find the slides. We focus solely on the AMD CDNA™2 architecture (MI200 series GPUs)
using ROCm 5.4.
Getting a code to be functionally correct is not always enough. In many industries,
it is also required that applications and their complex software stack run as efficiently
as possible to meet operational demands. This is particularly challenging as hardware
continues to evolve over time, and as a result codes may require further tuning.
In practice, many application developers construct benchmarks, which are carefully designed
to measure the performance, such as execution time, of a particular code within an
operational-like setting. In other words: a good benchmark should be representative
of the real work that needs to be done. These benchmarks are useful in that they provide
insight into the characteristics of the application, and enable one to discover potential
bottlenecks that could result in performance degradation during operational settings.
AMD ROCm™ is the first open-source software development platform for HPC/Hyperscale-class
GPU computing. AMD ROCm™ brings the UNIX philosophy of choice, minimalism and modular software
development to GPU computing. Please see the AMD
Open Software Platform for GPU Compute
and ROCm Informational Portal pages for more information.
Matrix multiplication is a fundamental aspect of linear algebra and a
ubiquitous computation within High Performance Computing (HPC) applications.
Since the introduction of AMD’s CDNA Architecture, Generalized Matrix Multiplication
(GEMM) computations are now hardware-accelerated through Matrix Core Processing
Units. Matrix Core accelerated GEMM kernels lie at the heart of BLAS libraries
like rocBLAS but they can also be programmed directly by developers. Applications
that are throughput bound by GEMM computation can achieve additional speedups by utilizing Matrix Cores.