Posts tagged Compiler

ROCm 7.0: An AI-Ready Powerhouse for Performance, Efficiency, and Productivity

16 September 2025

Artificial intelligence now defines the performance envelope for modern computation. In this blog, we introduce the AI-centric ROCm 7.0, designed to help our community benefit directly from this dramatic paradigm shift. ROCm 7.0 delivers a platform purpose-built for the era of generative AI, large-scale inference and training, and accelerated discovery, helping you boost the performance, efficiency, and scalability of your workloads.

Read more ...

ROCm Revisited: Evolution of the High-Performance GPU Computing Ecosystem

06 June 2025

09 June 2025

This blog is part of our ROCm Revisited series [1]. The purpose of this series is to share the story of ROCm and our journey through the changes and successes we’ve achieved over the past few years. We’ll explore the key milestones in our development, the innovative technologies that have propelled us forward, and the challenges we’ve overcome to establish our leadership in the world of GPU computing.

Read more ...

HIP 7.0 Is Coming: What You Need to Know to Stay Ahead

28 May 2025

20 June 2025

At AMD, we understand that code portability between AMD and NVIDIA GPU programming models is top of mind for our customers. We are committed to making GPU development more seamless and portable across vendors. With the upcoming HIP 7.0 release in the second half of 2025, we’re taking a bold step toward simplifying cross-platform programming by aligning HIP C++ even more closely with CUDA. AMD tightly integrates our automatic HIPIFY conversion tool with our HIP runtime and compiler. Users can quickly port CUDA code into HIP C++ with HIPIFY to target AMD GPUs. However, small differences between our implementation of the HIP C++ programming model and CUDA C++ often require manual adjustments to your code base. This causes additional work for software developers targeting GPU families from both providers. We understand this and, based on customer requests, are making changes to ROCm to reduce this friction. We also know that adopting changes in our programming model requires early notification. We don’t take API-breaking changes lightly, so for your benefit we are making an early prototype available to assist in porting to the new HIP 7.0 API. The preview release is based on the ROCm 6.4.1 release for functionality but contains 7.0 API previews. It is intended as a drop-in replacement for 6.4.1 for non-production use, enabling users to write code with the new API and adopt HIP 7.0 more smoothly. In this blog, you will learn how HIP 7.0 aligns more closely with CUDA, what API and behavior changes to expect, and how to prepare your codebase to ensure compatibility and portability across GPU platforms. Let’s delve into the details of the API changes.

Read more ...

Unleash Full GPU Potential: Overlap Communication and Computation with Triton-Distributed

06 May 2025

In distributed computing, AI workloads demand both massive parallelism and efficient data movement. A primary challenge lies in effectively overlapping computation with communication to maximize performance. GPUs are excellent at crunching numbers. However, their full potential often remains untapped because inter-GPU communication is relatively slow, leaving their compute units idle for long periods while they wait for data from other nodes. In this blog, we will show how you can use the Triton-Distributed framework to generate kernels that overlap communication and computation, resulting in performance that can rival highly optimized libraries.

Read more ...

MI300A - Exploring the APU advantage

09 February 2025

This blog post will introduce you to the advantages of the AMD Instinct™ MI300A accelerated processing unit (APU), discussing the hardware architecture and how to leverage its GPU programming capabilities.

Read more ...

Introducing AMD’s Next-Gen Fortran Compiler

13 November 2024

We are excited to share a brief preview of AMD’s Next-Gen Fortran Compiler, our new open-source Fortran compiler supporting OpenMP offloading. AMD’s Next-Gen Fortran Compiler is a downstream flavor of LLVM Flang, optimized for AMD GPUs. Our Next-Gen Fortran Compiler enables OpenMP offloading and offers a direct interface to ROCm and HIP. In this blog post you will:

Read more ...

Accelerate PyTorch Models using torch.compile on AMD GPUs with ROCm

11 July 2024

PyTorch 2.0 introduces torch.compile(), a tool to vastly accelerate PyTorch code and models. By converting PyTorch code into highly optimized kernels, torch.compile delivers substantial performance improvements with minimal changes to the existing codebase. This feature allows for precise optimization of individual functions, entire modules, and complex training loops, providing a versatile and powerful tool for enhancing computational efficiency.

Read more ...

Reading AMD GPU ISA

13 May 2024

For an application developer, it is often helpful to read the Instruction Set Architecture (ISA) of the GPU that performs the application’s computations. Understanding the instructions generated for the code regions of interest can help with debugging and with optimizing the application’s performance.

Read more ...

Application portability with HIP

26 April 2024

Many scientific applications run on AMD-equipped computing platforms and supercomputers, including Frontier, the first Exascale system in the world. These applications, coming from a myriad of science domains, were ported to run on AMD GPUs using the Heterogeneous-compute Interface for Portability (HIP) abstraction layer. HIP enables these High-Performance Computing (HPC) facilities to transition their CUDA codes to run and take advantage of the latest AMD GPUs. The effort involved in porting these scientific applications varies from a few hours to a few weeks and largely depends on the complexity of the original source code. Figure 1 shows several examples of applications that have been ported and the corresponding porting effort.

Read more ...

C++17 parallel algorithms and HIPSTDPAR

18 April 2024

The C++17 standard added the concept of parallel algorithms to the pre-existing C++ Standard Library. The parallel versions of algorithms like std::transform maintain the same signature as the regular serial versions, except for the addition of an extra parameter specifying the execution policy to use. This flexibility allows users who are already using the C++ Standard Library algorithms to take advantage of multi-core architectures with only minimal changes to their code.

Read more ...

GPU-aware MPI with ROCm

08 June 2023

MPI is the de facto standard for inter-process communication in High-Performance Computing. MPI processes compute on their local data while extensively communicating with each other. This enables MPI programs to be executed on systems with a distributed memory space, e.g., clusters. MPI supports different types of communication, including point-to-point and collective communication. Point-to-point communication is the basic communication mechanism, in which both the sending process and the receiving process take part in the communication. The sender has a buffer that holds the message and an envelope containing information that will be used by the receiver (e.g., the message tag, the sender rank number, etc.). The receiver uses the information in the envelope to select the specified message and stores it in its receive buffer. In collective communication, messages can be exchanged among a group of processes rather than just two of them. Collective communication provides opportunities for processes to perform one-to-many and many-to-many communications in a convenient, portable, and optimized way. Some examples of collective communications include broadcast, allgather, alltoall, and allreduce.
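The envelope matching described above can be sketched in a few lines of plain Python. This is only a toy model of the idea, not the MPI API: the `Mailbox` and `Message` names are hypothetical, and the `ANY_SOURCE` / `ANY_TAG` wildcards merely mirror the role that `MPI_ANY_SOURCE` and `MPI_ANY_TAG` play in real MPI.

```python
from dataclasses import dataclass
from typing import Any, List, Optional

# Hypothetical wildcard values for this sketch (not the real MPI constants).
ANY_SOURCE = -1
ANY_TAG = -1


@dataclass
class Message:
    source: int   # envelope: rank of the sending process
    tag: int      # envelope: user-chosen message tag
    payload: Any  # contents of the sender's buffer


class Mailbox:
    """Toy queue of messages waiting at one receiving rank."""

    def __init__(self) -> None:
        self._pending: List[Message] = []

    def send(self, source: int, tag: int, payload: Any) -> None:
        # The sender attaches an envelope (source, tag) to its buffer.
        self._pending.append(Message(source, tag, payload))

    def recv(self, source: int = ANY_SOURCE, tag: int = ANY_TAG) -> Optional[Message]:
        # The receiver selects the first pending message whose envelope matches.
        for index, msg in enumerate(self._pending):
            source_ok = source == ANY_SOURCE or source == msg.source
            tag_ok = tag == ANY_TAG or tag == msg.tag
            if source_ok and tag_ok:
                return self._pending.pop(index)
        return None  # nothing matches yet


box = Mailbox()
box.send(source=0, tag=7, payload="halo data")
box.send(source=2, tag=9, payload="residual")

# Matching is driven by the envelope, not by arrival order.
msg = box.recv(source=2, tag=9)
```

Here the receive for `(source=2, tag=9)` skips the earlier message from rank 0, which stays queued until a matching receive is posted. A real MPI receive would block or return a request handle rather than return `None`.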

Read more ...

Register pressure in AMD CDNA™2 GPUs

17 May 2023

Register pressure in GPU kernels has a tremendous impact on the overall performance of your HPC application. Understanding and controlling register usage allows developers to carefully design codes that make the most of the available hardware resources. This blog post is a practical demo showing how to apply the recommendations explained in an OLCF training talk presented on August 23, 2022; the slides are available in the training archive. We focus solely on the AMD CDNA™2 architecture (MI200 series GPUs) using ROCm 5.4.

Read more ...

Finite difference method - Laplacian part 3

11 May 2023

Read more ...

AMD matrix cores

14 November 2022

Matrix multiplication is a fundamental aspect of linear algebra and a ubiquitous computation within High Performance Computing (HPC) applications. Since the introduction of AMD’s CDNA architecture, Generalized Matrix Multiplication (GEMM) computations have been hardware-accelerated through Matrix Core Processing Units. Matrix Core accelerated GEMM kernels lie at the heart of BLAS libraries like rocBLAS, but they can also be programmed directly by developers. Applications that are throughput-bound by GEMM computation can achieve additional speedups by utilizing Matrix Cores.
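For readers unfamiliar with the term, GEMM in the BLAS convention computes C ← αAB + βC. The pure-Python sketch below shows only that arithmetic on nested lists. The `gemm_reference` name is hypothetical, and the code says nothing about how rocBLAS or Matrix Cores implement the operation.

```python
def gemm_reference(alpha, a, b, beta, c):
    """Return alpha * (a @ b) + beta * c for matrices stored as lists of rows."""
    m, k = len(a), len(a[0])
    n = len(b[0])
    # a is m x k, b must be k x n, and c must be m x n.
    if len(b) != k or len(c) != m or any(len(row) != n for row in c):
        raise ValueError("incompatible matrix shapes")
    return [
        [
            alpha * sum(a[i][p] * b[p][j] for p in range(k)) + beta * c[i][j]
            for j in range(n)
        ]
        for i in range(m)
    ]


a = [[1, 2], [3, 4]]
b = [[5, 6], [7, 8]]
c = [[1, 1], [1, 1]]
result = gemm_reference(2, a, b, 3, c)
print(result)  # [[41, 47], [89, 103]]
```

The triple loop performs m·n·k multiply-adds, which is the work that hardware matrix units are built to accelerate.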

Read more ...