Profiling is the backbone of performance optimization in AI and HPC
workloads, enabling developers to extract maximum efficiency from AMD
Instinct™ GPUs. With ROCm’s rapid evolution, the need for a unified,
scalable, and extensible profiling framework has never been more
critical. The new ROCprofiler-SDK framework represents a significant
step forward in profiling capabilities, offering enhanced features,
streamlined integration, and a better user experience while also solving
past limitations with former profiler interface versions. This guide
aims to help users seamlessly transition from legacy profiling tools
to the ROCprofiler-SDK infrastructure. We will explore new features,
highlight key differences from previous tools, and provide actionable
steps for a smooth migration.
In our previous blog post,
we explored the conceptual differences between Peak FLOPs and Max-Achievable FLOPs (MAF), explaining why the gap between these metrics has
widened with modern ML-optimized hardware. This second installment provides a detailed methodology for measuring MAF on AMD GPUs,
including the specific environmental conditions, matrix size optimization techniques, and tools required for accurate measurement.
We present the actual MAF results for AMD Instinct MI300X and MI325X GPUs across different precision formats (FP16, BF16, and FP8)
along with their corresponding median frequencies. We also explain how software efficiency and frequency management impact MAF,
and demonstrate why boost clock capabilities remain important for latency-sensitive workloads such as LLM inference with small batch sizes.
This blog post will introduce you to the advantages of AMD Instinct™ MI300A accelerated processing unit (APU),
discussing the hardware architecture and how to leverage its GPU programming capabilities.
This blog introduces the inner compute and memory architecture of the
AMD Instinct™ MI300,
showing you how to use the MI300 GPU’s different partition modes to supercharge
performance-critical applications. In this blog, you will first get a brief
introduction to the MI300 architecture, explaining how the MI300 compute and
memory partitions can be used to your advantage. You will then learn in detail
about the compute partitioning modes and the memory partitioning modes. Further, two
case studies demonstrate and benchmark the performance of the different modes.
For convenience this blog uses the MI300X as a case-in-point example.
This blog will guide you, step-by-step, through the process of installing and running benchmarks with Ansys Fluent and AMD MI300X. We start with an overview of the Ansys Fluent CFD application and then show you how to set up an AMD MI300X system to run benchmarks. The blog’s benchmark results demonstrate the dramatic impact the MI300X has on speeding up simulations, improving design efficiency, and reducing costs in the automotive, aerospace, and environmental engineering industries.
We are excited to share a brief preview of AMD’s
Next-Gen Fortran Compiler,
our new open source Fortran compiler supporting OpenMP offloading. AMD’s
Next-Gen Fortran Compiler
is a downstream flavor of LLVM Flang, optimized for AMD GPUs.
Our Next-Gen Fortran Compiler
enables OpenMP offloading and offers a direct interface to ROCm and HIP.
In this blog post you will:
For system administrators and power users working with AMD hardware, performance optimization and efficient monitoring of resources are paramount. The AMD System Management Interface command-line tool, amd-smi, addresses these needs.
Graphs and graph analytics are related concepts that can help us understand complex
data and relationships. In this context, a graph is a mathematical model that represents entities
(called nodes or vertices) and their connections (called edges or links). And graph analytics
is a form of data analysis that uses graph structures and algorithms to reveal insights
from the data.
TensorFlow Profiler consists of a set of tools designed to measure resource utilization and performance during the execution of TensorFlow models. It offers insights into how a model interacts with hardware resources, including execution time and memory usage. TensorFlow Profiler helps in pinpointing performance bottlenecks, allowing us to fine-tune the execution of models for improved efficiency and faster outcomes which can be crucial in scenarios where near-real-time predictions are required.
For an application developer, it is often helpful to read the Instruction
Set Architecture (ISA) for the GPU architecture used to perform the
application’s computations. Understanding the instructions of the pertinent code
regions of interest can help in debugging and achieving performance
optimization of the application.
Rocprof is a robust tool designed to analyze and optimize the performance of HIP programs on AMD ROCm platforms, helping developers pinpoint and resolve performance bottlenecks. Rocprof provides a variety of profiling data, including performance counters, hardware traces, and runtime API/activity traces.
Many scientific applications run on AMD-equipped computing platforms and supercomputers,
including Frontier, the first Exascale system in
the world. These applications, coming from a myriad of science domains, were ported to
run on AMD GPUs using the Heterogeneous-compute Interface for Portability (HIP)
abstraction layer. HIP enables these High-Performance Computing (HPC) facilities to
transition their CUDA codes to run and take advantage of the latest AMD GPUs.
The effort involved in porting these scientific applications varies from a few hours
to a few weeks and largely depends on the complexity of the original source code.
Figure 1 shows several examples of applications that have been ported and the
corresponding porting effort.
The C++17 standard added the concept of parallel algorithms to the
pre-existing C++ Standard Library. The parallel versions of algorithms like
std::transform maintain the same signature as the regular serial versions,
except for the addition of an extra parameter specifying the
execution policy to use. This flexibility allows users who are already
using the C++ Standard Library algorithms to take advantage of multi-core
architectures by just introducing minimal changes to their code.
Julia is a high-level, general-purpose
dynamic programming language that automatically compiles to efficient
native code via LLVM, and supports multiple platforms.
With LLVM comes support for programming GPUs, including AMD GPUs.
In Part 1 of the Affinity blog series, we looked at the
importance of setting affinity for High Performance Computing (HPC) workloads. In this
blog post, our goals are the following:
Modern hardware architectures are increasingly complex with multiple sockets,
many cores in each Central Processing Unit (CPU), Graphical Processing Units
(GPUs), memory controllers, Network Interface Cards (NICs), etc. Peripherals such as
GPUs or memory controllers will often be local to a CPU socket. Such designs present
interesting challenges in optimizing memory access times, data transfer times, etc.
Depending on how the system is built, hardware components are connected,
and the workload being run, it may be advantageous
to use the resources of the system in a specific way. In this article,
we will discuss the role of affinity, placement, and order in improving performance for
High Performance Computing (HPC) workloads. A short case study is also presented to
familiarize you with performance considerations on a node in the
Frontier supercomputer. In a
follow-up article, we also aim to equip you with the tools you
need to understand your system’s hardware topology and set up affinity for your
application accordingly.
MPI is the de facto standard for inter-process communication in High-Performance
Computing. MPI processes compute on their local data while extensively communicating
with each other. This enables MPI programs to be executed on systems with a distributed
memory space, e.g., clusters. There are different types of communication supported
in MPI including point-to-point and collective communications. Point-to-point
communication is the basic communication mechanism in which both the sending
process and the receiving process take part in the communication. The sender
has a buffer that holds the message and an envelope containing information
that will be used by the receiver side (e.g., message tag, the sender rank number,
etc.). The receiver uses the information in the envelope to select the specified
message and stores it in its receiver buffer. In collective communication,
messages can be exchanged among a group of processes rather than just two of
them. Collective communication provides opportunities for processes to perform
one-to-many and many-to-many communications in a convenient, portable and
optimized way. Some examples of collective communications include broadcast,
allgather, alltoall, and allreduce.
Register pressure in GPU kernels has a tremendous impact on the overall performance
of your HPC application. Understanding and controlling register usage allows developers
to carefully design codes capable of maximizing hardware resources. The following blog
post is focused on a practical demo showing how to apply the recommendations explained in this
OLCF training talk presented on August 23rd 2022. Here is the
training archive where you
can also find the slides. We focus solely on the AMD CDNA™2 architecture (MI200 series GPUs)
using ROCm 5.4.
Getting a code to be functionally correct is not always enough. In many industries,
it is also required that applications and their complex software stack run as efficiently
as possible to meet operational demands. This is particularly challenging as hardware
continues to evolve over time, and as a result codes may require further tuning.
In practice, many application developers construct benchmarks, which are carefully designed
to measure the performance, such as execution time, of a particular code within an
operational-like setting. In other words: a good benchmark should be representative
of the real work that needs to be done. These benchmarks are useful in that they provide
insight into the characteristics of the application, and enable one to discover potential
bottlenecks that could result in performance degradation during operational settings.
AMD ROCm™ is the first open-source software development platform for HPC/Hyperscale-class
GPU computing. AMD ROCm™ brings the UNIX philosophy of choice, minimalism and modular software
development to GPU computing. Please see the AMD
Open Software Platform for GPU Compute
and ROCm Informational Portal pages for more information.
Matrix multiplication is a fundamental aspect of linear algebra and a
ubiquitous computation within High Performance Computing (HPC) applications.
Since the introduction of AMD’s CDNA Architecture, Generalized Matrix Multiplication
(GEMM) computations are now hardware-accelerated through Matrix Core Processing
Units. Matrix Core accelerated GEMM kernels lie at the heart of BLAS libraries
like rocBLAS but they can also be programmed directly by developers. Applications
that are throughput bound by GEMM computation can achieve additional speedups by utilizing Matrix Cores.