This blog post will introduce you to the advantages of the AMD Instinct™ MI300A accelerated processing unit (APU),
discussing the hardware architecture and how to leverage its GPU programming capabilities.
For an application developer it is often helpful to read the Instruction
Set Architecture (ISA) for the GPU architecture that is used to perform its
computations. Understanding the instructions of the pertinent code
regions of interest can help in debugging and achieving performance
optimization of the application.
The C++17 standard added the concept of parallel algorithms to the
pre-existing C++ Standard Library. The parallel versions of algorithms like
std::transform maintain the same signature as the regular serial versions,
except for the addition of an extra parameter specifying the
execution policy to use. This flexibility allows users that are already
using the C++ Standard Library algorithms to take advantage of multi-core
architectures by just introducing minimal changes to their code.
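As a minimal sketch of how small that change can be, the two functions below scale a vector with std::transform, once serially and once with the std::execution::par policy. The helper names scale_serial and scale_parallel are hypothetical and chosen only for illustration. Whether the parallel call actually runs on multiple cores depends on the standard library and toolchain; with GCC's libstdc++, for example, the parallel policies typically require linking against Intel TBB.

```cpp
#include <algorithm>
#include <cassert>
#include <execution>
#include <vector>

// Hypothetical helper: serial std::transform, out[i] = factor * in[i].
void scale_serial(const std::vector<double>& in, double factor,
                  std::vector<double>& out) {
    assert(out.size() == in.size());
    std::transform(in.begin(), in.end(), out.begin(),
                   [factor](double x) { return factor * x; });
}

// Hypothetical helper: the same call with one extra leading argument,
// the execution policy. Everything else is unchanged.
void scale_parallel(const std::vector<double>& in, double factor,
                    std::vector<double>& out) {
    assert(out.size() == in.size());
    std::transform(std::execution::par, in.begin(), in.end(), out.begin(),
                   [factor](double x) { return factor * x; });
}
```

The lambda has no side effects and each output element depends on a single input element, which is what makes it safe to hand to a parallel policy.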
In Part 1 of the Affinity blog series, we looked at the
importance of setting affinity for High Performance Computing (HPC) workloads. In this
blog post, our goals are the following:
Modern hardware architectures are increasingly complex with multiple sockets,
many cores in each Central Processing Unit (CPU), Graphics Processing Units
(GPUs), memory controllers, Network Interface Cards (NICs), etc. Peripherals such as
GPUs or memory controllers will often be local to a CPU socket. Such designs present
interesting challenges in optimizing memory access times, data transfer times, etc.
Depending on how the system is built, hardware components are connected,
and the workload being run, it may be advantageous
to use the resources of the system in a specific way. In this article,
we will discuss the role of affinity, placement, and order in improving performance for
High Performance Computing (HPC) workloads. A short case study is also presented to
familiarize you with performance considerations on a node in the
Frontier supercomputer. In a
follow-up article, we also aim to equip you with the tools you
need to understand your system’s hardware topology and set up affinity for your
application accordingly.
MPI is the de facto standard for inter-process communication in High-Performance
Computing. MPI processes compute on their local data while extensively communicating
with each other. This enables MPI programs to be executed on systems with a distributed
memory space, e.g., clusters. There are different types of communications supported
in MPI including point-to-point and collective communications. Point-to-point
communication is the basic communication mechanism in which both the sending
process and the receiving process take part in the communication. The sender
has a buffer that holds the message and an envelope containing information
that will be used by the receiver side (e.g., message tag, the sender rank number,
etc.). The receiver uses the information in the envelope to select the specified
message and stores it in its receiver buffer. In collective communication,
messages can be exchanged among a group of processes rather than just two of
them. Collective communication provides opportunities for processes to perform
one-to-many and many-to-many communications in a convenient, portable and
optimized way. Some examples of collective communications include broadcast,
allgather, alltoall, and allreduce.
Register pressure in GPU kernels has a tremendous impact on the overall performance
of your HPC application. Understanding and controlling register usage allows developers
to carefully design codes capable of maximizing hardware resources. The following blog
post is focused on a practical demo showing how to apply the recommendations explained in this
OLCF training talk presented on August 23rd, 2022. Here is the
training archive where you
can also find the slides. We focus solely on the AMD CDNA™2 architecture (MI200 series GPUs)
using ROCm 5.4.
Getting a code to be functionally correct is not always enough. In many industries,
it is also required that applications and their complex software stack run as efficiently
as possible to meet operational demands. This is particularly challenging as hardware
continues to evolve over time, and as a result codes may require further tuning.
In practice, many application developers construct benchmarks, which are carefully designed
to measure the performance, such as execution time, of a particular code within an
operational-like setting. In other words: a good benchmark should be representative
of the real work that needs to be done. These benchmarks are useful in that they provide
insight into the characteristics of the application, and enable one to discover potential
bottlenecks that could result in performance degradation during operational settings.
Matrix multiplication is a fundamental aspect of linear algebra and it is a
ubiquitous computation within High Performance Computing (HPC) applications.
Since the introduction of AMD’s CDNA Architecture, Generalized Matrix Multiplication
(GEMM) computations are now hardware-accelerated through Matrix Core Processing
Units. Matrix Core accelerated GEMM kernels lie at the heart of BLAS libraries
like rocBLAS but they can also be programmed directly by developers. Applications
that are throughput bound by GEMM computation can achieve additional speedups by utilizing Matrix Cores.
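For readers less familiar with the operation itself, the sketch below spells out what a GEMM computes: C = alpha * A * B + beta * C. It is a plain CPU reference loop, not Matrix Core code and not the rocBLAS API. The function name gemm_reference and the row-major layout are assumptions made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical reference GEMM: C = alpha * A * B + beta * C.
// A is M x K, B is K x N, C is M x N, all stored row-major.
void gemm_reference(std::size_t M, std::size_t N, std::size_t K,
                    double alpha, const std::vector<double>& A,
                    const std::vector<double>& B,
                    double beta, std::vector<double>& C) {
    assert(A.size() == M * K);
    assert(B.size() == K * N);
    assert(C.size() == M * N);
    for (std::size_t i = 0; i < M; ++i) {
        for (std::size_t j = 0; j < N; ++j) {
            // Dot product of row i of A with column j of B.
            double acc = 0.0;
            for (std::size_t k = 0; k < K; ++k) {
                acc += A[i * K + k] * B[k * N + j];
            }
            C[i * N + j] = alpha * acc + beta * C[i * N + j];
        }
    }
}
```

A loop like this is mainly useful as a correctness baseline when validating an accelerated kernel on small inputs.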