This blog will introduce you to the updated AMD Docker image, built and optimized specifically for distributed training. As you will see, the optimized AMD ROCm Docker image makes training large AI models faster and more efficient. It includes updates such as better fine-tuning tools, improved performance for multi-GPU setups, and support for FP8 precision, which helps speed up training while using less memory. Together, these updates provide a smoother and more efficient training experience on popular models such as Flux and Llama 3.1 running on AMD GPUs.
Efficient inter-GPU communication is the backbone of high-performance AI
and HPC workloads, where technologies like RCCL and xGMI play pivotal
roles. However, some limitations in achieving theoretical peak bandwidth
have raised questions about performance bottlenecks. In this blog we explain
why multi-GPU clusters fall short of the theoretical maximum bandwidth,
and teach you a set of diagnostics and performance-tuning strategies
that will help you optimize RCCL and xGMI bandwidth on AMD MI300X systems. We will
first introduce you to xGMI and its performance constraints, then to RCCL and its bandwidth
limitations, and finally cover several practical benchmarks and best practices for maximizing RCCL efficiency.
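For measurements like these, RCCL ships with a companion benchmark suite (rccl-tests). Purely as an illustration of what such a bandwidth benchmark does under the hood, here is a minimal, hedged sketch, not taken from the blog, that drives all visible GPUs from one process, times a single in-place all-reduce, and reports an estimated bus bandwidth. The buffer size and the absence of repeated, averaged iterations are simplifying assumptions.

```cpp
// Simplified single-process all-reduce bandwidth probe (hedged sketch).
// Build (assumption): hipcc rccl_allreduce.cpp -lrccl
#include <hip/hip_runtime.h>
#include <rccl/rccl.h>
#include <chrono>
#include <cstdio>
#include <vector>

#define HIP_CHECK(cmd)                                        \
  do {                                                        \
    hipError_t e = (cmd);                                     \
    if (e != hipSuccess) {                                    \
      std::printf("HIP error: %s\n", hipGetErrorString(e));   \
      return 1;                                               \
    }                                                         \
  } while (0)

int main() {
  int ndev = 0;
  HIP_CHECK(hipGetDeviceCount(&ndev));
  if (ndev < 2) {
    std::printf("need at least 2 GPUs for a meaningful all-reduce\n");
    return 0;
  }

  const size_t count = 64UL * 1024 * 1024;  // 64M floats = 256 MiB per GPU

  std::vector<float*> buf(ndev);
  std::vector<hipStream_t> streams(ndev);
  std::vector<ncclComm_t> comms(ndev);

  // One buffer and one stream per GPU.
  for (int i = 0; i < ndev; ++i) {
    HIP_CHECK(hipSetDevice(i));
    HIP_CHECK(hipMalloc(&buf[i], count * sizeof(float)));
    HIP_CHECK(hipMemset(buf[i], 0, count * sizeof(float)));
    HIP_CHECK(hipStreamCreate(&streams[i]));
  }

  // One RCCL communicator per GPU, all created from this single process.
  ncclCommInitAll(comms.data(), ndev, nullptr);

  // Launch one in-place all-reduce per GPU inside a group call, then time it.
  auto t0 = std::chrono::high_resolution_clock::now();
  ncclGroupStart();
  for (int i = 0; i < ndev; ++i) {
    ncclAllReduce(buf[i], buf[i], count, ncclFloat, ncclSum, comms[i], streams[i]);
  }
  ncclGroupEnd();
  for (int i = 0; i < ndev; ++i) {
    HIP_CHECK(hipSetDevice(i));
    HIP_CHECK(hipStreamSynchronize(streams[i]));
  }
  auto t1 = std::chrono::high_resolution_clock::now();
  double sec = std::chrono::duration<double>(t1 - t0).count();

  // nccl-tests-style "bus bandwidth" estimate for all-reduce: 2*(n-1)/n * bytes / time.
  double bytes = static_cast<double>(count) * sizeof(float);
  double busbw = 2.0 * (ndev - 1) / ndev * bytes / sec / 1e9;
  std::printf("%d GPUs, %.0f MiB/GPU: ~%.1f GB/s bus bandwidth\n",
              ndev, bytes / (1024.0 * 1024.0), busbw);

  for (int i = 0; i < ndev; ++i) ncclCommDestroy(comms[i]);
  return 0;
}
```

A production benchmark would add warm-up iterations, average over many runs, and sweep message sizes, which is exactly what rccl-tests automates.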
In our previous blog post,
we explored the conceptual differences between Peak FLOPs and Max-Achievable FLOPs (MAF), explaining why the gap between these metrics has
widened with modern ML-optimized hardware. This second installment provides a detailed methodology for measuring MAF on AMD GPUs,
including the specific environmental conditions, matrix size optimization techniques, and tools required for accurate measurement.
We present the actual MAF results for AMD Instinct MI300X and MI325X GPUs across different precision formats (FP16, BF16, and FP8)
along with their corresponding median frequencies. We also explain how software efficiency and frequency management impact MAF,
and demonstrate why boost clock capabilities remain important for latency-sensitive workloads such as LLM inference with small batch sizes.
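To make the measurement idea concrete, the sketch below shows the basic pattern such a measurement follows: run a large GEMM repeatedly, time it with GPU events, and divide the 2·M·N·K operation count by the elapsed time. This is a hedged illustration only, not the blog's methodology; it uses FP32 through hipBLAS rather than the FP16/BF16/FP8 paths discussed above, and the matrix size, iteration count, and warm-up policy are arbitrary assumptions. The header path and link flag may also differ by ROCm version.

```cpp
// Simplified FP32 GEMM throughput probe (hedged sketch).
// Build (assumption): hipcc gemm_tflops.cpp -lhipblas
#include <hip/hip_runtime.h>
#include <hipblas/hipblas.h>
#include <cstdio>

int main() {
  const int N = 8192;                           // square GEMM: M = N = K (arbitrary)
  const double flops_per_gemm = 2.0 * N * N * N;
  const size_t bytes = sizeof(float) * N * N;

  float *A, *B, *C;
  hipMalloc(&A, bytes);
  hipMalloc(&B, bytes);
  hipMalloc(&C, bytes);
  hipMemset(A, 0, bytes);
  hipMemset(B, 0, bytes);

  hipblasHandle_t handle;
  hipblasCreate(&handle);
  const float alpha = 1.0f, beta = 0.0f;

  // Warm-up so clocks ramp and the GEMM kernel is selected before timing.
  hipblasSgemm(handle, HIPBLAS_OP_N, HIPBLAS_OP_N, N, N, N,
               &alpha, A, N, B, N, &beta, C, N);
  hipDeviceSynchronize();

  // Time back-to-back GEMMs with GPU events on the default stream.
  const int iters = 20;
  hipEvent_t start, stop;
  hipEventCreate(&start);
  hipEventCreate(&stop);
  hipEventRecord(start, 0);
  for (int i = 0; i < iters; ++i) {
    hipblasSgemm(handle, HIPBLAS_OP_N, HIPBLAS_OP_N, N, N, N,
                 &alpha, A, N, B, N, &beta, C, N);
  }
  hipEventRecord(stop, 0);
  hipEventSynchronize(stop);

  float ms = 0.0f;
  hipEventElapsedTime(&ms, start, stop);
  double tflops = flops_per_gemm * iters / (ms * 1e-3) / 1e12;
  std::printf("achieved ~%.1f TFLOP/s for %d x %d x %d FP32 GEMM\n", tflops, N, N, N);

  hipblasDestroy(handle);
  hipFree(A); hipFree(B); hipFree(C);
  return 0;
}
```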
This blog introduces the inner compute and memory architecture of the
AMD Instinct™ MI300,
showing you how to use the MI300 GPU’s different partition modes to supercharge
performance-critical applications. In this blog, you will first get a brief
introduction to the MI300 architecture, explaining how the MI300 compute and
memory partitions can be used to your advantage. You will then learn in detail
about the compute partitioning modes and the memory partitioning modes. Finally, two
case studies demonstrate and benchmark the performance of the different modes.
For convenience, this blog uses the MI300X as a case-in-point example.
We are excited to share a brief preview of AMD’s
Next-Gen Fortran Compiler,
our new open-source Fortran compiler supporting OpenMP offloading. AMD’s
Next-Gen Fortran Compiler
is a downstream flavor of LLVM Flang, optimized for AMD GPUs.
Our Next-Gen Fortran Compiler
enables OpenMP offloading and offers a direct interface to ROCm and HIP.
In this blog post you will:
With the scale of large language models (LLMs) reaching hundreds of billions of parameters, the way we represent data within these enormous models dramatically impacts the resources required to train and deploy them (e.g., the number of GPUs needed for inference).
In our previous blogs (JAX mixed precision training; PyTorch AMP), we already demonstrated how mixed precision training can accelerate the LLM training process. In this blog post we will push things further and show you how quantization into even lower-precision data formats can speed up inference, saving time and memory, without sacrificing the overall performance of the model.
Quantization is a technique that reduces the precision of a model’s parameters from 32-bit floating point (FP32) or 16-bit floating point (FP16) to an 8-bit integer (INT8). Standard models typically use FP32 precision, but this much precision is not always necessary for inference. By converting model weights and activations to a lower-precision format such as INT8, we can achieve faster computation and lower memory usage, effectively reducing the model size by three-fourths (from 32-bit) or half (from 16-bit) with only a slight accuracy reduction, which is often outweighed by the speed gains.
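As a concrete, hedged illustration of the arithmetic involved, the sketch below quantizes a small FP32 tensor to INT8 with a single symmetric per-tensor scale and then dequantizes it to inspect the rounding error. The scale rule (max-abs divided by 127) and the per-tensor granularity are simplifying assumptions; real toolchains typically use per-channel scales and calibration data.

```cpp
// Symmetric per-tensor INT8 quantization of a toy FP32 tensor (hedged sketch).
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <cstdio>
#include <vector>

int main() {
  // Toy FP32 "weights" to quantize.
  std::vector<float> w = {0.02f, -1.30f, 0.75f, 2.40f, -0.10f, 1.05f};

  // Symmetric per-tensor scale: map the largest magnitude to 127.
  float max_abs = 0.0f;
  for (float v : w) max_abs = std::max(max_abs, std::fabs(v));
  const float scale = max_abs / 127.0f;

  // Quantize: round(x / scale), clamped to the signed 8-bit range.
  std::vector<int8_t> q(w.size());
  for (size_t i = 0; i < w.size(); ++i) {
    float r = std::round(w[i] / scale);
    q[i] = static_cast<int8_t>(std::clamp(r, -127.0f, 127.0f));
  }

  // Dequantize to see how much precision the rounding cost.
  for (size_t i = 0; i < w.size(); ++i) {
    float back = q[i] * scale;
    std::printf("%8.4f -> %4d -> %8.4f\n", w[i], q[i], back);
  }

  // Each value now needs 1 byte instead of 4: a 4x reduction relative to FP32.
  std::printf("storage: %zu bytes (INT8) vs %zu bytes (FP32)\n",
              q.size() * sizeof(int8_t), w.size() * sizeof(float));
  return 0;
}
```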
For system administrators and power users working with AMD hardware, performance optimization and efficient monitoring of resources are paramount. The AMD System Management Interface command-line tool, amd-smi, addresses these needs.
Graphs and graph analytics are related concepts that can help us understand complex
data and relationships. In this context, a graph is a mathematical model that represents entities
(called nodes or vertices) and their connections (called edges or links). Graph analytics,
in turn, is a form of data analysis that uses graph structures and algorithms to reveal insights
from the data.
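To make the terminology concrete, here is a small, hedged sketch (not tied to any particular graph library) that stores a graph as an adjacency list and runs a breadth-first search, one of the simplest graph-analytics kernels, to compute every node's distance from a source node.

```cpp
// Adjacency-list graph plus a breadth-first search (hedged sketch).
#include <cstdio>
#include <queue>
#include <vector>

int main() {
  // A tiny undirected graph with 6 nodes: adj[u] holds the neighbors (edges) of node u.
  const int n = 6;
  std::vector<std::vector<int>> adj(n);
  auto add_edge = [&](int u, int v) { adj[u].push_back(v); adj[v].push_back(u); };
  add_edge(0, 1); add_edge(0, 2); add_edge(1, 3);
  add_edge(2, 3); add_edge(3, 4); add_edge(4, 5);

  // Breadth-first search from node 0: dist[v] = number of hops from node 0 to v.
  std::vector<int> dist(n, -1);
  std::queue<int> frontier;
  dist[0] = 0;
  frontier.push(0);
  while (!frontier.empty()) {
    int u = frontier.front();
    frontier.pop();
    for (int v : adj[u]) {
      if (dist[v] == -1) {      // not visited yet
        dist[v] = dist[u] + 1;
        frontier.push(v);
      }
    }
  }

  for (int v = 0; v < n; ++v)
    std::printf("node %d is %d hop(s) from node 0\n", v, dist[v]);
  return 0;
}
```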
The C++17 standard added the concept of parallel algorithms to the
pre-existing C++ Standard Library. The parallel versions of algorithms like
std::transform maintain the same signature as the regular serial versions,
except for the addition of an extra parameter specifying the
execution policy to use. This flexibility allows users who are already
using the C++ Standard Library algorithms to take advantage of multi-core
architectures by introducing only minimal changes to their code.
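As a brief, hedged illustration (the data and the sqrt operation are arbitrary choices, not from the blog), the snippet below shows the same std::transform call in its serial form and with std::execution::par added as the first argument. Note that with GCC's libstdc++ the parallel policies are typically backed by TBB, so linking with -ltbb may be required.

```cpp
// Same algorithm, serial vs. parallel execution policy (hedged sketch).
// Example build with GCC (assumption): g++ -std=c++17 transform.cpp -ltbb
#include <algorithm>
#include <cmath>
#include <cstdio>
#include <execution>
#include <vector>

int main() {
  std::vector<double> x(1'000'000, 2.0), y(x.size());

  // Serial form: no execution policy argument.
  std::transform(x.begin(), x.end(), y.begin(),
                 [](double v) { return std::sqrt(v); });

  // Parallel form: identical call, plus std::execution::par as the first argument.
  std::transform(std::execution::par, x.begin(), x.end(), y.begin(),
                 [](double v) { return std::sqrt(v); });

  std::printf("y[0] = %f\n", y[0]);
  return 0;
}
```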
In Part 1 of the Affinity blog series, we looked at the
importance of setting affinity for High Performance Computing (HPC) workloads. In this
blog post, our goals are the following:
Modern hardware architectures are increasingly complex, with multiple sockets,
many cores in each Central Processing Unit (CPU), Graphics Processing Units
(GPUs), memory controllers, Network Interface Cards (NICs), etc. Peripherals such as
GPUs or memory controllers will often be local to a CPU socket. Such designs present
interesting challenges in optimizing memory access times, data transfer times, etc.
Depending on how the system is built, how its hardware components are connected,
and what workload is being run, it may be advantageous
to use the resources of the system in a specific way. In this article,
we will discuss the role of affinity, placement, and order in improving performance for
High Performance Computing (HPC) workloads. A short case study is also presented to
familiarize you with performance considerations on a node in the
Frontier supercomputer. In a
follow-up article, we also aim to equip you with the tools you
need to understand your system’s hardware topology and set up affinity for your
application accordingly.
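As a tiny, hedged illustration of what "setting affinity" means at the operating-system level (the CPU number is an arbitrary assumption, and this is not the tooling discussed in the blog itself), the sketch below pins the calling process to a single logical CPU via the Linux sched_setaffinity interface. In practice, HPC users usually rely on job launchers, numactl, or hwloc rather than calling this API directly.

```cpp
// Pin the calling process to one logical CPU on Linux (hedged sketch).
#ifndef _GNU_SOURCE
#define _GNU_SOURCE   // needed for CPU_ZERO/CPU_SET and sched_getcpu on glibc
#endif
#include <sched.h>
#include <cstdio>

int main() {
  // Build a CPU mask containing only logical CPU 4 (arbitrary example).
  cpu_set_t mask;
  CPU_ZERO(&mask);
  CPU_SET(4, &mask);

  // Apply the mask to the calling process (pid 0 means "myself").
  if (sched_setaffinity(0, sizeof(mask), &mask) != 0) {
    std::perror("sched_setaffinity");
    return 1;
  }

  std::printf("now running on logical CPU %d\n", sched_getcpu());
  return 0;
}
```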
MPI is the de facto standard for inter-process communication in High-Performance
Computing. MPI processes compute on their local data while extensively communicating
with each other. This enables MPI programs to be executed on systems with a distributed
memory space, e.g., clusters. There are different types of communication supported
in MPI, including point-to-point and collective communication. Point-to-point
communication is the basic communication mechanism in which both the sending
process and the receiving process take part in the communication. The sender
has a buffer that holds the message and an envelope containing information
that will be used by the receiver side (e.g., the message tag, the sender's rank,
etc.). The receiver uses the information in the envelope to select the specified
message and stores it in its receive buffer. In collective communication,
messages can be exchanged among a group of processes rather than just two of
them. Collective communication provides opportunities for processes to perform
one-to-many and many-to-many communications in a convenient, portable and
optimized way. Some examples of collective communications include broadcast,
allgather, alltoall, and allreduce.
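To ground these terms, here is a minimal, hedged sketch (file and program names are arbitrary) showing both flavors: a point-to-point exchange in which rank 0 sends one integer to rank 1 using MPI_Send/MPI_Recv, and a collective MPI_Allreduce in which every rank contributes a value and all ranks receive the sum.

```cpp
// Point-to-point and collective MPI communication in one tiny program (hedged sketch).
#include <mpi.h>
#include <cstdio>

int main(int argc, char** argv) {
  MPI_Init(&argc, &argv);

  int rank = 0, size = 0;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &size);

  // Point-to-point: rank 0 sends one integer to rank 1; the tag (0) and the
  // communicator identify the message, i.e., the "envelope" described above.
  if (size >= 2) {
    int msg = 42;
    if (rank == 0) {
      MPI_Send(&msg, 1, MPI_INT, /*dest=*/1, /*tag=*/0, MPI_COMM_WORLD);
    } else if (rank == 1) {
      MPI_Recv(&msg, 1, MPI_INT, /*source=*/0, /*tag=*/0, MPI_COMM_WORLD,
               MPI_STATUS_IGNORE);
      std::printf("rank 1 received %d from rank 0\n", msg);
    }
  }

  // Collective: every rank contributes its rank number; all ranks get the sum.
  int sum = 0;
  MPI_Allreduce(&rank, &sum, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);
  if (rank == 0) std::printf("sum of ranks = %d\n", sum);

  MPI_Finalize();
  return 0;
}
```

Compiled with an MPI wrapper such as mpicxx and launched with, for example, mpirun -np 4, every rank runs the same program and is distinguished only by its rank number.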