This blog introduces you to the updated AMD ROCm Docker image, built and optimized specifically for distributed training. As you will see, the optimized image makes training large AI models faster and more efficient. It includes better fine-tuning tools, improved performance for multi-GPU setups, and support for FP8 precision, which speeds up training while using less memory, providing a smoother and more efficient training experience on popular models such as Flux and Llama 3.1 running on AMD GPUs.
Efficient inter-GPU communication is the backbone of high-performance AI
and HPC workloads, where technologies like RCCL and xGMI play pivotal
roles. However, some limitations in achieving theoretical peak bandwidth
have raised questions about performance bottlenecks. In this blog we explain why
multi-GPU clusters rarely reach the theoretical maximum bandwidth, and walk you through
the diagnostics and performance-tuning strategies that will help you optimize RCCL and
xGMI bandwidth on AMD MI300X systems. We first introduce xGMI and its performance
constraints, then RCCL and its bandwidth limitations, and finally cover several practical benchmarks and best practices for maximizing RCCL efficiency.
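To give a taste of the kind of measurement involved, here is a minimal sketch that times an all-reduce with PyTorch's `torch.distributed` (which uses RCCL on ROCm builds) and converts the timing into an approximate bus bandwidth. The payload size, iteration counts, and the `torchrun` launch are illustrative assumptions; the rccl-tests suite remains the reference benchmark.

```python
# bandwidth_probe.py -- illustrative only; rccl-tests is the reference benchmark.
# Launch with: torchrun --nproc_per_node=8 bandwidth_probe.py
import os
import time
import torch
import torch.distributed as dist

def main():
    dist.init_process_group(backend="nccl")  # RCCL backs "nccl" on ROCm builds of PyTorch
    torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))

    nbytes = 1 << 30  # 1 GiB payload per rank (illustrative)
    x = torch.ones(nbytes // 4, dtype=torch.float32, device="cuda")

    # Warm up so RCCL channels and xGMI links are established before timing.
    for _ in range(5):
        dist.all_reduce(x)
    torch.cuda.synchronize()

    iters = 20
    t0 = time.perf_counter()
    for _ in range(iters):
        dist.all_reduce(x)
    torch.cuda.synchronize()
    elapsed = (time.perf_counter() - t0) / iters

    # Ring all-reduce moves roughly 2*(n-1)/n of the payload per GPU ("bus bandwidth").
    n = dist.get_world_size()
    bus_bw = (nbytes * 2 * (n - 1) / n) / elapsed / 1e9
    if dist.get_rank() == 0:
        print(f"approx. all_reduce bus bandwidth: {bus_bw:.1f} GB/s")

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```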
In our previous blog post,
we explored the conceptual differences between Peak FLOPs and Max-Achievable FLOPs (MAF), explaining why the gap between these metrics has
widened with modern ML-optimized hardware. This second installment provides a detailed methodology for measuring MAF on AMD GPUs,
including the specific environmental conditions, matrix size optimization techniques, and tools required for accurate measurement.
We present the actual MAF results for AMD Instinct MI300X and MI325X GPUs across different precision formats (FP16, BF16, and FP8)
along with their corresponding median frequencies. We also explain how software efficiency and frequency management impact MAF,
and demonstrate why boost clock capabilities remain important for latency-sensitive workloads such as LLM inference with small batch sizes.
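To make the idea of "achievable" FLOPs concrete, here is a minimal sketch that times a single large GEMM with PyTorch and converts the timing into sustained TFLOPs. The matrix shape, dtype, warm-up, and iteration counts are illustrative placeholders, not the tuned settings behind the published MAF numbers.

```python
# Rough achieved-FLOPs probe for a single GEMM shape -- illustrative only.
import time
import torch

def measure_tflops(m=8192, n=8192, k=8192, dtype=torch.bfloat16, iters=50):
    a = torch.randn(m, k, dtype=dtype, device="cuda")
    b = torch.randn(k, n, dtype=dtype, device="cuda")

    # Warm up so clocks settle and a GEMM kernel is selected before timing.
    for _ in range(10):
        torch.matmul(a, b)
    torch.cuda.synchronize()

    t0 = time.perf_counter()
    for _ in range(iters):
        torch.matmul(a, b)
    torch.cuda.synchronize()
    seconds = (time.perf_counter() - t0) / iters

    flops = 2 * m * n * k          # each multiply-accumulate counts as two FLOPs
    return flops / seconds / 1e12  # TFLOPs actually sustained for this shape

if __name__ == "__main__":
    print(f"achieved ~{measure_tflops():.0f} TFLOPs (BF16, 8192^3 GEMM)")
```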
With the scale of large language models (LLMs) reaching hundreds of billions of parameters, the way we represent data within these enormous models dramatically impacts the resources required to train and serve them (e.g., the number of GPUs needed for inference).
In our previous blogs (JAX mixed precision training; PyTorch AMP), we demonstrated how mixed precision training can accelerate the LLM training process. In this blog post we push things further and show you how quantization to even lower-precision data formats can speed up inference, saving time and memory, without sacrificing the overall performance of the model.
Quantization is a technique that reduces the precision of a model's parameters from 32-bit floating point (FP32) or 16-bit floating point (FP16) to 8-bit integer (INT8). Standard models typically use FP32 precision, but this level of precision is not always necessary for inference. By converting model weights and activations to lower-precision formats like INT8, we can achieve faster computation and lower memory usage, effectively reducing the model size by three-fourths (from 32-bit) or by half (from 16-bit), with only a slight accuracy reduction that is often outweighed by the speed gains.
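As a quick, self-contained illustration of the FP32-to-INT8 idea (separate from the workflow used later in the blog), the sketch below applies PyTorch's post-training dynamic quantization, which runs on CPU, to a toy model and compares the serialized sizes.

```python
# Post-training dynamic quantization of a toy model's linear layers to INT8.
import io
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(1024, 1024), nn.ReLU(), nn.Linear(1024, 10)).eval()

# Weights are stored in INT8; activations are quantized on the fly at inference time.
quantized = torch.ao.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

x = torch.randn(1, 1024)
with torch.inference_mode():
    print(quantized(x).shape)  # same outputs shape, lower-precision weights

def size_mb(m):
    # Serialize the state dict to measure on-disk size.
    buf = io.BytesIO()
    torch.save(m.state_dict(), buf)
    return buf.getbuffer().nbytes / 1e6

print(f"FP32: {size_mb(model):.1f} MB, INT8: {size_mb(quantized):.1f} MB")
```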
For system administrators and power users working with AMD hardware, performance optimization and efficient resource monitoring are paramount. The AMD System Management Interface command-line tool, amd-smi, addresses these needs.
In this blog we explore how to fine-tune the Robustly Optimized BERT Pretraining Approach (RoBERTa) large language model, with emphasis on PyTorch’s mixed precision capabilities. Specifically, we explore using AMD GPUs for mixed precision fine-tuning to achieve faster model training without any major impacts on accuracy.
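For a flavor of what mixed precision looks like in PyTorch, here is a minimal training-step sketch using torch.autocast and GradScaler with a stand-in linear classifier; the actual fine-tuning in the blog uses the full RoBERTa model and dataset.

```python
# Minimal mixed-precision training step with torch.autocast and GradScaler.
import torch
import torch.nn as nn

device = "cuda"  # ROCm builds of PyTorch expose AMD GPUs through the "cuda" device
model = nn.Linear(768, 2).to(device)          # stand-in for a RoBERTa classification head
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
scaler = torch.cuda.amp.GradScaler()
loss_fn = nn.CrossEntropyLoss()

features = torch.randn(8, 768, device=device)   # stand-in for pooled encoder outputs
labels = torch.randint(0, 2, (8,), device=device)

for step in range(10):
    optimizer.zero_grad(set_to_none=True)
    with torch.autocast(device_type="cuda", dtype=torch.float16):
        loss = loss_fn(model(features), labels)   # forward pass runs in FP16 where safe
    scaler.scale(loss).backward()                 # scale the loss to avoid FP16 underflow
    scaler.step(optimizer)
    scaler.update()
```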
This blog provides a comprehensive guide on measuring and comparing the performance of various algorithms in a JAX-implemented generative AI model. Leveraging the JAX Profiler and statistical analysis, this blog demonstrates how to reliably evaluate key steps and compare algorithm performance on AMD GPUs.
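As a minimal sketch of the profiling workflow, the snippet below wraps a jitted function in jax.profiler.trace and blocks on results so device time is attributed correctly; the function, shapes, and trace directory are illustrative, and the statistical comparison across algorithms is what the blog itself covers.

```python
# Profile a jitted function with the JAX Profiler; view the trace in TensorBoard.
import jax
import jax.numpy as jnp

@jax.jit
def step(x):
    # Stand-in for one step of the generative model under study.
    return jnp.sin(x) @ x.T

x = jnp.ones((2048, 2048))
step(x).block_until_ready()   # compile outside the profiled region

with jax.profiler.trace("/tmp/jax-trace"):   # writes a trace under this directory
    for _ in range(10):
        step(x).block_until_ready()          # block so device time lands in the trace
```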
In this blog, we will show how to leverage PyTorch TunableOp to accelerate models using ROCm on AMD GPUs.
We will discuss the basics of General Matrix Multiplications (GEMMs), show an example of tuning a single GEMM, and finally demonstrate real-world performance gains on an LLM (Gemma) using TunableOp.
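As a preview, the sketch below enables TunableOp through its environment variables before running an FP16 GEMM, so the first occurrence of each GEMM shape is tuned and the result is written to a CSV for reuse. The variable names follow PyTorch's TunableOp documentation; the shapes and file name are illustrative.

```python
# Enable PyTorch TunableOp so GEMMs are tuned once and the results are reused.
# The environment variables must be set before any GEMM is executed.
import os
os.environ["PYTORCH_TUNABLEOP_ENABLED"] = "1"   # turn TunableOp on
os.environ["PYTORCH_TUNABLEOP_TUNING"] = "1"    # allow tuning (0 = replay saved results only)
os.environ["PYTORCH_TUNABLEOP_FILENAME"] = "tunableop_results.csv"

import torch

a = torch.randn(4096, 4096, dtype=torch.float16, device="cuda")
b = torch.randn(4096, 4096, dtype=torch.float16, device="cuda")

# The first call with a new GEMM shape triggers tuning; later calls use the best-found kernel.
c = a @ b
torch.cuda.synchronize()
```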
Many scientific applications run on AMD-equipped computing platforms and supercomputers,
including Frontier, the first Exascale system in
the world. These applications, coming from a myriad of science domains, were ported to
run on AMD GPUs using the Heterogeneous-compute Interface for Portability (HIP)
abstraction layer. HIP enables these High-Performance Computing (HPC) facilities to
transition their CUDA codes to run on, and take advantage of, the latest AMD GPUs.
The effort involved in porting these scientific applications varies from a few hours
to a few weeks and largely depends on the complexity of the original source code.
Figure 1 shows several examples of applications that have been ported and the
corresponding porting effort.
Register pressure in GPU kernels has a tremendous impact on the overall performance
of your HPC application. Understanding and controlling register usage allows developers
to carefully design codes capable of maximizing hardware resources. This blog post
focuses on a practical demo showing how to apply the recommendations explained in this
OLCF training talk, presented on August 23, 2022. Here is the
training archive where you
can also find the slides. We focus solely on the AMD CDNA™2 architecture (MI200 series GPUs)
using ROCm 5.4.
Matrix multiplication is a fundamental aspect of linear algebra and a
ubiquitous computation within High Performance Computing (HPC) applications.
Since the introduction of AMD's CDNA architecture, Generalized Matrix Multiplication
(GEMM) computations have been hardware-accelerated through Matrix Core Processing
Units. Matrix Core accelerated GEMM kernels lie at the heart of BLAS libraries
like rocBLAS, but they can also be programmed directly by developers. Applications
that are throughput-bound by GEMM computation can achieve additional speedups by utilizing Matrix Cores.
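For context, GEMM computes D = α·A·B + β·C. The sketch below expresses exactly that through torch.addmm, which on ROCm dispatches to rocBLAS GEMM kernels that use the Matrix Cores for FP16/BF16 inputs; the shapes and scalars are illustrative.

```python
# GEMM, D = alpha * A @ B + beta * C, expressed through the BLAS path in PyTorch.
import torch

m, n, k = 4096, 4096, 4096
a = torch.randn(m, k, dtype=torch.float16, device="cuda")
b = torch.randn(k, n, dtype=torch.float16, device="cuda")
c = torch.randn(m, n, dtype=torch.float16, device="cuda")

# torch.addmm maps onto a rocBLAS GEMM call; for FP16/BF16 inputs the selected
# kernels are built around the Matrix Core (MFMA) instructions on CDNA GPUs.
d = torch.addmm(c, a, b, beta=0.5, alpha=2.0)
torch.cuda.synchronize()
print(d.shape)
```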