Posts in English

Distributed Data Parallel training on AMD GPU with ROCm

As machine learning models grow in size and complexity, so does their demand for computational resources. Training on a single GPU can become a bottleneck for deep learning applications, especially with large datasets and models that are slow to train. Parallelized training addresses this challenge. Among the various forms of parallelized training, this blog focuses on Distributed Data Parallel (DDP), a key feature in PyTorch that accelerates training across multiple GPUs and nodes.
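
Below is a minimal, illustrative DDP sketch (assuming a ROCm build of PyTorch, where the familiar "nccl" backend name maps to RCCL; the toy model and launch command are hypothetical):

```python
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    dist.init_process_group("nccl")           # RCCL on ROCm uses the same backend name
    local_rank = dist.get_rank() % torch.cuda.device_count()
    torch.cuda.set_device(local_rank)

    model = DDP(torch.nn.Linear(10, 10).to(local_rank), device_ids=[local_rank])
    opt = torch.optim.SGD(model.parameters(), lr=1e-3)

    x = torch.randn(32, 10, device=local_rank)
    loss = model(x).sum()
    loss.backward()                            # gradients are all-reduced across ranks here
    opt.step()
    dist.destroy_process_group()

if __name__ == "__main__":
    main()  # launch with e.g.: torchrun --nproc_per_node=<num_gpus> train.py
```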

Read more ...


Torchtune on AMD GPUs How-To Guide: Fine-tuning and Scaling LLMs with Multi-GPU Power

This blog provides a thorough how-to guide on using Torchtune to fine-tune and scale large language models (LLMs) with AMD GPUs. Torchtune is a PyTorch library designed to let you easily fine-tune and experiment with LLMs. Using Torchtune’s flexibility and scalability, we show you how to fine-tune the Llama-3.1-8B model for summarization tasks using the EdinburghNLP/xsum dataset. With LoRA (Low-Rank Adaptation), a parameter-efficient fine-tuning technique, Torchtune enables efficient training while maintaining performance across different GPU counts (2, 4, 6, and 8). This post also highlights how Torchtune’s distributed training capabilities allow users to scale up LLM fine-tuning on multiple GPUs to reduce training time while maintaining the quality of the trained model, demonstrating its potential and usage on modern AMD hardware using ROCm.

Read more ...


CTranslate2: Efficient Inference with Transformer Models on AMD GPUs

Transformer models have revolutionized natural language processing (NLP) by delivering high-performance results in tasks like machine translation, text summarization, text generation, and speech recognition. However, deploying these models in production can be challenging due to their high computational and memory requirements. CTranslate2 addresses these challenges by providing a custom runtime that implements various optimization techniques to accelerate Transformer models during inference.
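
As a quick, hedged illustration of the runtime's Python API (the model directory and tokenizer file are placeholders; the model is assumed to have been converted beforehand with one of CTranslate2's converter tools):

```python
import ctranslate2
import sentencepiece as spm

# Assumes a Transformer model already converted to the CTranslate2 format.
translator = ctranslate2.Translator("ct2_model_dir", device="cuda")  # GPU execution
sp = spm.SentencePieceProcessor("ct2_model_dir/sentencepiece.model")

tokens = sp.encode("Hello, world!", out_type=str)
results = translator.translate_batch([tokens])
print(sp.decode(results[0].hypotheses[0]))
```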

Read more ...


Inference with Llama 3.2 Vision LLMs on AMD GPUs Using ROCm

Meta’s Llama models now support multimodal capabilities, expanding their functionality beyond traditional text-only applications. The Llama 3.2 models are available in a range of sizes, including medium-sized 11B and 90B multimodal models for vision-text reasoning tasks, and lightweight 1B and 3B text-only models designed for edge and mobile devices.

Read more ...


Speed Up Text Generation with Speculative Sampling on AMD GPUs

As the size of transformer models grows, so does the cost of conducting inference, impacting latency and throughput. Compression methods such as quantization and distillation, as well as hardware-aware optimizations such as Flash Attention and Triton, have been proposed to cut down the computation cost at different levels. However, these methods either compromise on accuracy or require major changes to the model implementation.
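
One accessible way to experiment with the idea is Hugging Face Transformers' assisted generation, which pairs a small draft model with the target model (the model names below are just examples, and this is a sketch rather than the blog's exact setup):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("facebook/opt-1.3b")
target = AutoModelForCausalLM.from_pretrained(
    "facebook/opt-1.3b", torch_dtype=torch.float16).to("cuda")
draft = AutoModelForCausalLM.from_pretrained(
    "facebook/opt-125m", torch_dtype=torch.float16).to("cuda")

inputs = tokenizer("Speculative sampling speeds up decoding by",
                   return_tensors="pt").to("cuda")
# The draft model proposes several tokens; the target model verifies them in one pass.
outputs = target.generate(**inputs, assistant_model=draft, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```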

Read more ...


Multinode Fine-Tuning of Stable Diffusion XL on AMD GPUs with Hugging Face Accelerate and OCI’s Kubernetes Engine (OKE)

As the scale and complexity of generative AI and deep learning models grow, multinode training, which divides a training job across several processors, has become an essential strategy for speeding up the training and fine-tuning of large generative AI models like SDXL. By distributing the training workload across multiple GPUs on multiple nodes, multinode setups can significantly accelerate the training process. In this blog post we will show you, step by step, how to set up and fine-tune a Stable Diffusion XL (SDXL) model on a multinode Oracle Cloud Infrastructure (OCI) Kubernetes Engine (OKE) cluster on AMD GPUs using ROCm.
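
For a feel of the training-loop side, here is a minimal Hugging Face Accelerate sketch (the model, optimizer, dataloader, and loss are hypothetical placeholders; the multinode topology itself comes from Accelerate's launch configuration, not the code):

```python
import torch
from accelerate import Accelerator

accelerator = Accelerator()  # reads the process/node layout from the launch environment

model = torch.nn.Linear(1024, 1024)          # placeholder model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
dataloader = torch.utils.data.DataLoader(torch.randn(256, 1024), batch_size=32)

model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)

for batch in dataloader:
    optimizer.zero_grad()
    loss = model(batch).pow(2).mean()        # placeholder loss
    accelerator.backward(loss)               # gradient sync across GPUs and nodes
    optimizer.step()
```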

Read more ...


Enhancing vLLM Inference on AMD GPUs

11 October, 2024 by Clint Greene.

Read more ...


Supercharging JAX with Triton Kernels on AMD GPUs

Ready to supercharge your deep learning applications on AMD GPUs? In this blog, we’ll show you how to develop a custom fused dropout activation kernel for matrices in Triton, seamlessly call it from JAX, and benchmark its performance with ROCm. This powerful combination will take your model’s performance to the next level.
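
As a flavor of the workflow, here is a hedged sketch of a small fused dropout + ReLU Triton kernel invoked from JAX through the jax-triton bridge (argument ordering follows jax-triton's convention that output pointers come after the inputs; treat the exact call as illustrative, not the blog's kernel):

```python
import jax
import jax.numpy as jnp
import jax_triton as jt
import triton
import triton.language as tl

@triton.jit
def dropout_relu_kernel(x_ptr, p, seed, n, out_ptr, BLOCK: tl.constexpr):
    offs = tl.program_id(0) * BLOCK + tl.arange(0, BLOCK)
    mask = offs < n
    x = tl.load(x_ptr + offs, mask=mask)
    keep = tl.rand(seed, offs) > p                            # dropout mask, fused in
    y = tl.where(keep, tl.maximum(x, 0.0) / (1.0 - p), 0.0)   # ReLU + rescale, fused
    tl.store(out_ptr + offs, y, mask=mask)

def dropout_relu(x, p=0.1, seed=0):
    n = x.size
    return jt.triton_call(
        x, p, seed, n,
        kernel=dropout_relu_kernel,
        out_shape=jax.ShapeDtypeStruct(x.shape, x.dtype),
        grid=(triton.cdiv(n, 1024),),
        BLOCK=1024,
    )

print(dropout_relu(jnp.ones((8, 1024))).shape)
```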

Read more ...


Leaner LLM Inference with INT8 Quantization on AMD GPUs using PyTorch

With the scale of large language models (LLMs) reaching hundreds of billions of parameters, the way we represent data within these enormous models dramatically impacts the resources required to train and serve them (for example, the number of GPUs needed for inference). In our previous blogs (JAX mixed precision training; PyTorch AMP), we demonstrated how mixed precision training can accelerate the LLM training process. In this blog post we push things further and show you how quantization into an even lower-precision data format can speed up inference, saving time and memory, without sacrificing the overall performance of the model. Quantization is a technique that reduces the precision of a model’s parameters from 32-bit floating point (FP32) or 16-bit floating point (FP16) to 8-bit integer (INT8). While standard models typically use FP32 precision, this higher precision is not always necessary for inference tasks. By converting model weights and activations to a lower-precision format like INT8, we can achieve faster computations and lower memory usage, reducing the model size by three-fourths (from FP32) or by half (from FP16), with only a slight accuracy reduction that is often outweighed by the speed gains.
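
As one simple, hedged illustration of the idea (PyTorch's dynamic quantization, which stores Linear weights as INT8 and dequantizes on the fly; the blog's exact workflow may differ):

```python
import torch

model = torch.nn.Sequential(
    torch.nn.Linear(4096, 4096),
    torch.nn.ReLU(),
    torch.nn.Linear(4096, 4096),
)

# Replace Linear layers with dynamically quantized versions (INT8 weights).
qmodel = torch.ao.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)

x = torch.randn(1, 4096)
print(qmodel(x).shape)   # same interface, smaller weights, faster INT8 matmuls
```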

Read more ...


Fine-tuning Llama 3 with Axolotl using ROCm on AMD GPUs

Large Language Models (LLMs) have revolutionized the field of natural language processing, enabling machines to understand and generate human-like language. However, these models are often trained on vast amounts of general-purpose data, which can make them less effective for specific tasks or domains. Fine-tuning involves training a pre-trained LLM on a specialized dataset to enhance its performance on specific tasks. As Andrej Karpathy analogized, this process is akin to allowing someone to practice a particular skill. Just as a person might need to practice a skill in a specific context to become proficient, an LLM needs to be fine-tuned on a specific dataset to become proficient in a particular task. For instance, an LLM can be fine-tuned for tasks such as financial forecasting, technical support, legal advising, medical diagnosis, or even instruction following. By fine-tuning an LLM, organizations can achieve better results and improve information security by limiting the exposure of sensitive data.

Read more ...


Inferencing and serving with vLLM on AMD GPUs

19 September, 2024 by Clint Greene.

Read more ...


Getting to Know Your GPU: A Deep Dive into AMD SMI

17 Sep, 2024 by Matt Elliott

Read more ...


Introducing the AMD ROCm™ Offline Installer Creator: Simplifying Deployment for AI and HPC


Read more ...


Optimize GPT Training: Enabling Mixed Precision Training in JAX using ROCm on AMD GPUs

This blog builds on the nanoGPT model we discussed in A Guide to Implementing and Training Generative Pre-trained Transformers (GPT) in JAX on AMD GPUs. Here we will show you how to incorporate mixed precision training into the JAX-implemented nanoGPT model from that blog.
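
The core idea, in a minimal hand-rolled sketch (keep master parameters in float32 and run the heavy compute in bfloat16; the names here are illustrative, not the blog's code):

```python
import jax
import jax.numpy as jnp

def forward(params, x):
    # Cast the float32 master params and inputs down to bfloat16 for the matmul...
    p16 = jax.tree_util.tree_map(lambda p: p.astype(jnp.bfloat16), params)
    h = x.astype(jnp.bfloat16) @ p16["w"] + p16["b"]
    # ...then return to float32 so the loss and gradients accumulate precisely.
    return h.astype(jnp.float32)

params = {"w": jnp.ones((512, 512)), "b": jnp.zeros((512,))}
x = jnp.ones((8, 512))
print(forward(params, x).dtype)  # float32
```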

Read more ...


Image Classification with BEiT, MobileNet, and EfficientNet using ROCm on AMD GPUs

Image classification is a key task in computer vision that aims at “understanding” an entire image. The outcome of an image classifier is a label or a category for the image as a whole, unlike object recognition, where the task is to detect and classify multiple objects within an image.
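
For instance, with the Transformers pipeline API a single-label prediction can look like this (the checkpoint and image path are example placeholders):

```python
from transformers import pipeline

classifier = pipeline("image-classification", model="microsoft/beit-base-patch16-224")
predictions = classifier("cat.jpg")   # path or URL to an image
print(predictions[0])                 # e.g. {'label': 'tabby, tabby cat', 'score': 0.9...}
```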

Read more ...


Seismic stencil codes - part 3

12 Aug, 2024 by Justin Chang and Ossian O’Reilly.

Read more ...


Seismic stencil codes - part 2

12 Aug, 2024 by Justin Chang and Ossian O’Reilly.

Read more ...


Seismic stencil codes - part 1

12 Aug, 2024 by Justin Chang and Ossian O’Reilly.

Read more ...


Benchmarking Machine Learning using ROCm and AMD GPUs: Reproducing Our MLPerf Inference Submission

Measuring the performance of new technologies is as old as human history, and often as intriguing (consider, for example, that we still compare the performance of new electric vehicle motors using horsepower). In the rapidly advancing field of machine learning (ML), MLPerf, established on May 2nd, 2018 and now stewarded by MLCommons, quickly became the gold standard for measuring the accuracy, speed, and efficiency of AI. MLPerf provides benchmarks for training, HPC, and inference performance. Companies across the industry use MLPerf submissions to evaluate the performance of various GPUs and software platforms, and make their technology adoption decisions based on these results.

Read more ...


Performing natural language processing tasks with LLMs on ROCm running on AMD GPUs

In this blog you will learn how to use ROCm, running on AMD’s Instinct GPUs, for a range of popular and useful natural language processing (NLP) tasks, using different large language models (LLMs). The blog includes a simple-to-follow, hands-on guide that shows you how to implement LLMs for core NLP applications, ranging from text generation and sentiment analysis to extractive question answering (QA) and solving a math problem.
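
To give the flavor of such tasks, here is a hedged sketch using the Transformers pipeline API (the checkpoints are illustrative defaults, not necessarily the ones used in the blog):

```python
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")
sentiment = pipeline("sentiment-analysis")
qa = pipeline("question-answering")

print(generator("ROCm makes it easy to", max_new_tokens=20)[0]["generated_text"])
print(sentiment("I love fast inference!")[0])
print(qa(question="What runs on AMD GPUs?", context="ROCm runs on AMD GPUs."))
```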

Read more ...


Using AMD GPUs for Enhanced Time Series Forecasting with Transformers

Time series forecasting (TSF) is a key concept in fields such as signal processing, data science, and machine learning (ML). TSF involves predicting the future behavior of a system by analyzing its past temporal patterns, using historical data to forecast future data points. Classical approaches to TSF relied on a variety of statistical methods. Recently, machine learning techniques have been increasingly used for TSF, generating discussions within the community about whether these modern approaches outperform the classical statistical ones (see: Are Transformers Effective for Time Series Forecasting? and Yes, Transformers are Effective for Time Series Forecasting (+ Autoformer)).

Read more ...


Inferencing with Grok-1 on AMD GPUs

We demonstrate that the massive Grok-1 model from xAI can run seamlessly on the AMD MI300X GPU accelerator by leveraging the ROCm software platform.

Read more ...


Optimizing RoBERTa: Fine-Tuning with Mixed Precision on AMD

In this blog we explore how to fine-tune the Robustly Optimized BERT Pretraining Approach (RoBERTa) large language model, with emphasis on PyTorch’s mixed precision capabilities. Specifically, we explore using AMD GPUs for mixed precision fine-tuning to achieve faster model training without any major impacts on accuracy.
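
If you use the Trainer API, enabling mixed precision can be as small as one flag; a hedged sketch (the train_dataset object is a hypothetical placeholder):

```python
from transformers import (AutoModelForSequenceClassification, Trainer,
                          TrainingArguments)

model = AutoModelForSequenceClassification.from_pretrained("roberta-base", num_labels=2)
args = TrainingArguments(
    output_dir="roberta-mixed-precision",
    fp16=True,                      # run forward/backward in FP16 where safe
    per_device_train_batch_size=16,
    num_train_epochs=1,
)
trainer = Trainer(model=model, args=args, train_dataset=train_dataset)  # placeholder dataset
trainer.train()
```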

Read more ...


Graph analytics on AMD GPUs using Gunrock

Graphs and graph analytics are related concepts that can help us understand complex data and relationships. In this context, a graph is a mathematical model that represents entities (called nodes or vertices) and their connections (called edges or links). And graph analytics is a form of data analysis that uses graph structures and algorithms to reveal insights from the data.

Read more ...


Using statistical methods to reliably compare algorithm performance in large generative AI models with JAX Profiler on AMD GPUs

This blog provides a comprehensive guide on measuring and comparing the performance of various algorithms in a JAX-implemented generative AI model. Leveraging the JAX Profiler and statistical analysis, this blog demonstrates how to reliably evaluate key steps and compare algorithm performance on AMD GPUs.
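
A minimal example of collecting a trace that you can then inspect, plus wall-clock samples for statistical comparison, might look like this (assuming a standard JAX install; the log directory is arbitrary):

```python
import time
import jax
import jax.numpy as jnp

x = jnp.ones((4096, 4096))

# Capture a profiler trace, viewable in TensorBoard's profile plugin.
with jax.profiler.trace("/tmp/jax-trace"):
    (x @ x).block_until_ready()

# For statistical comparisons, collect many timing samples per algorithm.
samples = []
for _ in range(20):
    t0 = time.perf_counter()
    (x @ x).block_until_ready()   # block so we measure the actual device work
    samples.append(time.perf_counter() - t0)
print(f"median: {sorted(samples)[len(samples) // 2] * 1e3:.2f} ms")
```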

Read more ...


DBRX Instruct on AMD GPUs

In this blog, we showcase DBRX Instruct, a mixture-of-experts large language model developed by Databricks, on a ROCm-capable system with AMD GPUs.

Read more ...


Accelerate PyTorch Models using torch.compile on AMD GPUs with ROCm

PyTorch 2.0 introduces torch.compile(), a tool to vastly accelerate PyTorch code and models. By converting PyTorch code into highly optimized kernels, torch.compile delivers substantial performance improvements with minimal changes to the existing codebase. This feature allows for precise optimization of individual functions, entire modules, and complex training loops, providing a versatile and powerful tool for enhancing computational efficiency.
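
Usage is a one-line wrap; a small sanity-check sketch:

```python
import torch

def fn(x):
    return torch.sin(x) ** 2 + torch.cos(x) ** 2

compiled_fn = torch.compile(fn)     # compiled to optimized kernels on first call

x = torch.randn(8192, device="cuda")
print(torch.allclose(compiled_fn(x), fn(x), atol=1e-6))  # same results, faster kernels
```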

Read more ...


Accelerating models on ROCm using PyTorch TunableOp

In this blog, we will show how to leverage PyTorch TunableOp to accelerate models using ROCm on AMD GPUs. We will discuss the basics of General Matrix Multiplications (GEMMs), show an example of tuning a single GEMM, and finally, demonstrate real-world performance gains on an LLM (gemma) using TunableOp.
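
TunableOp is driven by environment variables; a hedged sketch of switching it on (variable names as documented for PyTorch's TunableOp; set them before initializing torch):

```python
import os

# Enable TunableOp and persist the best-found GEMM solutions to a CSV file.
os.environ["PYTORCH_TUNABLEOP_ENABLED"] = "1"
os.environ["PYTORCH_TUNABLEOP_FILENAME"] = "tunableop_results.csv"

import torch

a = torch.randn(4096, 4096, device="cuda", dtype=torch.float16)
b = torch.randn(4096, 4096, device="cuda", dtype=torch.float16)
c = a @ b   # this GEMM shape is tuned on first encounter, then the best kernel is reused
```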

Read more ...


A Guide to Implementing and Training Generative Pre-trained Transformers (GPT) in JAX on AMD GPUs

2 July, 2024 by Douglas Jia.

Read more ...


Mamba on AMD GPUs with ROCm

28, Jun 2024 by Sean Song, Jassani Adeem, Moskvichev Arseny.

Read more ...


Deep Learning Recommendation Models on AMD GPUs

28, June 2024 by Phillip Dang.

Read more ...


Fine-tuning and Testing Cutting-Edge Speech Models using ROCm on AMD GPUs

AI voice agents, or voice bots, are designed to communicate with people using spoken language. Voice bots are commonly deployed in customer service and personal assistant applications, and have the potential to enter and revolutionize almost every aspect of people’s interaction with technology that can benefit from the use of voice. Automatic Speech Recognition (ASR), the technology that processes human speech into text, is essential for the creation of AI voice agents. In this blog post we will provide you with a hands-on introduction to the deployment of three machine learning ASR models, using ROCm on AMD GPUs.

Read more ...


TensorFlow Profiler in practice: Optimizing TensorFlow models on AMD GPUs

TensorFlow Profiler consists of a set of tools designed to measure resource utilization and performance during the execution of TensorFlow models. It offers insights into how a model interacts with hardware resources, including execution time and memory usage. TensorFlow Profiler helps in pinpointing performance bottlenecks, allowing us to fine-tune the execution of models for improved efficiency and faster outcomes, which can be crucial in scenarios where near-real-time predictions are required.
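
Collecting a trace programmatically is brief; a minimal sketch (the log directory is arbitrary, and the trace is viewable afterwards in TensorBoard's Profile tab):

```python
import tensorflow as tf

tf.profiler.experimental.start("logs/profile")   # begin collecting a trace

x = tf.random.normal((2048, 2048))
y = tf.matmul(x, x)                              # the work we want profiled

tf.profiler.experimental.stop()                  # write the trace to disk
```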

Read more ...


Stone Ridge Expands Reservoir Simulation Options with AMD Instinct™ Accelerators

Stone Ridge Technology (SRT) pioneered the use of GPUs for high-performance reservoir simulation nearly a decade ago with ECHELON, its flagship software product. ECHELON, the first of its kind, engineered from the outset to harness the full potential of massively parallel GPUs, stands apart in the industry for its power, efficiency, and accuracy. Now, ECHELON has added support for AMD Instinct™ accelerators into its simulation engine, offering new flexibility and optionality to its clients.

Read more ...


Segment Anything with AMD GPUs

4 Jun, 2024 by Sean Song.

Read more ...


SmoothQuant model inference on AMD Instinct MI300X using Composable Kernel

The AMD ROCm™ Composable Kernel (CK) library provides a programming model for writing performance-critical kernels for machine learning workloads. It generates a general-purpose kernel during the compilation phase through a C++ template, enabling developers to achieve operation fusions on different data precisions.

Read more ...


Unveiling performance insights with PyTorch Profiler on an AMD GPU

29 May, 2024 by Phillip Dang.

Read more ...


Panoptic segmentation and instance segmentation with Detectron2 on AMD GPUs

23, May 2024 by Vara Lakshmi Bayanagari.

Read more ...


Siemens taps AMD Instinct™ GPUs to expand high-performance hardware options for Simcenter STAR-CCM+

Siemens recently announced that its Simcenter STAR-CCM+ multi-physics computational fluid dynamics (CFD) software now supports AMD Instinct™ GPUs for GPU-native computation. This move addresses its users’ needs for computational efficiency, reduced simulation costs and energy usage, and greater hardware choice.

Read more ...


AMD Collaboration with the University of Michigan offers High Performance Open-Source Solutions to the Bioinformatics Community

Long-read DNA sequencing technology is revolutionizing genetic diagnostics and precision medicine by helping us discover structural variants, assemble whole genomes, and study evolutionary relationships. Lower sequencing costs and high-throughput portable long-read sequencers are bringing these advances into everyday practice. Long-read sequencers from the top manufacturers, including Oxford Nanopore (ONT) and PacBio, can produce reads that are much longer than those of previous generations of sequencers. However, long reads vary in length and are significantly more error prone than short reads. Sequence alignment (on CPUs) is one of the main bottlenecks in long-read processing workflows.

Read more ...


Accelerating Large Language Models with Flash Attention on AMD GPUs

15, May 2024 by Clint Greene.

Read more ...


Reading AMD GPU ISA

For an application developer, it is often helpful to read the Instruction Set Architecture (ISA) for the GPU architecture used to perform the application's computations. Understanding the instructions of the pertinent code regions of interest can help in debugging and in achieving performance optimization of the application.

Read more ...


AMD in Action: Unveiling the Power of Application Tracing and Profiling

Rocprof is a robust tool designed to analyze and optimize the performance of HIP programs on AMD ROCm platforms, helping developers pinpoint and resolve performance bottlenecks. Rocprof provides a variety of profiling data, including performance counters, hardware traces, and runtime API/activity traces.

Read more ...


Step-by-Step Guide to Use OpenLLM on AMD GPUs

OpenLLM is an open-source platform designed to facilitate the deployment and utilization of large language models (LLMs), supporting a wide range of models for diverse applications, whether in cloud environments or on-premises. In this tutorial, we will guide you through the process of starting an LLM server using OpenLLM, enabling interaction with the server from your local machine, with special emphasis on leveraging the capabilities of AMD GPUs.

Read more ...


Inferencing with Mixtral 8x22B on AMD GPUs

1, May 2024 by Clint Greene.

Read more ...


Training a Neural Collaborative Filtering (NCF) Recommender on an AMD GPU

30, Apr 2024 by Vara Lakshmi Bayanagari.

Read more ...


Table Question-Answering with TaPas

26 Apr, 2024 by Phillip Dang.

Read more ...


Multimodal (Visual and Language) understanding with LLaVA-NeXT

26, Apr 2024 by Phillip Dang.

Read more ...


Application portability with HIP

Many scientific applications run on AMD-equipped computing platforms and supercomputers, including Frontier, the first Exascale system in the world. These applications, coming from a myriad of science domains, were ported to run on AMD GPUs using the Heterogeneous-compute Interface for Portability (HIP) abstraction layer. HIP enables these High-Performance Computing (HPC) facilities to transition their CUDA codes to run and take advantage of the latest AMD GPUs. The effort involved in porting these scientific applications varies from a few hours to a few weeks and largely depends on the complexity of the original source code. Figure 1 shows several examples of applications that have been ported and the corresponding porting effort.

Read more ...


Unlocking Vision-Text Dual-Encoding: Multi-GPU Training of a CLIP-Like Model

24 Apr, 2024 by Sean Song.

Read more ...


Transforming Words into Motion: A Guide to Video Generation with AMD GPU

24 Apr, 2024 by Douglas Jia.

Read more ...


C++17 parallel algorithms and HIPSTDPAR

The C++17 standard added the concept of parallel algorithms to the pre-existing C++ Standard Library. The parallel version of algorithms like std::transform maintains the same signature as the regular serial version, except for the addition of an extra parameter specifying the execution policy to use. This flexibility allows users who are already using the C++ Standard Library algorithms to take advantage of multi-core architectures with just minimal changes to their code.

Read more ...


Inferencing with AI2’s OLMo model on AMD GPU

17 Apr, 2024 by Douglas Jia.

Read more ...


Text Summarization with FLAN-T5

16, Apr 2024 by Phillip Dang.

Read more ...


Speech-to-Text on an AMD GPU with Whisper

16 Apr, 2024 by Clint Greene.

Read more ...


PyTorch C++ Extension on AMD GPU

16, Apr 2024 by Vara Lakshmi Bayanagari.

Read more ...


Programming AMD GPUs with Julia

Julia is a high-level, general-purpose dynamic programming language that automatically compiles to efficient native code via LLVM, and supports multiple platforms. With LLVM comes support for programming GPUs, including AMD GPUs.

Read more ...


Program Synthesis with CodeGen

16, Apr 2024 by Phillip Dang.

Read more ...


Interacting with Contrastive Language-Image Pre-Training (CLIP) model on AMD GPU

16, Apr 2024 by Sean Song.

Read more ...


Instruction fine-tuning of StarCoder with PEFT on multiple AMD GPUs

16 Apr, 2024 by Douglas Jia.

Read more ...


Affinity part 2 - System topology and controlling affinity

In Part 1 of the Affinity blog series, we looked at the importance of setting affinity for High Performance Computing (HPC) workloads. In this blog post, our goals are the following:

Read more ...


Affinity part 1 - Affinity, placement, and order

Modern hardware architectures are increasingly complex with multiple sockets, many cores in each Central Processing Unit (CPU), Graphics Processing Units (GPUs), memory controllers, Network Interface Cards (NICs), etc. Peripherals such as GPUs or memory controllers will often be local to a CPU socket. Such designs present interesting challenges in optimizing memory access times, data transfer times, etc. Depending on how the system is built, hardware components are connected, and the workload being run, it may be advantageous to use the resources of the system in a specific way. In this article, we will discuss the role of affinity, placement, and order in improving performance for High Performance Computing (HPC) workloads. A short case study is also presented to familiarize you with performance considerations on a node in the Frontier supercomputer. In a follow-up article, we also aim to equip you with the tools you need to understand your system’s hardware topology and set up affinity for your application accordingly.

Read more ...


Enhancing LLM Accessibility: A Deep Dive into QLoRA Through Fine-tuning Llama 2 on a single AMD GPU

15, Apr 2024 by Sean Song.

Read more ...


Developing Triton Kernels on AMD GPUs

15 Apr, 2024 by Clint Greene.

Read more ...


GPU Unleashed: Training Reinforcement Learning Agents with Stable Baselines3 on an AMD GPU in Gymnasium Environment

11 Apr, 2024 by Douglas Jia.

Read more ...


ResNet for image classification using AMD GPUs

9 Apr, 2024 by Logan Grado.

Read more ...


Small language models with Phi-2

8, Apr 2024 by Phillip Dang.

Read more ...


Using the ChatGLM-6B bilingual language model with AMD GPUs

4, Apr 2024 by Phillip Dang.

Read more ...


Total body segmentation using MONAI Deploy on an AMD GPU

4, Apr 2024 by Vara Lakshmi Bayanagari.

Read more ...


Retrieval Augmented Generation (RAG) using LlamaIndex

4, Apr 2024 by Clint Greene.

Read more ...


Image classification using Vision Transformer with AMD GPUs

4 Apr, 2024 by Eliot Li.

Read more ...


Building semantic search with SentenceTransformers on AMD

4 Apr, 2024 by Fabricio Flores.

Read more ...


Scale AI applications with Ray

1, Apr 2024 by Vicky Tsang, Logan Grado, Eliot Li.

Read more ...


Automatic mixed precision in PyTorch using AMD GPUs

As models increase in size, the time and memory needed to train them, and consequently the cost, also increase. Therefore, any measures we take to reduce training time and memory usage can be highly beneficial. This is where Automatic Mixed Precision (AMP) comes in.
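
The canonical PyTorch AMP training-loop pattern, as a minimal sketch (toy model and data; the autocast/GradScaler pairing is the standard API):

```python
import torch

model = torch.nn.Linear(1024, 1024).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
scaler = torch.cuda.amp.GradScaler()

for _ in range(10):
    x = torch.randn(64, 1024, device="cuda")
    optimizer.zero_grad()
    with torch.autocast(device_type="cuda", dtype=torch.float16):
        loss = model(x).float().pow(2).mean()   # forward runs in FP16 where safe
    scaler.scale(loss).backward()   # scale the loss to avoid FP16 gradient underflow
    scaler.step(optimizer)
    scaler.update()
```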

Read more ...


Large language model inference optimizations on AMD GPUs

15, Mar 2024 by Seungrok Jung.

Read more ...


Building a decoder transformer model on AMD GPU(s)

12, Mar 2024 by Phillip Dang.

Read more ...


Question-answering Chatbot with LangChain on an AMD GPU

11, Mar 2024 by Phillip Dang.

Read more ...


Music Generation With MusicGen on an AMD GPU

8, Mar 2024 by Phillip Dang.

Read more ...


Efficient image generation with Stable Diffusion models and ONNX Runtime using AMD GPUs

23 Feb, 2024 by Douglas Jia.

Read more ...


Simplifying deep learning: A guide to PyTorch Lightning

8, Feb 2024 by Phillip Dang.

Read more ...


Two-dimensional images to three-dimensional scene mapping using NeRF on an AMD GPU

7, Feb 2024 by Vara Lakshmi Bayanagari.

Read more ...


Fine-tune Llama 2 with LoRA: Customizing a large language model for question-answering

1, Feb 2024 by Sean Song.

Read more ...


Pre-training BERT using Hugging Face & TensorFlow on an AMD GPU

29, Jan 2024 by Vara Lakshmi Bayanagari.

Read more ...


Pre-training BERT using Hugging Face & PyTorch on an AMD GPU

26, Jan 2024 by Vara Lakshmi Bayanagari.

Read more ...


Accelerating XGBoost with Dask using multiple AMD GPUs

26 Jan, 2024 by Clint Greene.

Read more ...


LLM distributed supervised fine-tuning with JAX

25 Jan, 2024 by Douglas Jia.

Read more ...


Pre-training a large language model with Megatron-DeepSpeed on multiple AMD GPUs

24 Jan, 2024 by Douglas Jia.

Read more ...


Efficient image generation with Stable Diffusion models and AITemplate using AMD GPUs

24 Jan, 2024 by Douglas Jia.

Read more ...


Efficient deployment of large language models with Text Generation Inference on AMD GPUs

24 Jan, 2024 by Douglas Jia.

Read more ...


Sparse matrix vector multiplication - part 1

3 Nov, 2023 by Paul Mullowney.

Read more ...


Jacobi Solver with HIP and OpenMP offloading

15 Sept, 2023 by Asitav Mishra, Rajat Arora, Justin Chang.

Read more ...


Creating a PyTorch/TensorFlow code environment on AMD GPUs

Goal: The machine learning ecosystem is evolving quickly, and with this series of machine learning blog posts we aim to make porting your code to AMD GPUs simple.
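
Once an environment is set up, a quick way to confirm that PyTorch sees your AMD GPU is a few introspection calls (a sketch assuming a ROCm build of PyTorch):

```python
import torch

print(torch.__version__)               # ROCm wheels typically report e.g. "2.x.x+rocmX.Y"
print(torch.version.hip)               # HIP version string on a ROCm build, None otherwise
print(torch.cuda.is_available())       # True when an AMD GPU is visible to PyTorch
print(torch.cuda.get_device_name(0))   # e.g. an AMD Instinct accelerator name
```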

Read more ...


Finite difference method - Laplacian part 4

18 Jul, 2023 by Justin Chang, Thomas Gibson, Sean Miller.

Read more ...


GPU-aware MPI with ROCm

MPI is the de facto standard for inter-process communication in High-Performance Computing. MPI processes compute on their local data while extensively communicating with each other. This enables MPI programs to be executed on systems with a distributed memory space, e.g., clusters. There are different types of communication supported in MPI, including point-to-point and collective communications. Point-to-point communication is the basic communication mechanism, in which both the sending process and the receiving process take part in the communication. The sender has a buffer that holds the message and an envelope containing information that will be used by the receiver side (e.g., the message tag and the sender's rank number). The receiver uses the information in the envelope to select the specified message and stores it in its receive buffer. In collective communication, messages can be exchanged among a group of processes rather than just two of them. Collective communication provides opportunities for processes to perform one-to-many and many-to-many communications in a convenient, portable, and optimized way. Some examples of collective communications include broadcast, allgather, alltoall, and allreduce.
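
These two flavors of communication are easy to see from Python with mpi4py (a sketch of the concepts rather than the blog's GPU-aware HIP examples):

```python
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

# Point-to-point: rank 0 sends a message; rank 1 selects it by source and tag.
if rank == 0:
    comm.send("hello from rank 0", dest=1, tag=7)
elif rank == 1:
    msg = comm.recv(source=0, tag=7)
    print(f"rank 1 received: {msg}")

# Collective: every rank contributes its rank number; all get the sum back.
total = comm.allreduce(rank, op=MPI.SUM)
print(f"rank {rank}: allreduce total = {total}")

# Run with, e.g.: mpirun -n 4 python mpi_demo.py
```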

Read more ...


Register pressure in AMD CDNA™2 GPUs

Register pressure in GPU kernels has a tremendous impact on the overall performance of your HPC application. Understanding and controlling register usage allows developers to carefully design codes capable of maximizing hardware resources. The following blog post is focused on a practical demo showing how to apply the recommendations explained in this OLCF training talk presented on August 23rd, 2022. Here is the training archive where you can also find the slides. We focus solely on the AMD CDNA™2 architecture (MI200 series GPUs) using ROCm 5.4.

Read more ...


Finite difference method - Laplacian part 3

11 May, 2023 by Justin Chang, Rajat Arora, Thomas Gibson, Sean Miller, Ossian O’Reilly.

Read more ...


Introduction to profiling tools for AMD hardware

Getting a code to be functionally correct is not always enough. In many industries, it is also required that applications and their complex software stacks run as efficiently as possible to meet operational demands. This is particularly challenging as hardware continues to evolve over time, and as a result codes may require further tuning. In practice, many application developers construct benchmarks, which are carefully designed to measure the performance, such as execution time, of a particular code within an operational-like setting. In other words: a good benchmark should be representative of the real work that needs to be done. These benchmarks are useful in that they provide insight into the characteristics of the application and enable one to discover potential bottlenecks that could result in performance degradation during operational settings.

Read more ...


AMD Instinct™ MI200 GPU memory space overview

The HIP API supports a wide variety of allocation methods for host and device memory on accelerated systems. In this post, we will:

Read more ...


AMD ROCm™ installation

AMD ROCm™ is the first open-source software development platform for HPC/Hyperscale-class GPU computing. AMD ROCm™ brings the UNIX philosophy of choice, minimalism and modular software development to GPU computing. Please see the AMD Open Software Platform for GPU Compute and ROCm Informational Portal pages for more information.

Read more ...


Finite difference method - Laplacian part 2

4 Jan, 2023 by Justin Chang, Rajat Arora, Thomas Gibson, Sean Miller, Ossian O’Reilly.

Read more ...


Finite difference method - Laplacian part 1

14 Nov, 2022 by Justin Chang, Rajat Arora, Thomas Gibson, Sean Miller, Ossian O’Reilly.

Read more ...


AMD matrix cores

Matrix multiplication is a fundamental aspect of linear algebra and a ubiquitous computation within High Performance Computing (HPC) applications. Since the introduction of AMD’s CDNA architecture, Generalized Matrix Multiplication (GEMM) computations are hardware-accelerated through Matrix Core Processing Units. Matrix Core accelerated GEMM kernels lie at the heart of BLAS libraries like rocBLAS, but they can also be programmed directly by developers. Applications that are throughput bound by GEMM computation can achieve additional speedups by utilizing Matrix Cores.

Read more ...