Posts in English

Panoptic segmentation and instance segmentation with Detectron2 on AMD GPUs

This blog gives an overview of Detectron2 and demonstrates inference with the segmentation pipelines in its core library on an AMD GPU.

Read more ...


Siemens taps AMD Instinct™ GPUs to expand high-performance hardware options for Simcenter STAR-CCM+

Siemens recently announced that its Simcenter STAR-CCM+ multi-physics computational fluid dynamics (CFD) software now supports AMD Instinct™ GPUs for GPU-native computation. This move addresses its users’ needs for computational efficiency, reduced simulation costs and energy usage, and greater hardware choice.

Read more ...


AMD Collaboration with the University of Michigan offers High Performance Open-Source Solutions to the Bioinformatics Community

Long-read DNA sequencing technology is transforming genetic diagnostics and precision medicine by helping us discover structural variants, assemble whole genomes, and study evolutionary relationships. Falling sequencing costs and high-throughput portable long-read sequencers are bringing these capabilities into routine use. Long-read sequencers from the top manufacturers, including Oxford Nanopore (ONT) and PacBio, can produce reads that are much longer than those of previous generations of sequencers. However, long reads vary in length and are significantly more error prone than short reads, and sequence alignment (on CPUs) is one of the main bottlenecks in long-read processing workflows.

Read more ...


Accelerating Large Language Models with Flash Attention on AMD GPUs

In this blog post, we will guide you through the process of installing Flash Attention on AMD GPUs and provide benchmarks comparing its performance to standard SDPA in PyTorch. We will also measure end-to-end prefill latency for multiple Large Language Models (LLMs) in Hugging Face.

Read more ...


Reading AMD GPU ISA


Read more ...


AMD in Action: Unveiling the Power of Application Tracing and Profiling


Read more ...


Step-by-Step Guide to Use OpenLLM on AMD GPUs

OpenLLM is an open-source platform designed to facilitate the deployment and utilization of large language models (LLMs), supporting a wide range of models for diverse applications, whether in cloud environments or on-premises. In this tutorial, we will guide you through the process of starting an LLM server using OpenLLM, enabling interaction with the server from your local machine, with special emphasis on leveraging the capabilities of AMD GPUs.

Read more ...


Inferencing with Mixtral 8x22B on AMD GPUs


Read more ...


Training a Neural Collaborative Filtering (NCF) Recommender on an AMD GPU

Collaborative Filtering is a type of item recommendation where new items are suggested to the user based on their past interactions. Neural Collaborative Filtering (NCF) is a recommendation system that uses a neural network to model the user-item interaction function. NCF optimizes this collaborative function, essentially a user-item interaction model represented by a neural network, and ranks the recommended items for the user.
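
The idea above can be sketched in a few lines. This is a hypothetical, NumPy-only illustration (not the blog's code): user and item embeddings are concatenated and passed through a small MLP, and items are ranked by the predicted interaction score.

```python
import numpy as np

# Illustrative NCF-style scorer: an MLP over concatenated user/item embeddings
# replaces the plain dot product of classic matrix factorization.
rng = np.random.default_rng(0)
n_users, n_items, dim = 5, 10, 8

user_emb = rng.normal(size=(n_users, dim))
item_emb = rng.normal(size=(n_items, dim))
W1 = rng.normal(size=(2 * dim, 16)); b1 = np.zeros(16)
W2 = rng.normal(size=(16, 1));       b2 = np.zeros(1)

def score(u, i):
    x = np.concatenate([user_emb[u], item_emb[i]])  # user-item interaction input
    h = np.maximum(x @ W1 + b1, 0.0)                # ReLU hidden layer
    z = h @ W2 + b2
    return float(1 / (1 + np.exp(-z[0])))           # predicted interaction probability

# Rank all items for user 0 by predicted score (highest first)
ranking = sorted(range(n_items), key=lambda i: score(0, i), reverse=True)
print(ranking)
```

In practice the embeddings and MLP weights are all trained jointly on observed interactions; the sketch only shows the forward pass and ranking step.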

Read more ...


Table Question-Answering with TaPas


Read more ...


Multimodal (Visual and Language) understanding with LLaVA-NeXT


Read more ...


Application portability with HIP


Read more ...


Unlocking Vision-Text Dual-Encoding: Multi-GPU Training of a CLIP-Like Model

In this blog, we will build a vision-text dual encoder model akin to CLIP and fine-tune it with the COCO dataset on an AMD GPU with ROCm. This work is inspired by the principles of CLIP and the Hugging Face example. The idea is to train a vision encoder and a text encoder jointly to project the representation of images and their descriptions into the same embedding space, such that the text embeddings are located near the embeddings of the images they describe. The objective during training is to maximize the similarity between the embeddings of image and text pairs in the batch while minimizing the similarity of embeddings for incorrect pairs. The model achieves this by learning a multimodal embedding space, optimizing a symmetric cross-entropy loss over these similarity scores.
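
The symmetric cross-entropy objective can be sketched concisely. The following is a hypothetical, NumPy-only illustration (not the blog's training code): for a batch of paired embeddings, the correct pairs lie on the diagonal of the similarity matrix, and the loss averages the cross-entropy in both the image-to-text and text-to-image directions.

```python
import numpy as np

def clip_symmetric_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric cross-entropy over cosine-similarity logits.

    img_emb, txt_emb: (batch, dim) arrays; row i of each belongs to
    the same (image, caption) pair.
    """
    # L2-normalize so the dot product is cosine similarity
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature  # (batch, batch) similarity scores

    def log_softmax(x):
        x = x - x.max(axis=1, keepdims=True)
        return x - np.log(np.exp(x).sum(axis=1, keepdims=True))

    diag = np.arange(logits.shape[0])   # target class for row i is i
    loss_i2t = -log_softmax(logits)[diag, diag].mean()    # image -> text
    loss_t2i = -log_softmax(logits.T)[diag, diag].mean()  # text -> image
    return (loss_i2t + loss_t2i) / 2
```

A correctly aligned batch (each image embedding near its own caption embedding) yields a low loss, while mismatched pairs drive it up.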

Read more ...


Transforming Words into Motion: A Guide to Video Generation with AMD GPU

This blog introduces the advancements in text-to-video generation through enhancements to the stable diffusion model and demonstrates the process of generating videos from text prompts on an AMD GPU using Alibaba’s ModelScopeT2V model.

Read more ...


C++17 parallel algorithms and HIPSTDPAR

The C++17 standard added the concept of parallel algorithms to the pre-existing C++ Standard Library. The parallel versions of algorithms like std::transform maintain the same signature as the regular serial versions, except for the addition of an extra parameter specifying the execution policy to use. This flexibility allows users who already rely on the C++ Standard Library algorithms to take advantage of multi-core architectures with only minimal changes to their code.

Read more ...


Inferencing with AI2’s OLMo model on AMD GPU

In this blog, we will show you how to generate text using AI2’s OLMo model on an AMD GPU.

Read more ...


Text Summarization with FLAN-T5


Read more ...


Speech-to-Text on an AMD GPU with Whisper

Whisper is an advanced automatic speech recognition (ASR) system, developed by OpenAI. It employs a straightforward encoder-decoder Transformer architecture where incoming audio is divided into 30-second segments and subsequently fed into the encoder. The decoder can be prompted with special tokens to guide the model to perform tasks such as language identification, transcription, and translation.

Read more ...


PyTorch C++ Extension on AMD GPU

This blog demonstrates how to use the PyTorch C++ extension with an example and discusses its advantages over regular PyTorch modules. The experiments were carried out on AMD GPUs and ROCm 5.7.0 software. For more information about supported GPUs and operating systems, see System Requirements (Linux).

Read more ...


Programming AMD GPUs with Julia

Julia is a high-level, general-purpose dynamic programming language that automatically compiles to efficient native code via LLVM and supports multiple platforms. With LLVM comes support for programming GPUs, including AMD GPUs.

Read more ...


Program Synthesis with CodeGen

CodeGen is a family of standard transformer-based autoregressive language models for program synthesis, which the authors define as a method for generating computer programs that solve specified problems, using input-output examples or natural language descriptions.

Read more ...


Interacting with Contrastive Language-Image Pre-Training (CLIP) model on AMD GPU

Contrastive Language-Image Pre-Training (CLIP) is a multimodal deep learning model that bridges vision and natural language. It was introduced in the paper “Learning Transferable Visual Models from Natural Language Supervision” (2021) from OpenAI, and it was trained contrastively on a huge corpus (400 million) of web-scraped image-caption pairs, one of the first models to do this at such scale.

Read more ...


Instruction fine-tuning of StarCoder with PEFT on multiple AMD GPUs

In this blog, we will show you how to fine-tune the StarCoder base model on AMD GPUs with an instruction-answer pair dataset so that it can follow instructions to generate code and answer questions. We will also show you how to use parameter-efficient fine-tuning (PEFT) to minimize the computation cost for the fine-tuning process.

Read more ...


Affinity part 2 - System topology and controlling affinity

In Part 1 of the Affinity blog series, we looked at the importance of setting affinity for High Performance Computing (HPC) workloads. In this blog post, our goals are the following:

Read more ...


Affinity part 1 - Affinity, placement, and order

Modern hardware architectures are increasingly complex with multiple sockets, many cores in each Central Processing Unit (CPU), Graphical Processing Units (GPUs), memory controllers, Network Interface Cards (NICs), etc. Peripherals such as GPUs or memory controllers will often be local to a CPU socket. Such designs present interesting challenges in optimizing memory access times, data transfer times, etc. Depending on how the system is built, hardware components are connected, and the workload being run, it may be advantageous to use the resources of the system in a specific way. In this article, we will discuss the role of affinity, placement, and order in improving performance for High Performance Computing (HPC) workloads. A short case study is also presented to familiarize you with performance considerations on a node in the Frontier supercomputer. In a follow-up article, we also aim to equip you with the tools you need to understand your system’s hardware topology and set up affinity for your application accordingly.
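
As a small taste of controlling affinity programmatically, the following is a hedged sketch using only the Python standard library. Note that os.sched_getaffinity and os.sched_setaffinity are Linux-only; command-line tools such as numactl or hwloc offer the same control outside Python.

```python
import os

# 0 means "the calling process"
allowed = os.sched_getaffinity(0)
print(f"process may run on {len(allowed)} CPUs: {sorted(allowed)}")

# Pin the process to the first allowed CPU, then restore the original mask.
first = min(allowed)
os.sched_setaffinity(0, {first})
assert os.sched_getaffinity(0) == {first}
os.sched_setaffinity(0, allowed)
```

Pinning like this keeps a process close to the memory, GPU, or NIC attached to a particular socket, which is exactly the placement concern discussed above.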

Read more ...


Enhancing LLM Accessibility: A Deep Dive into QLoRA Through Fine-tuning Llama 2 on a single AMD GPU

Building on the previous blog Fine-tune Llama 2 with LoRA blog, we delve into another Parameter Efficient Fine-Tuning (PEFT) approach known as Quantized Low Rank Adaptation (QLoRA). The focus will be on leveraging QLoRA for the fine-tuning of Llama-2 7B model using a single AMD GPU with ROCm. This task, made possible through the use of QLoRA, addresses challenges related to memory and computing limitations. The exploration aims to showcase how QLoRA can be employed to enhance accessibility to open-source large language models.

Read more ...


Developing Triton Kernels on AMD GPUs

OpenAI has developed a powerful GPU-focused programming language and compiler called Triton that works seamlessly with AMD GPUs. The goal of Triton is to enable AI engineers and scientists to write high-performance GPU code with minimal expertise. Triton kernels are performant because of their blocked program representation, which allows them to be compiled into highly optimized binary code. Triton also leverages Python for kernel development, making it both familiar and accessible, and kernels can be compiled simply by applying the triton.jit Python decorator to the kernel function.

Read more ...


GPU Unleashed: Training Reinforcement Learning Agents with Stable Baselines3 on an AMD GPU in Gymnasium Environment

This blog will delve into the fundamentals of deep reinforcement learning, guiding you through a practical code example that utilizes an AMD GPU to train a Deep Q-Network (DQN) policy within the Gymnasium environment.

Read more ...


ResNet for image classification using AMD GPUs

In this blog, we demonstrate training a simple ResNet model for image classification on AMD GPUs using ROCm on the CIFAR10 dataset. Training a ResNet model on AMD GPUs is simple, requiring no additional work beyond installing ROCm and appropriate PyTorch libraries.

Read more ...


Small language models with Phi-2


Read more ...


Using the ChatGLM-6B bilingual language model with AMD GPUs

ChatGLM-6B is an open bilingual (Chinese-English) language model with 6.2 billion parameters. It’s optimized for Chinese conversation based on General Language Model (GLM) architecture. GLM is a pretraining framework that seeks to combine the strengths of autoencoder models (like BERT) and autoregressive models (like GPT). The GLM framework randomly blanks out continuous spans of tokens from the input text (autoencoding methodology) and trains the model to sequentially reconstruct the spans (autoregressive pretraining methodology).
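
The span-corruption idea described above can be illustrated in a few lines. This is a hypothetical sketch, not GLM's actual implementation: one contiguous span of tokens is blanked out (the autoencoding-style corruption), and a model would then be trained to reconstruct that span autoregressively.

```python
import random

def blank_span(tokens, span_len=2, seed=0):
    """Blank one contiguous span; return the corrupted sequence and the target span."""
    rng = random.Random(seed)
    toks = list(tokens)
    start = rng.randrange(0, len(toks) - span_len + 1)
    target = toks[start:start + span_len]  # what the model must generate
    corrupted = toks[:start] + ["[MASK]"] + toks[start + span_len:]
    return corrupted, target

sentence = "the quick brown fox jumps over the lazy dog".split()
corrupted, target = blank_span(sentence, span_len=3)
print(corrupted, "->", target)
```

GLM's real pretraining samples multiple spans with varying lengths and uses special position encodings; the sketch only shows the corruption step.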

Read more ...


Total body segmentation using MONAI Deploy on an AMD GPU

Medical Open Network for Artificial Intelligence (MONAI) is an open-source organization that provides PyTorch implementation of state-of-the-art medical imaging models, ranging from classification and segmentation to image generation. Catering to the needs of researchers, clinicians, and fellow domain contributors, MONAI’s lifecycle provides three different end-to-end workflow tools: MONAI Core, MONAI Label, and MONAI Deploy.

Read more ...


Retrieval Augmented Generation (RAG) using LlamaIndex


Read more ...


Inferencing and serving with vLLM on AMD GPUs


Read more ...


Image classification using Vision Transformer with AMD GPUs

The Vision Transformer (ViT) model was first proposed in An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. ViT is an attractive alternative to conventional Convolutional Neural Network (CNN) models due to its excellent scalability and adaptability in the field of computer vision. On the other hand, ViT can be more expensive compared to CNN for large input images as it has quadratic computation complexity with respect to input size.
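
The quadratic cost noted above is easy to quantify with back-of-the-envelope arithmetic. The sketch below assumes 16x16 patches as in the paper: ViT splits an image into fixed-size patches (tokens), self-attention over n tokens costs O(n^2), so doubling the image side quadruples the token count and grows the attention cost 16x.

```python
def num_patches(image_size, patch_size=16):
    # number of tokens: one per non-overlapping patch
    return (image_size // patch_size) ** 2

def attention_cost(image_size, patch_size=16):
    # pairwise token interactions dominate self-attention cost
    n = num_patches(image_size, patch_size)
    return n * n

for size in (224, 448):
    print(size, num_patches(size), attention_cost(size))
```

For a 224x224 input this gives 196 tokens; at 448x448 it is 784 tokens and 16x the attention cost, which is why large inputs favor CNNs.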

Read more ...


Building semantic search with SentenceTransformers on AMD

In this blog, we explain how to train a SentenceTransformers model on the Sentence Compression dataset to perform semantic search. We use the BERT base model (uncased) as the base transformer and apply Hugging Face PyTorch libraries.

Read more ...


Scale AI applications with Ray

Most machine-learning (ML) workloads today require multiple GPUs or nodes to achieve the performance or scale that applications demand. However, scaling workloads beyond a single node or single GPU is difficult and requires some expertise in distributed processing.

Read more ...


Automatic mixed precision in PyTorch using AMD GPUs

As models increase in size, the time and memory needed to train them, and consequently the cost, also increase. Therefore, any measure we take to reduce training time and memory usage can be highly beneficial. This is where Automatic Mixed Precision (AMP) comes in.

Read more ...


Large language model inference optimizations on AMD GPUs

Large language models (LLMs) have transformed natural language processing and comprehension, facilitating a multitude of AI applications in diverse fields. LLMs have various promising use cases, including AI assistants, chatbots, programming, gaming, learning, searching, and recommendation systems. These applications leverage the capabilities of LLMs to provide personalized and interactive experiences, which enhances user engagement.

Read more ...


Building a decoder transformer model on AMD GPU(s)

In this blog, we demonstrate how to run Andrej Karpathy’s beautiful PyTorch re-implementation of GPT on single and multiple AMD GPUs on a single node using PyTorch 2.0 and ROCm. We use the works of Shakespeare to train our model, then run inference to see if our model can generate Shakespeare-like text.

Read more ...


Question-answering Chatbot with LangChain on an AMD GPU

LangChain is a framework designed to harness the power of language models for building cutting-edge applications. By connecting language models to various contextual sources and providing reasoning abilities based on the given context, LangChain creates context-aware applications that can intelligently reason and respond. In this blog, we demonstrate how to use LangChain and Hugging Face to create a simple question-answering chatbot. We also demonstrate how to augment our large language model (LLM) knowledge with additional information using the Retrieval Augmented Generation (RAG) technique, then allow our bot to respond to queries based on the information contained within specified documents.

Read more ...


Music Generation With MusicGen on an AMD GPU

MusicGen is an autoregressive, transformer-based model that predicts the next segment of a piece of music based on previous segments. This is a similar approach to language models predicting the next token.

Read more ...


Efficient image generation with Stable Diffusion models and ONNX Runtime using AMD GPUs

In this blog, we show you how to use pre-trained Stable Diffusion models to generate images from text (text-to-image), transform existing visuals (image-to-image), and restore damaged pictures (inpainting) on AMD GPUs using ONNX Runtime.

Read more ...


Simplifying deep learning: A guide to PyTorch Lightning

PyTorch Lightning is a higher-level wrapper built on top of PyTorch. Its purpose is to simplify and abstract the process of training PyTorch models. It provides a structured and organized approach to machine learning (ML) tasks by abstracting away the repetitive boilerplate code, allowing you to focus more on model development and experimentation. PyTorch Lightning works out-of-the-box with AMD GPUs and ROCm.

Read more ...


Two-dimensional images to three-dimensional scene mapping using NeRF on an AMD GPU

This tutorial aims to explain the fundamentals of NeRF and its implementation in PyTorch. The code used in this tutorial is inspired by Mason McGough’s colab notebook and is implemented on an AMD GPU.

Read more ...


Using LoRA for efficient fine-tuning: Fundamental principles

Low-Rank Adaptation of Large Language Models (LoRA) addresses the challenges of fine-tuning large language models (LLMs). Models like GPT and Llama, which boast billions of parameters, are typically cost-prohibitive to fine-tune for specific tasks or domains. LoRA preserves the pre-trained model weights and incorporates trainable layers within each model block. This substantially decreases the number of parameters that need to be fine-tuned, sometimes by a factor of up to 10,000, leading to a considerable decrease in GPU memory requirements and resource demands.
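
The parameter saving is easy to see numerically. The following is a hypothetical, NumPy-only sketch (dimensions and names are illustrative): the pretrained weight W stays frozen and only a low-rank product B @ A is trained, so the trainable count drops from d_out * d_in to r * (d_in + d_out).

```python
import numpy as np

d_in, d_out, r = 1024, 1024, 8
rng = np.random.default_rng(0)

W = rng.normal(size=(d_out, d_in))     # pretrained weight: frozen
A = rng.normal(size=(r, d_in)) * 0.01  # trainable, small random init
B = np.zeros((d_out, r))               # trainable, zero init: no change at start

def lora_forward(x, scale=1.0):
    # frozen base path plus the trainable low-rank adaptation path
    return W @ x + scale * (B @ (A @ x))

full_params = W.size                   # 1,048,576
lora_params = A.size + B.size          # 16,384 -> 64x fewer at these sizes
print(f"trainable: {lora_params} vs full fine-tuning: {full_params}")
```

Because B starts at zero, the adapted model initially reproduces the pretrained model exactly, and only the small A and B matrices accumulate gradients during fine-tuning.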

Read more ...


Fine-tune Llama 2 with LoRA: Customizing a large language model for question-answering

In this blog, we show you how to fine-tune Llama 2 on an AMD GPU with ROCm. We use Low-Rank Adaptation of Large Language Models (LoRA) to overcome memory and computing limitations and make open-source large language models (LLMs) more accessible. We also show you how to fine-tune and upload models to Hugging Face.

Read more ...


Pre-training BERT using Hugging Face & TensorFlow on an AMD GPU

This blog explains an end-to-end process for pre-training the Bidirectional Encoder Representations from Transformers (BERT) base model from scratch using Hugging Face libraries with a TensorFlow backend for English corpus text (WikiText-103-raw-v1).

Read more ...


Pre-training BERT using Hugging Face & PyTorch on an AMD GPU

This blog explains an end-to-end process for pre-training the Bidirectional Encoder Representations from Transformers (BERT) base model from scratch using Hugging Face libraries with a PyTorch backend for English corpus text (WikiText-103-raw-v1).

Read more ...


Accelerating XGBoost with Dask using multiple AMD GPUs

XGBoost is an optimized library for distributed gradient boosting. It has become the leading machine learning library for solving regression and classification problems. For a deeper dive into how gradient boosting works, we recommend reading Introduction to Boosted Trees.

Read more ...


LLM distributed supervised fine-tuning with JAX

In this article, we review the process for fine-tuning a Bidirectional Encoder Representations from Transformers (BERT)-based large language model (LLM) using JAX for a text classification task. We explore techniques for parallelizing this fine-tuning procedure across multiple AMD GPUs, then evaluate our model’s performance on a holdout dataset. For this, we use a BERT-base-cased transformer model with a General Language Understanding Evaluation (GLUE) benchmark dataset on multiple AMD GPUs.

Read more ...


Pre-training a large language model with Megatron-DeepSpeed on multiple AMD GPUs

In this blog, we show you how to pre-train a GPT-3 model using the Megatron-DeepSpeed framework on multiple AMD GPUs. We also demonstrate how to perform inference on the text-generation task with your pre-trained model.

Read more ...


Efficient image generation with Stable Diffusion models and AITemplate using AMD GPUs

Stable Diffusion has emerged as a groundbreaking advancement in the field of image generation, empowering users to translate text descriptions into captivating visual output.

Read more ...


Efficient deployment of large language models with Text Generation Inference on AMD GPUs

Text Generation Inference (TGI) is a toolkit for deploying and serving Large Language Models (LLMs) with unparalleled efficiency. TGI is tailored for popular open-source LLMs, such as Llama, Falcon, StarCoder, BLOOM, GPT-NeoX, and T5. Optimizations include tensor parallelism, token streaming using Server-Sent Events (SSE), continuous batching, and optimized transformers code. It has a robust feature set that includes quantization, safetensors, watermarking (for determining if text is generated from language models), logits warper, and support for custom prompt generation and fine-tuning.

Read more ...


Sparse matrix vector multiplication - part 1


Read more ...


Jacobi Solver with HIP and OpenMP offloading


Read more ...


Creating a PyTorch/TensorFlow code environment on AMD GPUs


Read more ...


Finite difference method - Laplacian part 4


Read more ...


GPU-aware MPI with ROCm


Read more ...


Register pressure in AMD CDNA™2 GPUs


Read more ...


Finite difference method - Laplacian part 3


Read more ...


Introduction to profiling tools for AMD hardware


Read more ...


AMD Instinct™ MI200 GPU memory space overview


Read more ...


AMD ROCm™ installation


Read more ...


Finite difference method - Laplacian part 2


Read more ...


Finite difference method - Laplacian part 1


Read more ...


AMD matrix cores

Note: This blog was previously part of the AMD lab notes blog series.

Read more ...