Posts in Applications & models
Transformer based Encoder-Decoder models for image-captioning on AMD GPUs
- 03 December 2024
Image captioning, or the GenAI-based automatic generation of concise textual descriptions of images, has immensely important real-world applications. For example, image captioning can provide visually impaired users with textual descriptions of images for improved accessibility, image captioning can add textual descriptions to products in e-commerce applications and help children map images to their textual descriptions in early childhood educational apps. Image captioning can automatically describe objects and events in security camera footage in surveillance applications and can enable robots to auto-generate textual captions for objects and events they encountered in human-robot interaction (HRI) applications, and many more applications. Image captioning is a sequence-to-sequence (seq2seq) machine learning task: a model converting a sequence from one domain (in this case, the image), to another (its textual description). In image captioning the image is partitioned into a sequence of patches. This sequence of image patches is then converted by the model to a corresponding sequence of text tokens.
Quantized 8-bit LLM training and inference using bitsandbytes on AMD GPUs
- 13 November 2024
In this blog post we will cover the bitsandbytes 8-bit representations. As you will see, the bitsandbytes 8-bit representations significantly help reduce the memory needed for fine-tuning and inferencing LLMs. There are many quantization techniques used in the field to decrease a model size, but bitsandbytes offers quantization to decrease the size of optimizer states as well. This post will help you understand the basic principles underlying the bitsandbytes 8-bit representations, explain the bitsandbytes 8-bit optimizer and LLM.int8 techniques, and show you how to implement these on AMD GPUs using ROCm.
Distributed Data Parallel Training on AMD GPU with ROCm
- 01 November 2024
With the increase in complexity and size of machine learning models, the demand for computational resources grows. Training on a single GPU can become a bottleneck for deep learning applications, especially with large datasets and models that are slow to train on a single GPU. Parallelized training addresses this challenge. Out of the various forms of parallelized training, this blog focuses on Distributed Data Parallel (DDP), a key feature in PyTorch that accelerates training across multiple GPUs and nodes.
Torchtune on AMD GPUs How-To Guide: Fine-tuning and Scaling LLMs with Multi-GPU Power
- 24 October 2024
This blog provides a thorough how-to guide on using Torchtune to fine-tune and scale large language models (LLMs) with AMD GPUs. Torchtune is a PyTorch library designed to let you easily fine-tune and experiment with LLMs. Using Torchtune’s flexibility and scalability, we show you how to fine-tune the Llama-3.1-8B model for summarization tasks using the EdinburghNLP/xsum dataset. Using LoRA(Low-Rank Adaptation), a parameter-efficient fine-tuning technique, Torchtune enables efficient training while maintaining performance across a different number of GPUs (2, 4, 6, and 8). This post also highlights how Torchtune’s distributed training capabilities allow users to scale up LLM fine-tuning on multiple GPUs to reduce training time while maintaining the quality of the trained model, demonstrating its potential and usage on modern AMD hardware using ROCm.
CTranslate2: Efficient Inference with Transformer Models on AMD GPUs
- 24 October 2024
Transformer models have revolutionized natural language processing (NLP) by delivering high-performance results in tasks like machine translation, text summarization, text generation, and speech recognition. However, deploying these models in production can be challenging due to their high computational and memory requirements. CTranslate2 addresses these challenges by providing a custom runtime that implements various optimization techniques to accelerate Transformer models during inference.
Inference with Llama 3.2 Vision LLMs on AMD GPUs Using ROCm
- 23 October 2024
Meta’s Llama models now support multimodal capabilities, expanding their functionality beyond traditional text-only applications. The Llama 3.2 models are available in a range of sizes, including medium-sized 11B and 90B multimodal models for vision-text reasoning tasks, and lightweight 1B and 3B text-only models designed for edge and mobile devices.
Speed Up Text Generation with Speculative Sampling on AMD GPUs
- 15 October 2024
As the size of transformer models grow, so does the cost of conducting inference, impacting latency and throughput. Compression methods such as quantization and distillation, as well as hardware-aware optimizations such as Flash Attention and Triton, have been proposed to cut down the computation cost at different levels. However, these models either compromise on accuracy or require major changes to the model implementation.
Multinode Fine-Tuning of Stable Diffusion XL on AMD GPUs with Hugging Face Accelerate and OCI’s Kubernetes Engine (OKE)
- 15 October 2024
As the scale and complexity of generative AI and deep learning models grow, multinode training, basically dividing a training job across several processors, has become an essential strategy to speed up training and fine-tuning processes of large generative AI models like SDXL. By distributing the training workload across multiple GPUs on multiple nodes, multinode setups can significantly accelerate the training process. In this blog post we will show you, step-by step, how to set-up and fine-tune a Stable Diffusion XL (SDXL) model in a multinode Oracle Cloud Infrastructure’s (OCI) Kubernetes Engine (OKE) on AMD GPUs using ROCm.
Enhancing vLLM Inference on AMD GPUs
- 11 October 2024
In this blog, we’ll demonstrate the latest performance enhancements in vLLM inference on AMD Instinct accelerators using ROCm. In a nutshell, vLLM optimizes GPU memory utilization, allowing more efficient handling of large language models (LLMs) within existing hardware constraints, maximizing throughput and minimizing latency. We start the blog by briefly explaining how causal language models like Llama 3 and ChatGPT generate text, motivating the need to enhance throughput and reduce latency. If you’re new to vLLM, we also recommend reading our introduction to Inferencing and serving with vLLM on AMD GPUs. ROCm 6.2 introduces support for the following vLLM features which we will use in this blog post.
Supercharging JAX with Triton Kernels on AMD GPUs
- 09 October 2024
Ready to supercharge your deep learning applications on AMD GPUs? In this blog, we’ll show you how to develop a custom fused dropout activation kernel for matrices in Triton, seamlessly call it from JAX, and benchmark its performance with ROCm. This powerful combination will take your model’s performance to the next level.
Leaner LLM Inference with INT8 Quantization on AMD GPUs using PyTorch
- 03 October 2024
With the scale of large language models (LLMs) reaching hundred of billions of parameters, the ways we represent data within these enormous models dramatically impacts the resources required to train them (e.g. the number of GPUs needed for inference). In our previous blogs (JAX mixed precision training; PyTorch AMP), we already demonstrated how mixed precision training can accelerate LLMs training process. In this blog post we will push things further and show you how quantization into an even lower precision data formats can speed up inference, saving time and memory, without sacrificing the overall performance of the model. Quantization is a technique where the precision of a model’s parameters is reduced from a 32-bit floating point (FP32) or a 16-bit floating point (FP16) to an 8-bit integer (INT8). Standard models typically use 32-bit floating-point (FP32) precision. However, this higher precision is not always necessary for inference tasks. By converting model weights and activations to lower precision formats like INT8 (8-bit integer), we can achieve faster computations and lower memory usage, effectively reducing the model size by three-fourths (from 32-bit) or half (from 16-bit) with only a slight accuracy reduction, which is often outweighed by the speed gains.
Fine-tuning Llama 3 with Axolotl using ROCm on AMD GPUs
- 23 September 2024
Large Language Models (LLMs) have revolutionized the field of natural language processing, enabling machines to understand and generate human-like language. However, these models are often trained on vast amounts of general-purpose data, which can make them less effective for specific tasks or domains. Fine-tuning involves training a pre-trained LLM on a specialized dataset to enhance its performance on specific tasks. As Andrej Karpathy analogized, this process is akin to allowing someone to practice a particular skill. Just as a person might need to practice a skill in a specific context to become proficient, an LLM needs to be fine-tuned on a specific dataset to become proficient in a particular task. For instance, an LLM can be fine-tuned for tasks such as financial forecasting, technical support, legal advising, medical diagnosis, or even instruction following. By fine-tuning an LLM, organizations can achieve better results and improve information security by limiting the exposure of sensitive data.
Inferencing and serving with vLLM on AMD GPUs
- 19 September 2024
In the rapidly evolving field of artificial intelligence, Large Language Models (LLMs) have emerged as powerful tools for understanding and generating human-like text. However, deploying these models efficiently at scale presents significant challenges. This is where vLLM comes into play. vLLM is an innovative open-source library designed to optimize the serving of LLMs using advanced techniques. Central to vLLM is PagedAttention, a novel algorithm that enhances the efficiency of the model’s attention mechanism by managing it as virtual memory. This approach optimizes GPU memory utilization, facilitating the processing of longer sequences and enabling more efficient handling of large models within existing hardware constraints. Additionally, vLLM incorporates continuous batching to maximize throughput and minimize latency. By leveraging these cutting-edge techniques, vLLM significantly improves the performance and scalability of LLM deployment, allowing organizations to harness the power of state-of-the-art AI models more effectively and economically.
Optimize GPT Training: Enabling Mixed Precision Training in JAX using ROCm on AMD GPUs
- 06 September 2024
This blog builds on the nanoGPT model we discussed in A Guide to Implementing and Training Generative Pre-trained Transformers (GPT) in JAX on AMD GPUs. Here we will show you how to incorporate mixed precision training to the JAX-implemented nanoGPT model we discussed in our previous blog.
Image Classification with BEiT, MobileNet, and EfficientNet using ROCm on AMD GPUs
- 03 September 2024
Image classification is a key task in computer vision aiming at “understanding” an entire image. The outcome of an image classifier is a label or a category for the image as a whole, unlike object recognition where the task is to detect and classify multiple objects within an image.
Benchmarking Machine Learning using ROCm and AMD GPUs: Reproducing Our MLPerf Inference Submission
- 28 August 2024
Measuring the performance of new technologies is as old as human history, and often as intriguing (consider for example that we still compare the performance of new electric vehicle motors using horsepower). In the rapidly advancing field of machine learning (ML) MLPerf was established by MLCommons on May 2nd 2018 and quickly became the golden standard of measuring the accuracy, speed, and efficiency of AI. MLPerf provides benchmarks on training, HPC and Inference performance. Companies across the industry use MLPerf submissions to evaluate the performance of various GPUs and software platforms, and make their technology adoption decisions based on these results.
Performing natural language processing tasks with LLMs on ROCm running on AMD GPUs
- 21 August 2024
In this blog you will learn how to use ROCm, running on AMD’s Instinct GPUs, for a range of popular and useful natural language processing (NLP) tasks, using different large language models (LLMs). The blog includes a simple to follow hands-on guide that shows you how to implement LLMs for core NLP applications ranging from text generation and sentiment analysis to extractive question answering (QA), and solving a math problem.
Using AMD GPUs for Enhanced Time Series Forecasting with Transformers
- 19 August 2024
Time series forecasting (TSF) is a key concept in fields such as signal processing, data science, and machine learning (ML). TSF involves predicting future behavior of a system by analyzing its past temporal patterns, using historical data to forecast future data points. Classical approaches to TSF relied on a variety of statistical methods. Recently, machine learning techniques have been increasingly used for TSF, generating discussions within the community about whether these modern approaches outperform the classical statistical ones (see: Are Transformers Effective for Time Series Forecasting? and Yes, Transformers are Effective for Time Series Forecasting (+ Autoformer)).
Inferencing with Grok-1 on AMD GPUs
- 09 August 2024
We demonstrate that the massive Grok-1 model from xAI can run seamlessly on the AMD MI300X GPU accelerator by leveraging the ROCm software platform.
Optimizing RoBERTa: Fine-Tuning with Mixed Precision on AMD
- 29 July 2024
In this blog we explore how to fine-tune the Robustly Optimized BERT Pretraining Approach (RoBERTa) large language model, with emphasis on PyTorch’s mixed precision capabilities. Specifically, we explore using AMD GPUs for mixed precision fine-tuning to achieve faster model training without any major impacts on accuracy.
Graph analytics on AMD GPUs using Gunrock
- 29 July 2024
Graphs and graph analytics are related concepts that can help us understand complex data and relationships. In this context, a graph is a mathematical model that represents entities (called nodes or vertices) and their connections (called edges or links). And graph analytics is a form of data analysis that uses graph structures and algorithms to reveal insights from the data.
Using statistical methods to reliably compare algorithm performance in large generative AI models with JAX Profiler on AMD GPUs
- 22 July 2024
This blog provides a comprehensive guide on measuring and comparing the performance of various algorithms in a JAX-implemented generative AI model. Leveraging the JAX Profiler and statistical analysis, this blog demonstrates how to reliably evaluate key steps and compare algorithm performance on AMD GPUs.
DBRX Instruct on AMD GPUs
- 11 July 2024
In this blog, we showcase DBRX Instruct, a mixture-of-experts large language model developed by Databricks, on a ROCm-capable system with AMD GPUs.
Accelerate PyTorch Models using torch.compile on AMD GPUs with ROCm
- 11 July 2024
PyTorch 2.0 introduces torch.compile()
, a tool to vastly accelerate PyTorch code and models. By converting PyTorch code into highly optimized kernels, torch.compile
delivers substantial performance improvements with minimal changes to the existing codebase. This feature allows for precise optimization of individual functions, entire modules, and complex training loops, providing a versatile and powerful tool for enhancing computational efficiency.
Accelerating models on ROCm using PyTorch TunableOp
- 03 July 2024
In this blog, we will show how to leverage PyTorch TunableOp to accelerate models using ROCm on AMD GPUs. We will discuss the basics of General Matrix Multiplications (GEMMs), show an example of tuning a single GEMM, and finally, demonstrate real-world performance gains on an LLM (gemma) using TunableOp.
A Guide to Implementing and Training Generative Pre-trained Transformers (GPT) in JAX on AMD GPUs
- 02 July 2024
2 July, 2024 by Douglas Jia.
Mamba on AMD GPUs with ROCm
- 28 June 2024
28, Jun 2024 by Sean Song, Jassani Adeem, Moskvichev Arseny.
Fine-tuning and Testing Cutting-Edge Speech Models using ROCm on AMD GPUs
- 27 June 2024
AI Voice agents, or voice bots, are designed to communicate with people using a spoken language. Voice bots are commonly deployed in customer service and personal assistant applications, and have the potential to enter and revolutionize almost every aspect of people’s interaction with technology that can benefit from the use of voice. Automatic Speech Recognition (ASR), the technology that processes human speech into text, is essential for the creation of AI Voice agents. In this blog post we will provide you with a hands-on introduction to the deployment of three machine learning ASR models, using ROCm on AMD GPUs.
Unveiling performance insights with PyTorch Profiler on an AMD GPU
- 29 May 2024
29 May, 2024 by Phillip Dang.
Panoptic segmentation and instance segmentation with Detectron2 on AMD GPUs
- 23 May 2024
23, May 2024 by Vara Lakshmi Bayanagari.
Accelerating Large Language Models with Flash Attention on AMD GPUs
- 15 May 2024
15, May 2024 by Clint Greene.
Step-by-Step Guide to Use OpenLLM on AMD GPUs
- 01 May 2024
OpenLLM is an open-source platform designed to facilitate the deployment and utilization of large language models (LLMs), supporting a wide range of models for diverse applications, whether in cloud environments or on-premises. In this tutorial, we will guide you through the process of starting an LLM server using OpenLLM, enabling interaction with the server from your local machine, with special emphasis on leveraging the capabilities of AMD GPUs.
Training a Neural Collaborative Filtering (NCF) Recommender on an AMD GPU
- 30 April 2024
30, Apr 2024 by Vara Lakshmi Bayanagari.
Multimodal (Visual and Language) understanding with LLaVA-NeXT
- 26 April 2024
26, Apr 2024 by Phillip Dang.
Unlocking Vision-Text Dual-Encoding: Multi-GPU Training of a CLIP-Like Model
- 24 April 2024
24 Apr, 2024 by Sean Song.
Transforming Words into Motion: A Guide to Video Generation with AMD GPU
- 24 April 2024
24 Apr, 2024 by Douglas Jia.
Interacting with Contrastive Language-Image Pre-Training (CLIP) model on AMD GPU
- 16 April 2024
16, Apr 2024 by Sean Song.
Instruction fine-tuning of StarCoder with PEFT on multiple AMD GPUs
- 16 April 2024
16 Apr, 2024 by Douglas Jia.
Enhancing LLM Accessibility: A Deep Dive into QLoRA Through Fine-tuning Llama Model on a single AMD GPU
- 15 April 2024
15, Apr 2024 by Sean Song.
Enhancing LLM Accessibility: A Deep Dive into QLoRA Through Fine-tuning Llama 2 on a single AMD GPU
- 15 April 2024
15, Apr 2024 by Sean Song.
GPU Unleashed: Training Reinforcement Learning Agents with Stable Baselines3 on an AMD GPU in Gymnasium Environment
- 11 April 2024
11 Apr, 2024 by Douglas Jia.
Using the ChatGLM-6B bilingual language model with AMD GPUs
- 04 April 2024
4, Apr 2024 by Phillip Dang.
Total body segmentation using MONAI Deploy on an AMD GPU
- 04 April 2024
4, Apr 2024 by Vara Lakshmi Bayanagari.
Building semantic search with SentenceTransformers on AMD
- 04 April 2024
4 Apr, 2024 by Fabricio Flores.
Scale AI applications with Ray
- 01 April 2024
1, Apr 2024 by Vicky Tsang<vicktsan>, {hoverxref}Logan Grado, {hoverxref}
Eliot Li
Automatic mixed precision in PyTorch using AMD GPUs
- 29 March 2024
As models increase in size, the time and memory needed to train them–and consequently, the cost–also increases. Therefore, any measures we take to reduce training time and memory usage can be highly beneficial. This is where Automatic Mixed Precision (AMP) comes in.
Large language model inference optimizations on AMD GPUs
- 15 March 2024
15, Mar 2024 by Seungrok Jung.
Efficient image generation with Stable Diffusion models and ONNX Runtime using AMD GPUs
- 23 February 2024
23 Feb, 2024 by Douglas Jia.
Simplifying deep learning: A guide to PyTorch Lightning
- 08 February 2024
8, Feb 2024 by Phillip Dang.
Two-dimensional images to three-dimensional scene mapping using NeRF on an AMD GPU
- 07 February 2024
7, Feb 2024 by Vara Lakshmi Bayanagari.
Using LoRA for efficient fine-tuning: Fundamental principles
- 05 February 2024
5, Feb 2024 by Sean Song.
Fine-tune Llama model with LoRA: Customizing a large language model for question-answering
- 01 February 2024
1, Feb 2024 by Sean Song.
Fine-tune Llama 2 with LoRA: Customizing a large language model for question-answering
- 01 February 2024
1, Feb 2024 by Sean Song.
Pre-training BERT using Hugging Face & TensorFlow on an AMD GPU
- 29 January 2024
29, Jan 2024 by Vara Lakshmi Bayanagari.
Pre-training BERT using Hugging Face & PyTorch on an AMD GPU
- 26 January 2024
26, Jan 2024 by Vara Lakshmi Bayanagari.
Accelerating XGBoost with Dask using multiple AMD GPUs
- 26 January 2024
26 Jan, 2024 by Clint Greene.
Pre-training a large language model with Megatron-DeepSpeed on multiple AMD GPUs
- 24 January 2024
24 Jan, 2024 by Douglas Jia.
Efficient image generation with Stable Diffusion models and AITemplate using AMD GPUs
- 24 January 2024
24 Jan, 2024 by Douglas Jia.
Efficient deployment of large language models with Text Generation Inference on AMD GPUs
- 24 January 2024
24 Jan, 2024 by Douglas Jia.
Jacobi Solver with HIP and OpenMP offloading
- 15 September 2023
15 Sept, 2023 by Asitav Mishra, Rajat Arora, Justin Chang.
Finite difference method - Laplacian part 4
- 18 July 2023
18 Jul, 2023 by Justin Chang, Thomas Gibson, Sean Miller.
Finite difference method - Laplacian part 3
- 11 May 2023
11 May, 2023 by Justin Chang, Rajat Arora, Thomas Gibson, Sean Miller, Ossian O’Reilly.
Finite difference method - Laplacian part 2
- 04 January 2023
4 Jan, 2023 by Justin Chang, Rajat Arora, Thomas Gibson, Sean Miller, Ossian O’Reilly.
Finite difference method - Laplacian part 1
- 14 November 2022
14 Nov, 2022 by Justin Chang, Rajat Arora, Thomas Gibson, Sean Miller, Ossian O’Reilly.