Posted in 2024

Training Transformers and Hybrid models on AMD Instinct MI300X Accelerators

10 December 2024

This blog is contributed by Zyphra: a Palo Alto-based AI research lab and AMD Instinct Partner.

Read more ...

Transformer based Encoder-Decoder models for image-captioning on AMD GPUs

03 December 2024

Image captioning, or the GenAI-based automatic generation of concise textual descriptions of images, has immensely important real-world applications. For example, image captioning can provide visually impaired users with textual descriptions of images for improved accessibility, image captioning can add textual descriptions to products in e-commerce applications and help children map images to their textual descriptions in early childhood educational apps. Image captioning can automatically describe objects and events in security camera footage in surveillance applications and can enable robots to auto-generate textual captions for objects and events they encountered in human-robot interaction (HRI) applications, and many more applications. Image captioning is a sequence-to-sequence (seq2seq) machine learning task: a model converting a sequence from one domain (in this case, the image), to another (its textual description). In image captioning the image is partitioned into a sequence of patches. This sequence of image patches is then converted by the model to a corresponding sequence of text tokens.

Read more ...

SGLang: Fast Serving Framework for Large Language and Vision-Language Models on AMD Instinct GPUs

13 November 2024

In the rapidly evolving landscape of artificial intelligence, the ability to deploy large language models (LLMs) and vision-language models (VLMs) efficiently is crucial for real-time applications. SGLang is an open-source framework designed to meet these demands by delivering fast backend runtime, a flexible frontend language, and extensive model support for a variety of LLMs and VLMs.

Read more ...

Quantized 8-bit LLM training and inference using bitsandbytes on AMD GPUs

13 November 2024

In this blog post we will cover the bitsandbytes 8-bit representations. As you will see, the bitsandbytes 8-bit representations significantly help reduce the memory needed for fine-tuning and inferencing LLMs. There are many quantization techniques used in the field to decrease a model size, but bitsandbytes offers quantization to decrease the size of optimizer states as well. This post will help you understand the basic principles underlying the bitsandbytes 8-bit representations, explain the bitsandbytes 8-bit optimizer and LLM.int8 techniques, and show you how to implement these on AMD GPUs using ROCm.

Read more ...

Introducing AMD’s Next-Gen Fortran Compiler

13 November 2024

We are excited to share a brief preview of AMD’s Next-Gen Fortran Compiler, our new open source Fortran complier supporting OpenMP offloading. AMD’s Next-Gen Fortran Compiler is a downstream flavor of LLVM Flang, optimized for AMD GPUs. Our Next-Gen Fortran Compiler enables OpenMP offloading and offers a direct interface to ROCm and HIP. In this blog post you will:

Read more ...

Distributed Data Parallel Training on AMD GPU with ROCm

01 November 2024

With the increase in complexity and size of machine learning models, the demand for computational resources grows. Training on a single GPU can become a bottleneck for deep learning applications, especially with large datasets and models that are slow to train on a single GPU. Parallelized training addresses this challenge. Out of the various forms of parallelized training, this blog focuses on Distributed Data Parallel (DDP), a key feature in PyTorch that accelerates training across multiple GPUs and nodes.

Read more ...

Torchtune on AMD GPUs How-To Guide: Fine-tuning and Scaling LLMs with Multi-GPU Power

24 October 2024

This blog provides a thorough how-to guide on using Torchtune to fine-tune and scale large language models (LLMs) with AMD GPUs. Torchtune is a PyTorch library designed to let you easily fine-tune and experiment with LLMs. Using Torchtune’s flexibility and scalability, we show you how to fine-tune the Llama-3.1-8B model for summarization tasks using the EdinburghNLP/xsum dataset. Using LoRA(Low-Rank Adaptation), a parameter-efficient fine-tuning technique, Torchtune enables efficient training while maintaining performance across a different number of GPUs (2, 4, 6, and 8). This post also highlights how Torchtune’s distributed training capabilities allow users to scale up LLM fine-tuning on multiple GPUs to reduce training time while maintaining the quality of the trained model, demonstrating its potential and usage on modern AMD hardware using ROCm.

Read more ...

CTranslate2: Efficient Inference with Transformer Models on AMD GPUs

24 October 2024

Transformer models have revolutionized natural language processing (NLP) by delivering high-performance results in tasks like machine translation, text summarization, text generation, and speech recognition. However, deploying these models in production can be challenging due to their high computational and memory requirements. CTranslate2 addresses these challenges by providing a custom runtime that implements various optimization techniques to accelerate Transformer models during inference.

Read more ...

Inference with Llama 3.2 Vision LLMs on AMD GPUs Using ROCm

23 October 2024

Meta’s Llama models now support multimodal capabilities, expanding their functionality beyond traditional text-only applications. The Llama 3.2 models are available in a range of sizes, including medium-sized 11B and 90B multimodal models for vision-text reasoning tasks, and lightweight 1B and 3B text-only models designed for edge and mobile devices.

Read more ...

Speed Up Text Generation with Speculative Sampling on AMD GPUs

15 October 2024

24 February 2025

As the size of transformer models grow, so does the cost of conducting inference, impacting latency and throughput. Compression methods such as quantization and distillation, as well as hardware-aware optimizations such as Flash Attention and Triton, have been proposed to cut down the computation cost at different levels. However, these models either compromise on accuracy or require major changes to the model implementation.

Read more ...

Multinode Fine-Tuning of Stable Diffusion XL on AMD GPUs with Hugging Face Accelerate and OCI’s Kubernetes Engine (OKE)

15 October 2024

As the scale and complexity of generative AI and deep learning models grow, multinode training, basically dividing a training job across several processors, has become an essential strategy to speed up training and fine-tuning processes of large generative AI models like SDXL. By distributing the training workload across multiple GPUs on multiple nodes, multinode setups can significantly accelerate the training process. In this blog post we will show you, step-by step, how to set-up and fine-tune a Stable Diffusion XL (SDXL) model in a multinode Oracle Cloud Infrastructure’s (OCI) Kubernetes Engine (OKE) on AMD GPUs using ROCm.

Read more ...

Supercharging JAX with Triton Kernels on AMD GPUs

09 October 2024

Ready to supercharge your deep learning applications on AMD GPUs? In this blog, we’ll show you how to develop a custom fused dropout activation kernel for matrices in Triton, seamlessly call it from JAX, and benchmark its performance with ROCm. This powerful combination will take your model’s performance to the next level.

Read more ...

Leaner LLM Inference with INT8 Quantization on AMD GPUs using PyTorch

03 October 2024

With the scale of large language models (LLMs) reaching hundred of billions of parameters, the ways we represent data within these enormous models dramatically impacts the resources required to train them (e.g. the number of GPUs needed for inference). In our previous blogs (JAX mixed precision training; PyTorch AMP), we already demonstrated how mixed precision training can accelerate LLMs training process. In this blog post we will push things further and show you how quantization into an even lower precision data formats can speed up inference, saving time and memory, without sacrificing the overall performance of the model. Quantization is a technique where the precision of a model’s parameters is reduced from a 32-bit floating point (FP32) or a 16-bit floating point (FP16) to an 8-bit integer (INT8). Standard models typically use 32-bit floating-point (FP32) precision. However, this higher precision is not always necessary for inference tasks. By converting model weights and activations to lower precision formats like INT8 (8-bit integer), we can achieve faster computations and lower memory usage, effectively reducing the model size by three-fourths (from 32-bit) or half (from 16-bit) with only a slight accuracy reduction, which is often outweighed by the speed gains.

Read more ...

Fine-tuning Llama 3 with Axolotl using ROCm on AMD GPUs

23 September 2024

Large Language Models (LLMs) have revolutionized the field of natural language processing, enabling machines to understand and generate human-like language. However, these models are often trained on vast amounts of general-purpose data, which can make them less effective for specific tasks or domains. Fine-tuning involves training a pre-trained LLM on a specialized dataset to enhance its performance on specific tasks. As Andrej Karpathy analogized, this process is akin to allowing someone to practice a particular skill. Just as a person might need to practice a skill in a specific context to become proficient, an LLM needs to be fine-tuned on a specific dataset to become proficient in a particular task. For instance, an LLM can be fine-tuned for tasks such as financial forecasting, technical support, legal advising, medical diagnosis, or even instruction following. By fine-tuning an LLM, organizations can achieve better results and improve information security by limiting the exposure of sensitive data.

Read more ...

Inferencing and serving with vLLM on AMD GPUs

19 September 2024

09 June 2025

In the rapidly evolving field of artificial intelligence, Large Language Models (LLMs) have emerged as powerful tools for understanding and generating human-like text. However, deploying these models efficiently at scale presents significant challenges. This is where vLLM comes into play. vLLM is an innovative open-source library designed to optimize the serving of LLMs using advanced techniques. Central to vLLM is PagedAttention, a novel algorithm that enhances the efficiency of the model’s attention mechanism by managing it as virtual memory. This approach optimizes GPU memory utilization, facilitating the processing of longer sequences and enabling more efficient handling of large models within existing hardware constraints. Additionally, vLLM incorporates continuous batching to maximize throughput and minimize latency. By leveraging these cutting-edge techniques, vLLM significantly improves the performance and scalability of LLM deployment, allowing organizations to harness the power of state-of-the-art AI models more effectively and economically.

Read more ...

Enhancing vLLM Inference on AMD GPUs

19 September 2024

09 June 2025

In this blog, we’ll demonstrate the latest performance enhancements in vLLM inference on AMD Instinct accelerators using ROCm 6.2. In a nutshell, vLLM optimizes GPU memory utilization, allowing more efficient handling of large language models (LLMs) within existing hardware constraints, maximizing throughput and minimizing latency. We start the blog by briefly explaining how causal language models like Llama 3 and ChatGPT generate text, motivating the need to enhance throughput and reduce latency. If you’re new to vLLM, we also recommend reading our introduction to Inferencing and serving with vLLM on AMD GPUs. ROCm 6.2 introduces support for the following vLLM features which we will use in this blog post.

Read more ...

Getting to Know Your GPU: A Deep Dive into AMD SMI

17 September 2024

For system administrators and power users working with AMD hardware, performance optimization and efficient monitoring of resources is paramount. The AMD System Management Interface command-line tool, amd-smi, addresses these needs.

Read more ...

Introducing the AMD ROCm™ Offline Installer Creator: Simplifying Deployment for AI and HPC

10 September 2024

Document headings start at H2, not H1 [myst.header]

Read more ...

Optimize GPT Training: Enabling Mixed Precision Training in JAX using ROCm on AMD GPUs

06 September 2024

This blog builds on the nanoGPT model we discussed in A Guide to Implementing and Training Generative Pre-trained Transformers (GPT) in JAX on AMD GPUs. Here we will show you how to incorporate mixed precision training to the JAX-implemented nanoGPT model we discussed in our previous blog.

Read more ...

Image Classification with BEiT, MobileNet, and EfficientNet using ROCm on AMD GPUs

03 September 2024

Image classification is a key task in computer vision aiming at “understanding” an entire image. The outcome of an image classifier is a label or a category for the image as a whole, unlike object recognition where the task is to detect and classify multiple objects within an image.

Read more ...

Seismic stencil codes - part 3

29 August 2024

12 Aug, 2024 by

and .

Read more ...

Seismic stencil codes - part 2

29 August 2024

12 Aug, 2024 by

and .

Read more ...

Seismic stencil codes - part 1

29 August 2024

12 Aug, 2024 by

and .

Read more ...

Benchmarking Machine Learning using ROCm and AMD GPUs: Reproducing Our MLPerf Inference Submission

28 August 2024

Measuring the performance of new technologies is as old as human history, and often as intriguing (consider for example that we still compare the performance of new electric vehicle motors using horsepower). In the rapidly advancing field of machine learning (ML) MLPerf was established by MLCommons on May 2nd 2018 and quickly became the golden standard of measuring the accuracy, speed, and efficiency of AI. MLPerf provides benchmarks on training, HPC and Inference performance. Companies across the industry use MLPerf submissions to evaluate the performance of various GPUs and software platforms, and make their technology adoption decisions based on these results.

Read more ...

Performing natural language processing tasks with LLMs on ROCm running on AMD GPUs

21 August 2024

In this blog you will learn how to use ROCm, running on AMD’s Instinct GPUs, for a range of popular and useful natural language processing (NLP) tasks, using different large language models (LLMs). The blog includes a simple to follow hands-on guide that shows you how to implement LLMs for core NLP applications ranging from text generation and sentiment analysis to extractive question answering (QA), and solving a math problem.

Read more ...

Using AMD GPUs for Enhanced Time Series Forecasting with Transformers

19 August 2024

Time series forecasting (TSF) is a key concept in fields such as signal processing, data science, and machine learning (ML). TSF involves predicting future behavior of a system by analyzing its past temporal patterns, using historical data to forecast future data points. Classical approaches to TSF relied on a variety of statistical methods. Recently, machine learning techniques have been increasingly used for TSF, generating discussions within the community about whether these modern approaches outperform the classical statistical ones (see: Are Transformers Effective for Time Series Forecasting? and Yes, Transformers are Effective for Time Series Forecasting (+ Autoformer)).

Read more ...

Inferencing with Grok-1 on AMD GPUs

09 August 2024

We demonstrate that the massive Grok-1 model from xAI can run seamlessly on the AMD MI300X GPU accelerator by leveraging the ROCm software platform.

Read more ...

Optimizing RoBERTa: Fine-Tuning with Mixed Precision on AMD

29 July 2024

In this blog we explore how to fine-tune the Robustly Optimized BERT Pretraining Approach (RoBERTa) large language model, with emphasis on PyTorch’s mixed precision capabilities. Specifically, we explore using AMD GPUs for mixed precision fine-tuning to achieve faster model training without any major impacts on accuracy.

Read more ...

Graph analytics on AMD GPUs using Gunrock

29 July 2024

Graphs and graph analytics are related concepts that can help us understand complex data and relationships. In this context, a graph is a mathematical model that represents entities (called nodes or vertices) and their connections (called edges or links). And graph analytics is a form of data analysis that uses graph structures and algorithms to reveal insights from the data.

Read more ...

Using statistical methods to reliably compare algorithm performance in large generative AI models with JAX Profiler on AMD GPUs

22 July 2024

This blog provides a comprehensive guide on measuring and comparing the performance of various algorithms in a JAX-implemented generative AI model. Leveraging the JAX Profiler and statistical analysis, this blog demonstrates how to reliably evaluate key steps and compare algorithm performance on AMD GPUs.

Read more ...

DBRX Instruct on AMD GPUs

11 July 2024

In this blog, we showcase DBRX Instruct, a mixture-of-experts large language model developed by Databricks, on a ROCm-capable system with AMD GPUs.

Read more ...

Accelerate PyTorch Models using torch.compile on AMD GPUs with ROCm

11 July 2024

PyTorch 2.0 introduces torch.compile(), a tool to vastly accelerate PyTorch code and models. By converting PyTorch code into highly optimized kernels, torch.compile delivers substantial performance improvements with minimal changes to the existing codebase. This feature allows for precise optimization of individual functions, entire modules, and complex training loops, providing a versatile and powerful tool for enhancing computational efficiency.

Read more ...

Accelerating models on ROCm using PyTorch TunableOp

03 July 2024

In this blog, we will show how to leverage PyTorch TunableOp to accelerate models using ROCm on AMD GPUs. We will discuss the basics of General Matrix Multiplications (GEMMs), show an example of tuning a single GEMM, and finally, demonstrate real-world performance gains on an LLM (gemma) using TunableOp.

Read more ...

A Guide to Implementing and Training Generative Pre-trained Transformers (GPT) in JAX on AMD GPUs

02 July 2024

2 July, 2024 by

.

Read more ...

Mamba on AMD GPUs with ROCm

28 June 2024

28, Jun 2024 by

, , .

Read more ...

Deep Learning Recommendation Models on AMD GPUs

28 June 2024

28, June 2024 by

.

Read more ...

Fine-tuning and Testing Cutting-Edge Speech Models using ROCm on AMD GPUs

27 June 2024

AI Voice agents, or voice bots, are designed to communicate with people using a spoken language. Voice bots are commonly deployed in customer service and personal assistant applications, and have the potential to enter and revolutionize almost every aspect of people’s interaction with technology that can benefit from the use of voice. Automatic Speech Recognition (ASR), the technology that processes human speech into text, is essential for the creation of AI Voice agents. In this blog post we will provide you with a hands-on introduction to the deployment of three machine learning ASR models, using ROCm on AMD GPUs.

Read more ...

TensorFlow Profiler in practice: Optimizing TensorFlow models on AMD GPUs

18 June 2024

TensorFlow Profiler consists of a set of tools designed to measure resource utilization and performance during the execution of TensorFlow models. It offers insights into how a model interacts with hardware resources, including execution time and memory usage. TensorFlow Profiler helps in pinpointing performance bottlenecks, allowing us to fine-tune the execution of models for improved efficiency and faster outcomes which can be crucial in scenarios where near-real-time predictions are required.

Read more ...

Stone Ridge Expands Reservoir Simulation Options with AMD Instinct™ Accelerators

10 June 2024

Stone Ridge Technology (SRT) pioneered the use of GPUs for high performance reservoir simulation (HPC) nearly a decade ago with ECHELON, its flagship software product. ECHELON, the first of its kind, engineered from the outset to harness the full potential of massively parallel GPUs, stands apart in the industry for its power, efficiency, and accuracy. Now, ECHELON has added support for AMDInstinct accelerators into its simulation engine, offering new flexibility and optionality to its clients.

Read more ...

Segment Anything with AMD GPUs

04 June 2024

4 Jun, 2024 by

.

Read more ...

SmoothQuant model inference on AMD Instinct MI300X using Composable Kernel

31 May 2024

The AMD ROCm™ Composable Kernel (CK) library provides a programming model for writing performance-critical kernels for machine learning workloads. It generates a general-purpose kernel during the compilation phase through a C++ template, enabling developers to achieve operation fusions on different data precisions.

Read more ...

Unveiling performance insights with PyTorch Profiler on an AMD GPU

29 May 2024

29 May, 2024 by

.

Read more ...

Panoptic segmentation and instance segmentation with Detectron2 on AMD GPUs

23 May 2024

23, May 2024 by

.

Read more ...

Siemens taps AMD Instinct™ GPUs to expand high-performance hardware options for Simcenter STAR-CCM+

16 May 2024

Siemens recently announced that its Simcenter STAR-CCM+ multi-physics computational fluid dynamics (CFD) software now supports AMD Instinct™ GPUs for GPU-native computation. This move addresses its users’ needs for computational efficiency, reduced simulation costs and energy usage, and greater hardware choice.

Read more ...

AMD Collaboration with the University of Michigan offers High Performance Open-Source Solutions to the Bioinformatics Community

16 May 2024

Long read DNA sequencing technology is revolutionizing genetic diagnostics and precision medicine by helping us discover structural variants and assemble whole genomes. It also helps us study evolutionary relationships. Lower sequencing costs and high-throughput portable long read sequencers are revolutionizing precision medicine today. Long read sequencers from the top manufacturers including Oxford Nanopore (ONT) and PacBio, can produce reads that are much longer than previous generations of sequencers. However, long reads vary in length and are significantly more error prone than short reads. Sequence alignment (on CPUs) is one of the main bottlenecks in long read processing workflows.

Read more ...

Accelerating Large Language Models with Flash Attention on AMD GPUs

15 May 2024

15, May 2024 by

.

Read more ...

Reading AMD GPU ISA

13 May 2024

For an application developer it is often helpful to read the Instruction Set Architecture (ISA) for the GPU architecture that is used to perform its computations. Understanding the instructions of the pertinent code regions of interest can help in debugging and achieving performance optimization of the application.

Read more ...

AMD in Action: Unveiling the Power of Application Tracing and Profiling

07 May 2024

Rocprof is a robust tool designed to analyze and optimize the performance of HIP programs on AMD ROCm platforms, helping developers pinpoint and resolve performance bottlenecks. Rocprof provides a variety of profiling data, including performance counters, hardware traces, and runtime API/activity traces.

Read more ...

Step-by-Step Guide to Use OpenLLM on AMD GPUs

01 May 2024

OpenLLM is an open-source platform designed to facilitate the deployment and utilization of large language models (LLMs), supporting a wide range of models for diverse applications, whether in cloud environments or on-premises. In this tutorial, we will guide you through the process of starting an LLM server using OpenLLM, enabling interaction with the server from your local machine, with special emphasis on leveraging the capabilities of AMD GPUs.

Read more ...

Inferencing with Mixtral 8x22B on AMD GPUs

01 May 2024

1, May 2024 by

.

Read more ...

Training a Neural Collaborative Filtering (NCF) Recommender on an AMD GPU

30 April 2024

30, Apr 2024 by

.

Read more ...

Table Question-Answering with TaPas

26 April 2024

26 Apr, 2024 by

.

Read more ...

Multimodal (Visual and Language) understanding with LLaVA-NeXT

26 April 2024

26, Apr 2024 by

.

Read more ...

Application portability with HIP

26 April 2024

Many scientific applications run on AMD-equipped computing platforms and supercomputers, including Frontier, the first Exascale system in the world. These applications, coming from a myriad of science domains, were ported to run on AMD GPUs using the Heterogeneous-compute Interface for Portability (HIP) abstraction layer. HIP enables these High-Performance Computing (HPC) facilities to transition their CUDA codes to run and take advantage of the latest AMD GPUs. The effort involved in porting these scientific applications varies from a few hours to a few weeks and largely depends on the complexity of the original source code. Figure 1 shows several examples of applications that have been ported and the corresponding porting effort.

Read more ...

Unlocking Vision-Text Dual-Encoding: Multi-GPU Training of a CLIP-Like Model

24 April 2024

24 Apr, 2024 by

.

Read more ...

Transforming Words into Motion: A Guide to Video Generation with AMD GPU

24 April 2024

24 Apr, 2024 by

.

Read more ...

C++17 parallel algorithms and HIPSTDPAR

18 April 2024

The C++17 standard added the concept of parallel algorithms to the pre-existing C++ Standard Library. The parallel version of algorithms like std::transform maintain the same signature as the regular serial version, except for the addition of an extra parameter specifying the execution policy to use. This flexibility allows users that are already using the C++ Standard Library algorithms to take advantage of multi-core architectures by just introducing minimal changes to their code.

Read more ...

Inferencing with AI2’s OLMo model on AMD GPU

17 April 2024

In this blog, we will show you how to generate text using AI2’s OLMo model on AMD GPU.

Read more ...

Text Summarization with FLAN-T5

16 April 2024

In this blog, we showcase the language model FLAN-T5 and how to fine-tune it on a summarization task with HuggingFace in an AMD GPUs + ROCm system.

Read more ...

Speech-to-Text on an AMD GPU with Whisper

16 April 2024

16 Apr, 2024 by

.

Read more ...

PyTorch C++ Extension on AMD GPU

16 April 2024

16, Apr 2024 by

.

Read more ...

Programming AMD GPUs with Julia

16 April 2024

Julia is a high-level, general-purpose dynamic programming language that automatically compiles to efficient native code via LLVM, and supports multiple platforms. With LLVM, comes the support for programming GPUs, including AMD GPUs.

Read more ...

Program Synthesis with CodeGen

16 April 2024

CodeGen is a family of standard transformer-based auto-regressive language models for program synthesis, which as defined by the authors as a method for generating computer programs that solve specified problems, using input-output examples or natural language descriptions.

Read more ...

Interacting with Contrastive Language-Image Pre-Training (CLIP) model on AMD GPU

16 April 2024

16, Apr 2024 by

.

Read more ...

Instruction fine-tuning of StarCoder with PEFT on multiple AMD GPUs

16 April 2024

16 Apr, 2024 by

.

Read more ...

Affinity part 2 - System topology and controlling affinity

16 April 2024

In Part 1 of the Affinity blog series, we looked at the importance of setting affinity for High Performance Computing (HPC) workloads. In this blog post, our goals are the following:

Read more ...

Affinity part 1 - Affinity, placement, and order

16 April 2024

Modern hardware architectures are increasingly complex with multiple sockets, many cores in each Central Processing Unit (CPU), Graphical Processing Units (GPUs), memory controllers, Network Interface Cards (NICs), etc. Peripherals such as GPUs or memory controllers will often be local to a CPU socket. Such designs present interesting challenges in optimizing memory access times, data transfer times, etc. Depending on how the system is built, hardware components are connected, and the workload being run, it may be advantageous to use the resources of the system in a specific way. In this article, we will discuss the role of affinity, placement, and order in improving performance for High Performance Computing (HPC) workloads. A short case study is also presented to familiarize you with performance considerations on a node in the Frontier supercomputer. In a follow-up article, we also aim to equip you with the tools you need to understand your system’s hardware topology and set up affinity for your application accordingly.

Read more ...

Enhancing LLM Accessibility: A Deep Dive into QLoRA Through Fine-tuning Llama Model on a single AMD GPU

15 April 2024

15, Apr 2024 by

.

Read more ...

Enhancing LLM Accessibility: A Deep Dive into QLoRA Through Fine-tuning Llama 2 on a single AMD GPU

15 April 2024

15, Apr 2024 by

.

Read more ...

Developing Triton Kernels on AMD GPUs

15 April 2024

19 March 2025

OpenAI has developed a powerful GPU focused programming language and compiler called Triton that works seamlessly with AMD GPUs. The goal of Triton is to enable AI engineers and scientists to write high-performant GPU code with minimal expertise. Triton kernels are performant because of their blocked program representation, allowing them to be compiled into highly optimized binary code. Triton also leverages Python for kernel development, making it both familiar and accessible. And the kernels can be easily compiled by simply declaring the triton.jit python decorator before the kernel.

Read more ...

GPU Unleashed: Training Reinforcement Learning Agents with Stable Baselines3 on an AMD GPU in Gymnasium Environment

11 April 2024

11 Apr, 2024 by

.

Read more ...

ResNet for image classification using AMD GPUs

09 April 2024

9 Apr, 2024 by

.

Read more ...

Small language models with Phi-2

08 April 2024

Like many other LLMs, Phi-2 is a transformer-based model with a next-word prediction objective that is trained on billions of tokens. At 2.7 billion parameters, Phi-2 is a relatively small language model, but it achieves outstanding performance on a variety of tasks, including common sense reasoning, language understanding, math, and coding. For reference, GPT 3.5 has 175 billion parameters and the smallest version of LLaMA-2 has 7 billion parameters. According to Microsoft, Phi-2 is capable of matching or outperforming models up to 25 times larger due to more carefully curated training data and model scaling.

Read more ...

Using the ChatGLM-6B bilingual language model with AMD GPUs

04 April 2024

ChatGLM-6B is an open bilingual (Chinese-English) language model with 6.2 billion parameters. It’s optimized for Chinese conversation based on General Language Model (GLM) architecture. GLM is a pretraining framework that seeks to combine the strengths of autoencoder models (like BERT) and autoregressive models (like GPT). The GLM framework randomly blanks out continuous spans of tokens from the input text (autoencoding methodology) and trains the model to sequentially reconstruct the spans (autoregressive pretraining methodology).

Read more ...

Total body segmentation using MONAI Deploy on an AMD GPU

04 April 2024

4, Apr 2024 by

.

Read more ...

Retrieval Augmented Generation (RAG) using LlamaIndex

04 April 2024

4, Apr 2024 by

.

Read more ...

Image classification using Vision Transformer with AMD GPUs

04 April 2024

4 Apr, 2024 by

.

Read more ...

Building semantic search with SentenceTransformers on AMD

04 April 2024

4 Apr, 2024 by

.

Read more ...

Scale AI applications with Ray

01 April 2024

1, Apr 2024 by

Logan Grado, {hoverxref}Eliot Li.

Read more ...

Automatic mixed precision in PyTorch using AMD GPUs

29 March 2024

As models increase in size, the time and memory needed to train them–and consequently, the cost–also increases. Therefore, any measures we take to reduce training time and memory usage can be highly beneficial. This is where Automatic Mixed Precision (AMP) comes in.

Read more ...

Large language model inference optimizations on AMD GPUs

15 March 2024

15, Mar 2024 by

.

Read more ...

Building a decoder transformer model on AMD GPU(s)

12 March 2024

12, Mar 2024 by

.

Read more ...

Question-answering Chatbot with LangChain on an AMD GPU

11 March 2024

11, Mar 2024 by

.

Read more ...

Music Generation With MusicGen on an AMD GPU

08 March 2024

MusicGen is an autoregressive, transformer-based model that predicts the next segment of a piece of music based on previous segments. This is a similar approach to language models predicting the next token.

Read more ...

Efficient image generation with Stable Diffusion models and ONNX Runtime using AMD GPUs

23 February 2024

23 Feb, 2024 by

.

Read more ...

Simplifying deep learning: A guide to PyTorch Lightning

08 February 2024

8, Feb 2024 by

.

Read more ...

Two-dimensional images to three-dimensional scene mapping using NeRF on an AMD GPU

07 February 2024

7, Feb 2024 by

.

Read more ...

Using LoRA for efficient fine-tuning: Fundamental principles

05 February 2024

5, Feb 2024 by

.

Read more ...

Fine-tune Llama model with LoRA: Customizing a large language model for question-answering

01 February 2024

1, Feb 2024 by

.

Read more ...

Fine-tune Llama 2 with LoRA: Customizing a large language model for question-answering

01 February 2024

1, Feb 2024 by

.

Read more ...

Pre-training BERT using Hugging Face & TensorFlow on an AMD GPU

29 January 2024

29, Jan 2024 by

.

Read more ...

Pre-training BERT using Hugging Face & PyTorch on an AMD GPU

26 January 2024

26, Jan 2024 by

.

Read more ...

Accelerating XGBoost with Dask using multiple AMD GPUs

26 January 2024

26 Jan, 2024 by

.

Read more ...

LLM distributed supervised fine-tuning with JAX

25 January 2024

25 Jan, 2024 by

.

Read more ...

Pre-training a large language model with Megatron-DeepSpeed on multiple AMD GPUs

24 January 2024

24 Jan, 2024 by

.

Read more ...

Efficient image generation with Stable Diffusion models and AITemplate using AMD GPUs

24 January 2024

24 Jan, 2024 by

.

Read more ...

Efficient deployment of large language models with Text Generation Inference on AMD GPUs

24 January 2024

24 Jan, 2024 by

.

Read more ...