PyTorch Fully Sharded Data Parallel (FSDP) is a data parallelism technique that enables the training of large-scale models in a memory-efficient manner. FSDP achieves this memory efficiency by sharding model parameters, optimizer states, and/or gradients across GPUs, reducing the memory footprint required by each GPU. This enables the training of large-scale models with lower total GPU memory than DDP (Distributed Data Parallel), in which the model weights and optimizer states are replicated across all processes. To learn more about DDP, refer to Distributed Data Parallel (DDP) training on AMD GPU with ROCm.
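As a minimal sketch of how this looks in practice (assuming a multi-GPU machine, a PyTorch build with ROCm or CUDA support, and a launch via `torchrun`; the model and sizes below are hypothetical), wrapping a module in `FullyShardedDataParallel` shards its parameters, gradients, and optimizer state across the participating ranks:

```python
# Minimal FSDP sketch -- assumes a launch such as
# `torchrun --nproc_per_node=<num_gpus> train_fsdp.py` with one process per GPU.
import os
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

dist.init_process_group("nccl")            # NCCL API (backed by RCCL on ROCm)
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

# Hypothetical toy model used only for illustration.
model = nn.Sequential(
    nn.Linear(1024, 4096), nn.ReLU(), nn.Linear(4096, 1024)
).cuda()

# FSDP shards parameters, gradients, and optimizer state across ranks,
# gathering full parameters only for the layers currently being computed.
model = FSDP(model)

# Create the optimizer after wrapping so it references the sharded parameters.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

inputs = torch.randn(8, 1024, device="cuda")
loss = model(inputs).sum()
loss.backward()
optimizer.step()

dist.destroy_process_group()
```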
State Space Models (SSMs), such as Mamba, have emerged as a potential alternative to Transformer models. Vision backbones using only SSMs have yielded promising results. For more information about SSMs and Mamba’s performance on AMD hardware, see Mamba on AMD GPUs with ROCm.
This blog explores Vision Mamba (Vim), an innovative and efficient backbone for vision tasks, and evaluates its performance on AMD GPUs with ROCm. We'll start with a brief introduction to Vision Mamba, followed by a step-by-step guide to training and running inference with Vim on AMD GPUs using ROCm.
With the increase in complexity and size of machine learning models, the demand for computational resources grows. Training on a single GPU can become a bottleneck for deep learning applications, especially with large datasets and models that take a long time to train. Parallelized training addresses this challenge. Of the various forms of parallelized training, this blog focuses on Distributed Data Parallel (DDP), a key feature in PyTorch that accelerates training across multiple GPUs and nodes.
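For comparison with the FSDP sketch above, a minimal DDP example (again assuming a `torchrun` launch with one process per GPU; the model and sizes are hypothetical) replicates the full model on every rank and averages gradients across processes during the backward pass:

```python
# Minimal DDP sketch -- launch with `torchrun --nproc_per_node=<num_gpus> train_ddp.py`.
import os
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group("nccl")            # NCCL API (backed by RCCL on ROCm)
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

# Each process keeps a full replica of the model on its own GPU.
model = nn.Linear(512, 512).cuda()
model = DDP(model, device_ids=[local_rank])

optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
inputs = torch.randn(32, 512, device="cuda")
loss = model(inputs).sum()
loss.backward()    # gradients are all-reduced (averaged) across ranks here
optimizer.step()

dist.destroy_process_group()
```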
Meta’s Llama models now support multimodal capabilities, expanding their functionality beyond traditional text-only applications. The Llama 3.2 models are available in a range of sizes, including medium-sized 11B and 90B multimodal models for vision-text reasoning tasks, and lightweight 1B and 3B text-only models designed for edge and mobile devices.
PyTorch 2.0 introduces torch.compile(), a tool to vastly accelerate PyTorch code and models. By converting PyTorch code into highly optimized kernels, torch.compile delivers substantial performance improvements with minimal changes to the existing codebase. This feature allows for precise optimization of individual functions, entire modules, and complex training loops, providing a versatile and powerful tool for enhancing computational efficiency.
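As a minimal sketch of the API (assuming PyTorch 2.0 or later; the function and module below are hypothetical examples), `torch.compile` accepts either a plain function or an `nn.Module` and returns a compiled callable with the same signature:

```python
# Minimal torch.compile sketch (requires PyTorch 2.0+).
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"

def pointwise_ops(x: torch.Tensor) -> torch.Tensor:
    # A chain of pointwise ops that the compiler can fuse into fewer kernels.
    return torch.sin(x) ** 2 + torch.cos(x) ** 2

compiled_fn = torch.compile(pointwise_ops)
x = torch.randn(1024, 1024, device=device)
out = compiled_fn(x)   # first call triggers compilation; later calls reuse the cached kernels

# Entire modules can be compiled the same way:
model = torch.nn.Sequential(torch.nn.Linear(1024, 1024), torch.nn.GELU()).to(device)
compiled_model = torch.compile(model)
out = compiled_model(x)
```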