Posts by Sean Song
PyTorch Fully Sharded Data Parallel (FSDP) on AMD GPUs with ROCm
- 09 February 2025
PyTorch Fully Sharded Data Parallel (FSDP) is a data parallelism technique that enables the training of large-scale models in a memory-efficient manner. FSDP achieves this memory efficiency by sharding model parameters, optimizer states, and/or gradients across GPUs, reducing the memory footprint required by each GPU. This enables the training of large-scale models with lower total GPU memory than DDP (Distributed Data Parallel), in which the model weights and optimizer states are replicated across all processes. To learn more about DDP, refer to Distributed Data Parallel (DDP) training on AMD GPU with ROCm.
Vision Mamba on AMD GPU with ROCm
- 24 January 2025
State Space Models (SSMs), such as Mamba, have emerged as a potential alternative to Transformer models. Vision backbones using only SSMs have yielded promising results. For more information about SSMs and Mamba’s performance on AMD hardware, see Mamba on AMD GPUs with ROCm. This blog explores Vision Mamba (Vim), an innovative and efficient backbone for vision tasks and evaluate its performance on AMD GPUs with ROCm. We’ll start with a brief introduction to Vision Mamba, followed by a step-by-step guide on training and running inference with Vision Mamba on AMD GPUs using ROCm.
Distributed Data Parallel Training on AMD GPU with ROCm
- 01 November 2024
With the increase in complexity and size of machine learning models, the demand for computational resources grows. Training on a single GPU can become a bottleneck for deep learning applications, especially with large datasets and models that are slow to train on a single GPU. Parallelized training addresses this challenge. Out of the various forms of parallelized training, this blog focuses on Distributed Data Parallel (DDP), a key feature in PyTorch that accelerates training across multiple GPUs and nodes.
Inference with Llama 3.2 Vision LLMs on AMD GPUs Using ROCm
- 23 October 2024
Meta’s Llama models now support multimodal capabilities, expanding their functionality beyond traditional text-only applications. The Llama 3.2 models are available in a range of sizes, including medium-sized 11B and 90B multimodal models for vision-text reasoning tasks, and lightweight 1B and 3B text-only models designed for edge and mobile devices.
Accelerate PyTorch Models using torch.compile on AMD GPUs with ROCm
- 11 July 2024
PyTorch 2.0 introduces torch.compile()
, a tool to vastly accelerate PyTorch code and models. By converting PyTorch code into highly optimized kernels, torch.compile
delivers substantial performance improvements with minimal changes to the existing codebase. This feature allows for precise optimization of individual functions, entire modules, and complex training loops, providing a versatile and powerful tool for enhancing computational efficiency.
Mamba on AMD GPUs with ROCm
- 28 June 2024
28, Jun 2024 by Sean Song, Jassani Adeem, Moskvichev Arseny.
Unlocking Vision-Text Dual-Encoding: Multi-GPU Training of a CLIP-Like Model
- 24 April 2024
24 Apr, 2024 by Sean Song.
Interacting with Contrastive Language-Image Pre-Training (CLIP) model on AMD GPU
- 16 April 2024
16, Apr 2024 by Sean Song.
Enhancing LLM Accessibility: A Deep Dive into QLoRA Through Fine-tuning Llama Model on a single AMD GPU
- 15 April 2024
15, Apr 2024 by Sean Song.
Enhancing LLM Accessibility: A Deep Dive into QLoRA Through Fine-tuning Llama 2 on a single AMD GPU
- 15 April 2024
15, Apr 2024 by Sean Song.
Using LoRA for efficient fine-tuning: Fundamental principles
- 05 February 2024
5, Feb 2024 by Sean Song.
Fine-tune Llama model with LoRA: Customizing a large language model for question-answering
- 01 February 2024
1, Feb 2024 by Sean Song.
Fine-tune Llama 2 with LoRA: Customizing a large language model for question-answering
- 01 February 2024
1, Feb 2024 by Sean Song.