Posts tagged Fine-Tuning

PyTorch Fully Sharded Data Parallel (FSDP) on AMD GPUs with ROCm

PyTorch Fully Sharded Data Parallel (FSDP) is a data parallelism technique that enables the training of large-scale models in a memory-efficient manner. FSDP achieves this memory efficiency by sharding model parameters, optimizer states, and/or gradients across GPUs, reducing the memory footprint required on each GPU. This allows large-scale models to be trained with less total GPU memory than DDP (Distributed Data Parallel), which replicates the model weights and optimizer states across all processes. To learn more about DDP, refer to Distributed Data Parallel (DDP) training on AMD GPU with ROCm.
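For a flavor of the API, here is a minimal, hedged sketch of wrapping a model in FSDP; the toy model, sizes, and hyperparameters are illustrative placeholders rather than code from the post, and the script is assumed to be launched with torchrun on a multi-GPU node.

```python
# Minimal FSDP sketch; launch with: torchrun --nproc_per_node=<num_gpus> train_fsdp.py
# The toy model, sizes, and learning rate are illustrative placeholders.
import os

import torch
import torch.distributed as dist
import torch.nn as nn
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP


def main():
    dist.init_process_group("nccl")  # on ROCm, PyTorch routes NCCL calls through RCCL
    local_rank = int(os.environ["LOCAL_RANK"])  # set by torchrun
    torch.cuda.set_device(local_rank)

    model = nn.Sequential(nn.Linear(1024, 4096), nn.ReLU(), nn.Linear(4096, 1024)).cuda()
    model = FSDP(model)  # shards parameters, gradients, and optimizer state across ranks

    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
    inputs = torch.randn(8, 1024, device="cuda")
    loss = model(inputs).square().mean()  # dummy loss, for illustration only
    loss.backward()
    optimizer.step()

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```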

Read more ...


Best practices for competitive inference optimization on AMD Instinct™ MI300X GPUs

Optimizing LLM performance on GPUs is challenging due to diverse model needs, memory constraints, and the need to balance latency and throughput. This document examines how hardware utilization, memory and communication bandwidth, and scaling contribute to inference performance, detailing optimal configurations for AMD Instinct™ MI300X GPUs.

Read more ...


Distributed fine-tuning of MPT-30B using Composer on AMD GPUs

Composer, developed by MosaicML, is an open-source deep learning training library built on top of PyTorch, designed to simplify and optimize distributed training workflows. It supports scalable training on multiple nodes and efficiently handles datasets of various sizes. Composer integrates advanced techniques such as PyTorch Fully Sharded Data Parallelism (FSDP), elastic sharded checkpointing, training callbacks, and speed-up algorithms to enhance training performance and flexibility. Its launcher closely resembles PyTorch’s torchrun, and the library has demonstrated excellent efficiency when scaling to hundreds of GPUs.
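As a rough illustration of the Trainer API (a hedged sketch with a toy model and dataset, not code from the post), a run launched with Composer’s torchrun-like launcher might look like this:

```python
# Hedged sketch of the Composer Trainer API; launch with Composer's launcher,
# e.g. `composer -n <num_gpus> train.py`. The model and data are toy placeholders.
import torch
from torch.utils.data import DataLoader, TensorDataset

from composer import Trainer
from composer.models import ComposerClassifier

# Toy classifier and random dataset purely for illustration.
net = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(28 * 28, 10))
model = ComposerClassifier(net, num_classes=10)
dataset = TensorDataset(torch.randn(256, 1, 28, 28), torch.randint(0, 10, (256,)))
train_dataloader = DataLoader(dataset, batch_size=32)

trainer = Trainer(
    model=model,
    train_dataloader=train_dataloader,
    max_duration="1ep",  # Composer time string: train for one epoch
    device="gpu",
)
trainer.fit()
```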

Read more ...


Quantized 8-bit LLM training and inference using bitsandbytes on AMD GPUs

In this blog post, we cover bitsandbytes’ 8-bit representations which, as you will see, significantly reduce the memory needed for fine-tuning and running inference with LLMs. Many quantization techniques are used in the field to shrink model size, but bitsandbytes also offers quantization of the optimizer states. This post will help you understand the basic principles underlying the bitsandbytes 8-bit representations, explain the bitsandbytes 8-bit optimizer and LLM.int8 techniques, and show you how to implement these on AMD GPUs using ROCm.
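To give a flavor of the workflow (a hedged sketch, not the post’s exact code; the model name and hyperparameters are placeholders), LLM.int8 weight loading and the 8-bit optimizer can be used roughly as follows:

```python
# Hedged sketch: LLM.int8 weight loading for inference, plus an 8-bit Adam optimizer.
# The model name, sizes, and learning rate are placeholders, not values from the post.
import bitsandbytes as bnb
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_name = "facebook/opt-350m"  # placeholder; substitute the LLM you are working with

# LLM.int8: load the model weights in 8 bits for memory-efficient inference.
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
    device_map="auto",
)
inputs = tokenizer("8-bit quantization reduces memory because", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=32)[0]))

# 8-bit optimizer: stores Adam's statistics in 8 bits, shrinking optimizer state memory.
# Shown on a toy module; for an 8-bit-quantized LLM you would train LoRA adapters instead.
toy_module = torch.nn.Linear(4096, 4096).cuda()
optimizer = bnb.optim.Adam8bit(toy_module.parameters(), lr=2e-5)
```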

Read more ...


Inference with Llama 3.2 Vision LLMs on AMD GPUs Using ROCm

Meta’s Llama models now support multimodal capabilities, expanding their functionality beyond traditional text-only applications. The Llama 3.2 models are available in a range of sizes, including medium-sized 11B and 90B multimodal models for vision-text reasoning tasks, and lightweight 1B and 3B text-only models designed for edge and mobile devices.

Read more ...


Multinode Fine-Tuning of Stable Diffusion XL on AMD GPUs with Hugging Face Accelerate and OCI’s Kubernetes Engine (OKE)

As the scale and complexity of generative AI and deep learning models grow, multinode training, which divides a training job across several processors, has become an essential strategy for speeding up the training and fine-tuning of large generative AI models like SDXL. By distributing the training workload across multiple GPUs on multiple nodes, multinode setups can significantly accelerate the training process. In this blog post we will show you, step by step, how to set up and fine-tune a Stable Diffusion XL (SDXL) model on a multinode Oracle Cloud Infrastructure (OCI) Kubernetes Engine (OKE) cluster on AMD GPUs using ROCm.
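A minimal, hedged sketch of the Hugging Face Accelerate pattern behind such a run is shown below; the toy model and data stand in for SDXL and its dataset, and every node is assumed to launch the same script, for example with accelerate launch or torchrun.

```python
# Hedged sketch of multinode training with Hugging Face Accelerate.
# Each node runs the same script, e.g. `accelerate launch train.py` after
# `accelerate config`, or `torchrun --nnodes=<N> ... train.py`.
import torch
from torch.utils.data import DataLoader, TensorDataset
from accelerate import Accelerator

accelerator = Accelerator()

# Toy model and data standing in for SDXL and its training set.
model = torch.nn.Linear(1024, 1024)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
dataset = TensorDataset(torch.randn(256, 1024), torch.randn(256, 1024))
dataloader = DataLoader(dataset, batch_size=16)

# Accelerate places everything on the right device and wires up distributed training.
model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)

model.train()
for inputs, targets in dataloader:
    optimizer.zero_grad()
    loss = torch.nn.functional.mse_loss(model(inputs), targets)
    accelerator.backward(loss)  # handles gradient synchronization across nodes
    optimizer.step()

accelerator.print("done")  # prints only on the main process
```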

Read more ...


Table Question-Answering with TaPas

26 Apr, 2024 by Phillip Dang.

Read more ...


Multimodal (Visual and Language) understanding with LLaVA-NeXT

26 Apr, 2024 by Phillip Dang.

Read more ...


Unlocking Vision-Text Dual-Encoding: Multi-GPU Training of a CLIP-Like Model

24 Apr, 2024 by Sean Song.

Read more ...


Text Summarization with FLAN-T5

16 Apr, 2024 by Phillip Dang.

Read more ...


Instruction fine-tuning of StarCoder with PEFT on multiple AMD GPUs

16 Apr, 2024 by Douglas Jia.

Read more ...


Enhancing LLM Accessibility: A Deep Dive into QLoRA Through Fine-tuning Llama 2 on a single AMD GPU

15 Apr, 2024 by Sean Song.

Read more ...


Small language models with Phi-2

8 Apr, 2024 by Phillip Dang.

Read more ...


Scale AI applications with Ray

1 Apr, 2024 by Vicky Tsang, Logan Grado, Eliot Li.

Read more ...


Large language model inference optimizations on AMD GPUs

15 Mar, 2024 by Seungrok Jung.

Read more ...


Building a decoder transformer model on AMD GPU(s)

12 Mar, 2024 by Phillip Dang.

Read more ...


Question-answering Chatbot with LangChain on an AMD GPU

11 Mar, 2024 by Phillip Dang.

Read more ...


Music Generation With MusicGen on an AMD GPU

8 Mar, 2024 by Phillip Dang.

Read more ...


Simplifying deep learning: A guide to PyTorch Lightning

8 Feb, 2024 by Phillip Dang.

Read more ...


Using LoRA for efficient fine-tuning: Fundamental principles

5 Feb, 2024 by Sean Song.

Read more ...


Fine-tune Llama 2 with LoRA: Customizing a large language model for question-answering

1 Feb, 2024 by Sean Song.

Read more ...


Pre-training BERT using Hugging Face & TensorFlow on an AMD GPU

29 Jan, 2024 by Vara Lakshmi Bayanagari.

Read more ...


Pre-training BERT using Hugging Face & PyTorch on an AMD GPU

26 Jan, 2024 by Vara Lakshmi Bayanagari.

Read more ...


LLM distributed supervised fine-tuning with JAX

25 Jan, 2024 by Douglas Jia.

Read more ...


Pre-training a large language model with Megatron-DeepSpeed on multiple AMD GPUs

24 Jan, 2024 by Douglas Jia.

Read more ...