Posts tagged Fine-Tuning

VLM Fine-Tuning for Robotics on AMD Enterprise AI Suite

28 November 2025

Vision-language models (VLMs) power applications from image captioning to robotics instruction following, but full model fine-tuning is resource-intensive and slow. Low-Rank Adaptation (LoRA) offers a faster, more efficient alternative by training only a small set of injected parameters while keeping the base model frozen.
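As a rough illustration of the idea in this teaser, the sketch below shows a LoRA-style forward pass in plain Python. It is not taken from the post: the function names, the row-vector convention, and the `alpha / rank` scaling are assumptions chosen for illustration, and real fine-tuning would use a library such as PEFT rather than list-based matrices.

```python
def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_forward(x, w_frozen, lora_a, lora_b, alpha, rank):
    """Compute x @ W + (alpha / rank) * (x @ A) @ B.

    w_frozen (d_in x d_out) stays fixed; only lora_a (d_in x rank) and
    lora_b (rank x d_out) would be trained. Names are hypothetical.
    """
    base = matmul(x, w_frozen)
    update = matmul(matmul(x, lora_a), lora_b)
    scale = alpha / rank
    return [[b + scale * u for b, u in zip(b_row, u_row)]
            for b_row, u_row in zip(base, update)]


def trainable_params(d_in, d_out, rank):
    """Return (full fine-tuning, LoRA) trainable parameter counts for one layer."""
    return d_in * d_out, rank * (d_in + d_out)


if __name__ == "__main__":
    full, lora = trainable_params(4096, 4096, rank=8)
    print(f"full: {full:,} params, LoRA: {lora:,} params")
```

With a zero-initialized `lora_b`, the adapted layer reproduces the frozen layer exactly, which is why training can start from the base model's behavior.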

Read more ...

Fine-Tune LLMs for Proteins with AMD Enterprise AI Suite

27 November 2025

Want to teach a large language model to understand protein sequences? ROCm has got you covered.

Read more ...

Using Reinforcement Learning to Fix Text in AI-Generated Videos

25 November 2025

One common giveaway that a video is AI-generated is the text. Letters may look slightly malformed or nonsensical, words can be misspelled, and full sentences can have grammatical errors. Improving text generation in videos isn’t just a cosmetic issue: the prompted text must be rendered precisely, or the message becomes confusing, unprofessional, and potentially misleading. This is an excellent case for leveraging reinforcement learning to improve a video generation model on a specific task without requiring massive amounts of suitable training data.

Read more ...

Technical Dive into AMD MLPerf Training v5.1 Submission

12 November 2025

MLPerf Training v5.1 was released on November 12, 2025. For this round, AMD showcased its newest GPUs and added a new benchmark. The highlights of this round include:

Read more ...

Reproducing AMD MLPerf Training v5.1 Submission Result

12 November 2025

Building upon the success of the MLPerf Training v5.0 submission, AMD has submitted improved results for the Llama 2 70B LoRA fine-tuning benchmark on the MI300X and MI325X platforms in the v5.1 round, along with results for the MI350X and MI355X platforms. In addition, AMD has submissions for the newly added Llama 3.1 8B pretraining benchmark in this MLPerf Training round. The AMD submissions are summarized in the following table:

Read more ...

Training AI Weather Forecasting Models on AMD Instinct

10 November 2025

Weather forecasting is one of the most computationally intensive scientific challenges and an essential societal need. Predicting extreme weather events, agricultural and energy planning and daily forecasts all require accurate weather predictions. Traditionally, Numerical Weather Prediction (NWP) has served as the foundation of weather forecasting by solving complex physical equations that require significant computational power. However, recent advances in machine learning have led to the development of alternative prediction models that reduce computational costs by orders of magnitude, while either maintaining or improving accuracy in forecasts. Models like GenCast [1], Pangu-Weather [2], Aurora [3] and others have shown promising results in this area (see the WeatherBench [4] scorecard). Running inference on these models using AMD GPUs is straightforward, as highlighted in our recent blog post: Running SOTA AI-based Weather Forecasting models on AMD Instinct.

Read more ...

Day-0 Support for the SGLang-Native RL Framework - slime on AMD Instinct™ GPUs

25 September 2025

AMD is excited to provide Day-0 support for the SGLang-native RL framework, slime. In this post, we will provide more details about our support and optimizations, as well as slime’s benefits for large-scale RL training. First, we describe the engineering efforts behind slime, as well as Docker images that enable efficient execution on AMD Instinct™ GPUs. These efforts include codebase modification, kernel-level memory management for ROCm™ software, and modifications to third-party dependencies (Megatron-LM, SGLang, and torch_memory_saver). Architecturally, slime supports two training modes: synchronous and asynchronous. For each mode, we present system-level optimizations with the corresponding use cases. Specifically, in the synchronous setting, our rollout optimizations deliver a 40% throughput improvement over the unoptimized baseline on AMD Instinct™ GPUs. In the asynchronous setting, we develop a multi-turn RL agent framework to train the kernel generation model. You can also read more about this support in the MLsys – SGLang official blog.

Read more ...

Exploring Use Cases for Scalable AI: Implementing Ray with ROCm Support for Efficient ML Workflows

10 September 2025

In this blog, you will learn how to use Ray to easily scale your AI applications from your laptop to multiple AMD GPUs.

Read more ...

Slim Down Your Llama: Pruning & Fine-Tuning for Maximum Performance

09 September 2025

In this blog, we demonstrate how quantization, intelligent depth pruning and supervised fine-tuning can dramatically improve the inference performance of Meta’s Llama 3.1 405B model on AMD Instinct MI355X GPUs. By applying quantization and reducing the number of layers from the original 126, we are able to decrease memory requirements and boost token throughput. Additionally, with carefully applied fine-tuning, we maintain high inference accuracy for both RougeL and Exact Match metrics on MLPerf workloads. To see how these optimizations fit into AMD’s broader MLPerf Inference v5.1 efforts, read Reproducing the AMD Instinct™ GPUs MLPerf Inference v5.1 Submission. For a detailed technical breakdown of other optimizations, check out our Technical Dive into AMD’s MLPerf Inference v5.1 Submission.

Read more ...

Wan2.2 Fine-Tuning: Tailoring an Advanced Video Generation Model on a Single GPU

19 August 2025

This blog post will guide you through fine-tuning Wan2.2 - a state-of-the-art video generation model - on a single AMD Instinct MI300X GPU. By following this guide, you’ll unlock Wan2.2’s advanced video generation capabilities and customize the output — whether in a unique artistic style or a specialized domain — all while running memory efficiently even on a single GPU. Here are some examples of how you can put this guide into practice:

Read more ...

Chain-of-Thought Guided Visual Reasoning Using Llama 3.2 on a Single AMD Instinct MI300X GPU

21 July 2025

In this post, we will show you how to fine-tune the Llama 3.2 Vision Instruct models, specifically the 11B and 90B parameter variants, on a synthetic multi-modal dataset using torchtune. This blog focuses on chain-of-thought (CoT) guided visual reasoning, a technique where the model is encouraged to articulate intermediate reasoning steps before arriving at a final answer. By incorporating the CoT approach, we aim to improve the model’s interpretability and accuracy in tasks that require multi-step understanding of visual inputs. By utilizing the high-bandwidth memory (HBM) of the AMD Instinct™ MI300X GPU, we aim to enhance the model’s vision understanding, particularly for interpreting charts, all on a single GPU provided by TensorWave. Our evaluation shows that we can train an 11B parameter model to perform with 2.3x better accuracy than a 90B parameter model. The blog will walk you through our dataset preparation, model configuration, training recipes, and evaluation—all optimized to run on a single GPU.

Read more ...

Continued Pretraining: A Practical Playbook for Language-Specific LLM Adaptation

18 June 2025

What if you could make a state-of-the-art LLM fluent in a new language—without training from scratch? In this guide, we show how we did just that with Finnish.

Read more ...

Aligning Mixtral 8x7B with TRL on AMD GPUs

12 June 2025

Building a ChatGPT-like assistant is a multi-step process that starts with pre-training a large language model (LLM) on internet-scale data across clusters of thousands of GPUs, resulting in what is known as a “base model”. This base model is then refined through an instruction based supervised fine-tuning (SFT) process, which trains it to function as a useful digital assistant capable of understanding and responding accurately to a wide range of queries. Finally, human preference alignment is applied to enhance the model’s friendliness, helpfulness, and safety, ensuring that interactions are not only informative but also pleasant for users. This combination of techniques creates a sophisticated assistant that is both powerful and user-centric—exemplified by AMD’s new Instella-Long assistant.

Read more ...

Reinforcement Learning from Human Feedback on AMD GPUs with verl and ROCm Integration

24 April 2025

In this blog post, we provide an overview of Volcano Engine Reinforcement Learning for LLMs (verl) and discuss its benefits in large-scale reinforcement learning from human feedback (RLHF). We also detail the modifications made to the codebase to optimize verl’s performance on AMD Instinct GPUs. Next, we walk through the process of building the Docker image using a Dockerfile on the user side, along with training scripts tailored for both single-node and multi-node setups. Lastly, we present verl’s performance results, focusing on throughput and convergence accuracy achieved on AMD Instinct™ MI300X GPUs. Follow this guide to get started with verl on AMD Instinct GPUs and accelerate your RLHF training with ROCm-optimized performance.

Read more ...

Hands-On with CK-Tile: Develop and Run Optimized GEMM on AMD GPUs

15 April 2025

Composable Kernel (CK-Tile) for ROCm is used to build portable, high-performance kernels that accelerate computing workloads such as HPC, deep learning, and LLM training and inference. The CK-Tile APIs include vendor-optimized kernels such as GEMM, BatchGemm, fused-MHA, fused-MoE, SmoothQuant, element-wise kernels, and many others. This blog focuses on creating the most commonly used GEMM kernel, incorporating a vendor-optimized kernel pipeline and policies, and covers key CK-Tile concepts for quick learning.

Read more ...

Analyzing the Impact of Tensor Parallelism Configurations on LLM Inference Performance

14 March 2025

As AI models continue to scale in size and complexity, deploying them efficiently requires strategic resource allocation. Tensor parallelism (TP) is a valuable technique for distributing workloads across multiple GPUs, reducing memory constraints, and enabling inference for large-scale models. However, the choice of TP configuration isn’t one-size-fits-all—it directly impacts performance, networking overhead, and cost efficiency.
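To make the idea of distributing a workload across GPUs concrete, here is a minimal sketch of column-wise tensor parallelism for a single matrix multiply, simulated in plain Python. It is an illustrative assumption rather than the configuration studied in the post: the function names are hypothetical, each "device" is just a list slice, and the final concatenation stands in for the all-gather a real multi-GPU system would perform.

```python
def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def split_columns(w, num_shards):
    """Split weight matrix w column-wise into num_shards equal shards."""
    n_cols = len(w[0])
    if n_cols % num_shards != 0:
        raise ValueError("column count must be divisible by num_shards")
    width = n_cols // num_shards
    return [[row[i * width:(i + 1) * width] for row in w] for i in range(num_shards)]


def tensor_parallel_matmul(x, w, num_shards):
    """Compute x @ w by giving each simulated device one column shard of w."""
    partials = [matmul(x, shard) for shard in split_columns(w, num_shards)]
    # Concatenate each output row across shards (stand-in for an all-gather).
    return [sum((p[r] for p in partials), []) for r in range(len(x))]


if __name__ == "__main__":
    x = [[1, 2]]
    w = [[1, 2, 3, 4], [5, 6, 7, 8]]
    print(tensor_parallel_matmul(x, w, num_shards=2))
```

Each shard holds only a fraction of the weight columns, which reduces per-device memory, while the gather step is where the networking overhead mentioned above comes from.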

Read more ...

Optimized ROCm Docker for Distributed AI Training

13 March 2025

This blog will introduce you to the updated AMD Docker image, specifically built and optimized for distributed training. As you will see, the optimized AMD ROCm Docker image makes training large AI models faster and more efficient. It includes updates such as better fine-tuning tools, improved performance for multi-GPU setups, and support for FP8 precision, which helps speed up training while using less memory, and can provide you with an overall smoother and more efficient training experience on popular models such as Flux and Llama 3.1 running on AMD GPUs.

Read more ...

Introducing Instella: New State-of-the-art Fully Open 3B Language Models

05 March 2025

AMD is excited to announce Instella, a family of fully open state-of-the-art 3-billion-parameter language models (LMs) trained from scratch on AMD Instinct™ MI300X GPUs. Instella models outperform existing fully open models of similar sizes and achieve competitive performance compared to state-of-the-art open-weight models such as Llama-3.2-3B, Gemma-2-2B, and Qwen-2.5-3B, including their instruction-tuned counterparts.

Read more ...

Unlock DeepSeek-R1 Inference Performance on AMD Instinct™ MI300X GPU

21 February 2025

In this blog, we explore how DeepSeek-R1 achieves competitive performance on AMD Instinct™ MI300X GPUs, along with performance comparisons to H200 and a short demo application showcasing real-world usage. By leveraging MI300X, users can deploy DeepSeek-R1 and V3 models on a single node with impressive efficiency. In just two weeks, optimizations using SGLang have unlocked up to a 4X boost in inference speed, ensuring efficient scaling, lower latency, and optimized throughput. The MI300X’s high-bandwidth memory (HBM) and compute power enable execution of complex AI workloads, handling longer sequences and demanding reasoning tasks. With AMD and the SGLang community driving ongoing optimizations—including fused MoE kernels, MLA kernel fusion, and speculative decoding—MI300X is set to deliver an even more powerful AI inference experience.

Read more ...

PyTorch Fully Sharded Data Parallel (FSDP) on AMD GPUs with ROCm

09 February 2025

PyTorch Fully Sharded Data Parallel (FSDP) is a data parallelism technique that enables the training of large-scale models in a memory-efficient manner. FSDP achieves this memory efficiency by sharding model parameters, optimizer states, and/or gradients across GPUs, reducing the memory footprint required by each GPU. This enables the training of large-scale models with lower total GPU memory than DDP (Distributed Data Parallel), in which the model weights and optimizer states are replicated across all processes. To learn more about DDP, refer to Distributed Data Parallel (DDP) training on AMD GPU with ROCm.
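The memory argument above can be sketched with some back-of-the-envelope arithmetic. The helper below is hypothetical and not part of PyTorch. It assumes mixed-precision Adam training at roughly 16 bytes per parameter (2 for weights, 2 for gradients, 12 for optimizer state), ignores activations and temporary buffers, and assumes everything is fully sharded in the FSDP case.

```python
def per_gpu_memory_gib(num_params, num_gpus, sharded,
                       bytes_weights=2, bytes_grads=2, bytes_optim=12):
    """Estimate per-GPU memory (GiB) for weights, gradients, and optimizer state.

    The byte counts are assumptions for mixed-precision Adam; activations
    are not included. With sharded=False every GPU holds a full replica
    (DDP-style); with sharded=True the state is split evenly (FSDP-style).
    """
    total_bytes = num_params * (bytes_weights + bytes_grads + bytes_optim)
    if sharded:
        total_bytes /= num_gpus
    return total_bytes / 2**30


if __name__ == "__main__":
    ddp = per_gpu_memory_gib(7_000_000_000, num_gpus=8, sharded=False)
    fsdp = per_gpu_memory_gib(7_000_000_000, num_gpus=8, sharded=True)
    print(f"DDP: {ddp:.1f} GiB per GPU, FSDP: {fsdp:.1f} GiB per GPU")
```

Under these assumptions, a 7-billion-parameter model needs roughly 104 GiB per GPU when replicated and roughly 13 GiB per GPU when sharded across eight GPUs.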

Read more ...

Best practices for competitive inference optimization on AMD Instinct™ MI300X GPUs

29 January 2025

Optimizing LLM performance on GPUs is challenging due to diverse model needs, memory constraints, and the need to balance latency and throughput. This document examines how hardware utilization, memory and communication bandwidth, and scaling contribute to inference performance, and details optimal configurations for AMD Instinct™ MI300X GPUs.

Read more ...

Distributed fine-tuning of MPT-30B using Composer on AMD GPUs

28 January 2025

Composer, developed by MosaicML, is an open-source deep learning training library built on top of PyTorch, designed to simplify and optimize distributed training workflows. It supports scalable training on multiple nodes and efficiently handles datasets of various sizes. Composer integrates advanced techniques such as PyTorch Fully Sharded Data Parallelism (FSDP), elastic sharded checkpointing, training callbacks, and speed-up algorithms to enhance training performance and flexibility. It closely resembles PyTorch’s torchrun and has demonstrated exceptional efficiency when scaling to hundreds of GPUs.

Read more ...

Quantized 8-bit LLM training and inference using bitsandbytes on AMD GPUs

13 November 2024

In this blog post we will cover the bitsandbytes 8-bit representations. As you will see, the bitsandbytes 8-bit representations significantly reduce the memory needed for fine-tuning and running inference with LLMs. There are many quantization techniques used in the field to decrease a model's size, but bitsandbytes offers quantization to decrease the size of optimizer states as well. This post will help you understand the basic principles underlying the bitsandbytes 8-bit representations, explain the bitsandbytes 8-bit optimizer and LLM.int8 techniques, and show you how to implement these on AMD GPUs using ROCm.
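For intuition about why 8-bit representations save memory, the sketch below shows a simplified absmax quantization round trip in plain Python. This is a generic illustration, not the bitsandbytes API or its exact algorithm (which, for example, works block-wise and handles outliers separately); the function names are hypothetical.

```python
def quantize_absmax(values):
    """Map floats to integer codes in [-127, 127] plus one float scale."""
    absmax = max(abs(v) for v in values)
    if absmax == 0.0:
        return [0] * len(values), 1.0
    scale = absmax / 127.0
    return [round(v / scale) for v in values], scale


def dequantize(codes, scale):
    """Recover approximate floats from integer codes and the shared scale."""
    return [c * scale for c in codes]


if __name__ == "__main__":
    weights = [0.5, -1.0, 0.25, 0.03]
    codes, scale = quantize_absmax(weights)
    restored = dequantize(codes, scale)
    worst = max(abs(w - r) for w, r in zip(weights, restored))
    print(codes, f"max abs error = {worst:.5f}")
```

Each value is stored in one byte instead of four (for FP32), at the cost of a rounding error of at most about half the scale per value.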

Read more ...

Inference with Llama 3.2 Vision LLMs on AMD GPUs Using ROCm

23 October 2024

Meta’s Llama models now support multimodal capabilities, expanding their functionality beyond traditional text-only applications. The Llama 3.2 models are available in a range of sizes, including medium-sized 11B and 90B multimodal models for vision-text reasoning tasks, and lightweight 1B and 3B text-only models designed for edge and mobile devices.

Read more ...

Multinode Fine-Tuning of Stable Diffusion XL on AMD GPUs with Hugging Face Accelerate and OCI’s Kubernetes Engine (OKE)

15 October 2024

As the scale and complexity of generative AI and deep learning models grow, multinode training, basically dividing a training job across several processors, has become an essential strategy to speed up training and fine-tuning of large generative AI models like SDXL. By distributing the training workload across multiple GPUs on multiple nodes, multinode setups can significantly accelerate the training process. In this blog post we will show you, step by step, how to set up and fine-tune a Stable Diffusion XL (SDXL) model in a multinode Oracle Cloud Infrastructure (OCI) Kubernetes Engine (OKE) environment on AMD GPUs using ROCm.

Read more ...

Table Question-Answering with TaPas

26 April 2024

Read more ...

Multimodal (Visual and Language) understanding with LLaVA-NeXT

26 April 2024

Read more ...

Unlocking Vision-Text Dual-Encoding: Multi-GPU Training of a CLIP-Like Model

24 April 2024

Read more ...

Text Summarization with FLAN-T5

16 April 2024

In this blog, we showcase the language model FLAN-T5 and how to fine-tune it on a summarization task with Hugging Face on an AMD GPU + ROCm system.

Read more ...

Instruction fine-tuning of StarCoder with PEFT on multiple AMD GPUs

16 April 2024

Read more ...

Enhancing LLM Accessibility: A Deep Dive into QLoRA Through Fine-tuning Llama Model on a single AMD GPU

15 April 2024

Read more ...

Enhancing LLM Accessibility: A Deep Dive into QLoRA Through Fine-tuning Llama 2 on a single AMD GPU

15 April 2024

Read more ...

Small language models with Phi-2

08 April 2024

Like many other LLMs, Phi-2 is a transformer-based model with a next-word prediction objective that is trained on billions of tokens. At 2.7 billion parameters, Phi-2 is a relatively small language model, but it achieves outstanding performance on a variety of tasks, including common sense reasoning, language understanding, math, and coding. For reference, GPT-3 has 175 billion parameters and the smallest version of LLaMA-2 has 7 billion parameters. According to Microsoft, Phi-2 is capable of matching or outperforming models up to 25 times larger due to more carefully curated training data and model scaling.

Read more ...

Scale AI applications with Ray

01 April 2024

By Logan Grado and Eliot Li.

Read more ...

Large language model inference optimizations on AMD GPUs

15 March 2024

Read more ...

Building a decoder transformer model on AMD GPU(s)

12 March 2024

Read more ...

Question-answering Chatbot with LangChain on an AMD GPU

11 March 2024

Read more ...

Music Generation With MusicGen on an AMD GPU

08 March 2024

MusicGen is an autoregressive, transformer-based model that predicts the next segment of a piece of music based on previous segments. This is a similar approach to language models predicting the next token.

Read more ...

Simplifying deep learning: A guide to PyTorch Lightning

08 February 2024

Read more ...

Using LoRA for efficient fine-tuning: Fundamental principles

05 February 2024

Read more ...

Fine-tune Llama model with LoRA: Customizing a large language model for question-answering

01 February 2024

Read more ...

Fine-tune Llama 2 with LoRA: Customizing a large language model for question-answering

01 February 2024

Read more ...

Pre-training BERT using Hugging Face & TensorFlow on an AMD GPU

29 January 2024

Read more ...

Pre-training BERT using Hugging Face & PyTorch on an AMD GPU

26 January 2024

Read more ...

LLM distributed supervised fine-tuning with JAX

25 January 2024

Read more ...

Pre-training a large language model with Megatron-DeepSpeed on multiple AMD GPUs

24 January 2024

Read more ...