A Practical Guide to Running LLMs on AMD Radeon™ GPUs#

Running large language models on AMD Radeon™ GPUs has never been more accessible or more exciting. Thanks to rapid advancements in open‑source tooling and GPU acceleration, both Radeon™ integrated GPUs (iGPU) and discrete GPUs (dGPU) have become powerful, cost‑effective platforms for local AI. Whether you prefer a polished desktop application, a lightweight command‑line workflow, or a fully customizable runtime, a rich ecosystem of tools now makes it easy to deploy cutting‑edge models on your system. With today’s software stack, you can run state‑of‑the‑art language models directly on your Radeon™‑powered PC, whether you’re using integrated graphics or a high‑performance discrete card.

This guide describes how to run LLMs on AMD Radeon™ GPUs using a range of partner frameworks, tools, and runtimes. We’ll walk through step‑by‑step setup instructions for several popular LLM environments and show you how to configure them for optimal performance on Radeon™ hardware.

Getting Things Ready#

LLM toolkit:

  • Lemonade — a user‑friendly launcher built for seamless AMD acceleration. Supports GGUF and ONNX model formats, as well as CPU, GPU, and NPU execution.

  • LM Studio — a desktop environment for downloading, serving, and interacting with models

  • Ollama — a simple command‑line and graphical user interface (GUI) tool with an extensive model library

  • llama.cpp — the low‑level, highly optimized foundation powering many local‑AI solutions. It also serves as the foundation for many higher-level LLM frameworks.

Each of the four tools above offers unique strengths. From plug‑and‑play convenience to deep configurability, together they form a flexible toolkit for anyone looking to run fast, private, offline AI workloads on AMD GPUs.

Let’s dive in and explore what’s possible.

Prerequisites

Hardware requirements

Converting the Model#

If you already have a GGUF model, proceed to the Run LLMs on AMD Radeon™ GPUs Locally section.

This section explains how to convert a PyTorch checkpoint into the GGUF format so it can be used by the frameworks covered in this guide.

Lemonade, LM Studio, Ollama, and llama.cpp all support GGUF models, making it a unified format for running LLMs efficiently across different tools.

Note:

If you plan to use standard or built‑in model downloads provided by these frameworks, this conversion step can be skipped.

  1. Set up the environment.

conda create -n llm_convert python=3.12.4
conda activate llm_convert

git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
pip install -r requirements.txt
  2. Download the model from Hugging Face using Git LFS or the Hugging Face CLI.

Git LFS

git lfs install
git clone https://huggingface.co/microsoft/Phi-3.5-mini-instruct

Hugging Face CLI

pip install huggingface-hub
huggingface-cli download microsoft/Phi-3.5-mini-instruct --local-dir ./Phi-3.5-mini-instruct
  3. Convert the model to GGUF format.

python convert_hf_to_gguf.py ./Phi-3.5-mini-instruct --outfile phi-3.5-mini-instruct-fp16.gguf

Note: The conversion script automatically detects the model architecture. For most modern models (including Phi-3.5, Llama, Mistral, etc.), no special tokenizer handling is required.
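
Before moving on, it can be useful to confirm that the converted file really is a GGUF file. The following is a minimal sketch, assuming the little-endian GGUF header layout used by GGUF version 2 and later (a 4-byte magic, a 32-bit version, then 64-bit tensor and metadata counts). The helper name read_gguf_header is hypothetical and is not part of llama.cpp.

```python
import struct
from pathlib import Path

GGUF_MAGIC = b"GGUF"  # first four bytes of a GGUF file


def read_gguf_header(path):
    """Return version, tensor count, and metadata key count from a GGUF file.

    Assumes the GGUF v2+ header: magic, uint32 version, uint64 tensor count,
    uint64 metadata key/value count, all little-endian.
    """
    with Path(path).open("rb") as f:
        raw = f.read(24)
    if len(raw) < 24 or raw[:4] != GGUF_MAGIC:
        raise ValueError(f"{path} does not look like a GGUF file")
    version, n_tensors, n_kv = struct.unpack("<IQQ", raw[4:24])
    return {"version": version, "tensors": n_tensors, "metadata_keys": n_kv}


# Example (uncomment once the conversion step has produced the file):
# print(read_gguf_header("phi-3.5-mini-instruct-fp16.gguf"))
```

A file that fails this check was most likely truncated during download or conversion and should be regenerated.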

Run LLMs on AMD Radeon™ GPUs Locally#

This section walks through running LLMs on AMD Radeon™ GPUs using a range of partner frameworks, tools, and runtimes, with step‑by‑step setup instructions and configuration tips for optimal performance on Radeon hardware.

Lemonade#

Lemonade’s server application lets you run GGUF and ONNX models under the same UI. It’s built on key technologies including AMD ROCm™ software, llama.cpp, and ONNX Runtime GenAI. It also provides access to both the GPU and NPU for running LLMs.

Follow the instructions at https://lemonade-server.ai/ to install the Lemonade server on your local system and run LLMs.

Key features:

  • Text generation for natural language interfaces, agents, and tool-driven workflows.

  • Image generation for content creation and visual feedback loops inside applications.

  • Speech-to-text and speech recognition for accessibility scenarios and hands-free interaction.

LM Studio#

Download LM Studio for your platform from https://lmstudio.ai/.

Standard models can be downloaded using LM Studio’s model manager.

To use a custom-generated GGUF model with LM Studio:

  1. Open the LM Studio app and go to the model folder.

  2. Go to models > My Models.

  3. Create the lmstudio-community folder if it doesn’t already exist.

  4. Create the phi-3.5-mini folder: lmstudio-community/phi-3.5-mini/.

  5. Copy your phi-3.5-mini-instruct-Q4_K_M.gguf model into this folder.

  6. Go back to My Models and refresh to ensure it appears.

  7. Load the model in chat and start chatting.
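
Steps 3 to 5 above are plain file operations, so they can also be scripted. The sketch below is one way to do that under stated assumptions: install_gguf is a hypothetical helper, and models_root must be set to whatever models folder your LM Studio installation reports, since that location varies by platform and version.

```python
import shutil
from pathlib import Path


def install_gguf(gguf_file, models_root,
                 publisher="lmstudio-community", model_name="phi-3.5-mini"):
    """Copy a GGUF file into <models_root>/<publisher>/<model_name>/."""
    src = Path(gguf_file)
    if src.suffix.lower() != ".gguf":
        raise ValueError(f"expected a .gguf file, got {src.name}")
    target_dir = Path(models_root) / publisher / model_name
    target_dir.mkdir(parents=True, exist_ok=True)  # steps 3 and 4: create folders
    return Path(shutil.copy2(src, target_dir / src.name))  # step 5: copy the model


# Example (models_root is an assumption; use the folder shown in My Models):
# install_gguf("phi-3.5-mini-instruct-Q4_K_M.gguf", "/path/to/lmstudio/models")
```

After the copy completes, refresh My Models in LM Studio as described in step 6.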

llama.cpp Command Line Tools#

If you completed the Converting the Model section, llama.cpp is already cloned, so navigate to your llama.cpp directory. Otherwise, clone the repository as shown below.

Windows

Clone the repository.

git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp

Linux

Clone the repository.

git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp

Build with Vulkan Backend#

Prerequisites: Install the Vulkan SDK.

Windows

Build llama.cpp with Vulkan backend.

cmake -B build -DGGML_VULKAN=ON
cmake --build build --config Release

The compiled binaries will be in: build\bin\Release\ on Windows.

Linux

Build llama.cpp with Vulkan backend.

cmake -B build -DGGML_VULKAN=ON
cmake --build build --config Release

The compiled binaries will be in: build/bin/ on Linux.
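
Because the output folder differs between the two platforms, scripts that call the llama.cpp tools need to pick the right path. Here is a small sketch that encodes the two locations listed above; llama_binary is a hypothetical helper name, and the layout is assumed to match the default CMake build shown in this section.

```python
from pathlib import PurePosixPath, PureWindowsPath


def llama_binary(tool, system, build_dir="build"):
    """Return the expected path of a llama.cpp tool for 'Windows' or 'Linux'."""
    if system == "Windows":
        # Multi-config generators place binaries under bin\Release
        return PureWindowsPath(build_dir, "bin", "Release", f"{tool}.exe")
    if system == "Linux":
        return PurePosixPath(build_dir, "bin", tool)
    raise ValueError(f"unsupported system: {system}")


for system in ("Windows", "Linux"):
    print(system, llama_binary("llama-cli", system))
```

In a real script, the system argument could come from platform.system(), which returns "Windows" or "Linux" on these platforms.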

Quantize the Model#

If you need to reduce your model’s memory footprint or disk size, you can quantize the weights to a lower bit‑width. Quantization converts the original FP16 weights into more compact representations—such as INT4—resulting in significantly lower RAM usage and faster loading times, with minimal impact on model quality for many use cases.

Windows

The following example demonstrates how to convert an FP16 model to a 4‑bit Q4_K_M format using llama.cpp:

cd llama.cpp
cmake -B build
cmake --build build --config Release
build\bin\Release\llama-quantize.exe phi-3.5-mini-instruct-fp16.gguf phi-3.5-mini-instruct-Q4_K_M.gguf Q4_K_M

Quantization Options:

  • Q4_K_M: 4-bit quantization, medium quality (recommended balance of size and quality)

  • Q4_K_S: 4-bit quantization, small size (more compression, slight quality loss)

  • Q5_K_M: 5-bit quantization, medium quality (better quality, larger file)

  • Q8_0: 8-bit quantization (highest quality, larger file)

After quantization, the new .gguf file will be much smaller and can be used directly at runtime with llama.cpp or any inference runtime that supports GGUF formats.

Linux

The following example demonstrates how to convert an FP16 model to a 4‑bit Q4_K_M format using llama.cpp:

cd llama.cpp
cmake -B build
cmake --build build --config Release
build/bin/llama-quantize phi-3.5-mini-instruct-fp16.gguf phi-3.5-mini-instruct-Q4_K_M.gguf Q4_K_M

Quantization Options:

  • Q4_K_M: 4-bit quantization, medium quality (recommended balance of size and quality)

  • Q4_K_S: 4-bit quantization, small size (more compression, slight quality loss)

  • Q5_K_M: 5-bit quantization, medium quality (better quality, larger file)

  • Q8_0: 8-bit quantization (highest quality, larger file)

After quantization, the new .gguf file will be much smaller and can be used directly at runtime with llama.cpp or any inference runtime that supports GGUF formats.
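
To get a feel for how much each option shrinks the file, you can estimate the size from the parameter count and the average bits stored per weight. The sketch below is a rough back-of-the-envelope calculation, not a measurement: the bits-per-weight figures are approximate assumptions, and the 3.8 billion parameter count for Phi-3.5-mini is likewise assumed.

```python
# Approximate average bits per weight; illustrative assumptions, not measured values.
APPROX_BITS_PER_WEIGHT = {
    "F16": 16.0,
    "Q8_0": 8.5,
    "Q5_K_M": 5.7,
    "Q4_K_M": 4.9,
    "Q4_K_S": 4.6,
}


def estimate_size_gb(n_params, quant):
    """Estimate GGUF file size in GB (10**9 bytes) for a quantization type."""
    bits = APPROX_BITS_PER_WEIGHT[quant]
    return n_params * bits / 8 / 1e9


N_PARAMS = 3.8e9  # assumed parameter count for Phi-3.5-mini
for quant in APPROX_BITS_PER_WEIGHT:
    print(f"{quant:7s} ~{estimate_size_gb(N_PARAMS, quant):.1f} GB")
```

Real files differ somewhat because some tensors are kept at higher precision and the file also stores metadata, so treat the numbers as a guide for choosing between options rather than exact sizes.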

Run llama-cli with Your Model#

Windows

Basic Usage:

build\bin\Release\llama-cli.exe -m phi-3.5-mini-instruct-Q4_K_M.gguf -p "What is 2+2? Answer:" -ngl 33 -c 4096 -n 100

Interactive Chat Mode:

build\bin\Release\llama-cli.exe -m phi-3.5-mini-instruct-Q4_K_M.gguf -ngl 33 -c 4096 --interactive

With Phi-3.5 Chat Format:

build\bin\Release\llama-cli.exe -m phi-3.5-mini-instruct-Q4_K_M.gguf -p "<|system|>You are a helpful AI assistant.<|end|><|user|>What is the capital of France?<|end|><|assistant|>" -ngl 33 -c 4096 -n 100
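
Typing the chat tags by hand is error-prone, so it may help to build the prompt string programmatically. The following sketch reproduces the one-line prompt used in the command above; phi35_prompt is a hypothetical helper, and the optional sep argument is an assumption that lets you place each tag on its own line instead.

```python
def phi35_prompt(user, system="You are a helpful AI assistant.", sep=""):
    r"""Build a single-turn Phi-3.5 prompt that ends with the assistant tag.

    sep="" mirrors the one-line form passed to -p; sep="\n" starts a new
    line after each tag.
    """
    return (
        f"<|system|>{sep}{system}<|end|>{sep}"
        f"<|user|>{sep}{user}<|end|>{sep}"
        f"<|assistant|>{sep}"
    )


print(phi35_prompt("What is the capital of France?"))
```

The printed string can be passed to the -p option unchanged.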

Key Parameters:

  • -m: Path to your GGUF model file

  • -p: Your prompt text

  • -c: REQUIRED - Context window size (use 4096 or 8192, NOT the full 128K)

  • -ngl: Number of GPU layers to offload (33 for Phi-3.5-mini = all layers, or use -1)

  • -n: Maximum tokens to generate (increase for longer responses; the default depends on your llama.cpp version, and recent builds generate until the model stops)

  • --interactive: Enable interactive chat mode

Troubleshooting:

  • Error “ErrorOutOfDeviceMemory”: Reduce -c value (try 2048 or lower)

  • Slow performance: Ensure -ngl 33 is set to use GPU

  • Model not found: Use absolute path or ensure you’re in the correct directory

  • Context too large: Use -c 4096 instead of default (which tries to use full 131K)

Performance Tips:

  • -c 4096: Good balance (uses ~1.5GB VRAM for the KV cache)

  • -c 8192: Better for longer conversations (uses ~3GB VRAM for KV cache)

  • -c 2048: Use if GPU memory is limited

  • -ngl 33: Offloads all 33 layers to GPU (recommended)

  • Model requires ~2GB VRAM + additional memory for the context cache (varies with -c value)

Linux

Basic Usage:

build/bin/llama-cli -m phi-3.5-mini-instruct-Q4_K_M.gguf -p "What is 2+2? Answer:" -ngl 33 -c 4096 -n 100

Interactive Chat Mode:

build/bin/llama-cli -m phi-3.5-mini-instruct-Q4_K_M.gguf -ngl 33 -c 4096 --interactive

With Phi-3.5 Chat Format:

build/bin/llama-cli -m phi-3.5-mini-instruct-Q4_K_M.gguf -p "<|system|>You are a helpful AI assistant.<|end|><|user|>What is the capital of France?<|end|><|assistant|>" -ngl 33 -c 4096 -n 100

Key Parameters:

  • -m: Path to your GGUF model file

  • -p: Your prompt text

  • -c: REQUIRED - Context window size (use 4096 or 8192, NOT the full 128K)

  • -ngl: Number of GPU layers to offload (33 for Phi-3.5-mini = all layers, or use -1)

  • -n: Maximum tokens to generate (increase for longer responses; the default depends on your llama.cpp version, and recent builds generate until the model stops)

  • --interactive: Enable interactive chat mode

Troubleshooting:

  • Error “ErrorOutOfDeviceMemory”: Reduce -c value (try 2048 or lower)

  • Slow performance: Ensure -ngl 33 is set to use GPU

  • Model not found: Use absolute path or ensure you’re in the correct directory

  • Context too large: Use -c 4096 instead of default (which tries to use full 131K)

Performance Tips:

  • -c 4096: Good balance (uses ~1.5GB VRAM for KV cache)

  • -c 8192: Better for longer conversations (uses ~3GB VRAM for KV cache)

  • -c 2048: Use if GPU memory is limited

  • -ngl 33: Offloads all 33 layers to GPU (recommended)

  • Model requires ~2GB VRAM plus additional memory for the context cache (varies with -c value)
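
The KV cache figures in the tips above follow from the model's shape. As a rough sketch, the cache stores one key and one value vector per layer for every token in the context. The dimensions below (32 layers, 32 KV heads, head size 96) are assumptions taken from the published Phi-3.5-mini configuration, and the cache is assumed to be stored in FP16 (2 bytes per value).

```python
def kv_cache_bytes(n_ctx, n_layers=32, n_kv_heads=32, head_dim=96, bytes_per_value=2):
    """Bytes needed to cache keys and values for n_ctx tokens."""
    # Factor 2 accounts for storing both the key and the value tensors.
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value
    return n_ctx * per_token


for n_ctx in (2048, 4096, 8192, 131072):
    print(f"-c {n_ctx:>6}: {kv_cache_bytes(n_ctx) / 2**30:.2f} GiB")
```

Under these assumptions the cache needs 0.75 GiB at -c 2048, 1.5 GiB at -c 4096, 3 GiB at -c 8192, and 48 GiB at the full 131072-token window, which is why limiting -c matters on consumer GPUs.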

Ollama#

Ollama provides a simple way to run large language models locally with ROCm support for AMD GPUs.

Installation#

Prerequisites for AMD GPU Support:

  • ROCm must be installed on your system (see the ROCm installation guide)

  • Ollama will automatically detect and use AMD GPUs with ROCm

Windows

  1. Download the Windows installer (OllamaSetup.exe) from https://ollama.com/download.

  2. Run the installer:

    # Double-click OllamaSetup.exe or run from PowerShell:
    Start-Process "$env:USERPROFILE\Downloads\OllamaSetup.exe"
    
  3. Follow the installation wizard:

    • Accept the license agreement

    • Choose the installation location if prompted (by default, Ollama installs per user under %LOCALAPPDATA%\Programs\Ollama)

    • Complete the installation

  4. Verify installation:

    ollama --version
    

Linux

curl -fsSL https://ollama.com/install.sh | sh

Running with ROCm Backend#

Windows

Ollama automatically detects and uses AMD GPUs with ROCm support.

Option A: Use a Model from Ollama Library

Pull and run a pre-configured model:

ollama run phi3.5

Option B: Use Your Custom GGUF Model (from the “Converting the Model” section)

Create a Modelfile to import your converted model:

  1. Create a Modelfile:

    # Create Modelfile in the same directory as your GGUF model
    FROM ./phi-3.5-mini-instruct-Q4_K_M.gguf
    
    TEMPLATE """<|system|>
    {{ .System }}<|end|>
    <|user|>
    {{ .Prompt }}<|end|>
    <|assistant|>
    """
    
    PARAMETER stop "<|end|>"
    PARAMETER stop "<|endoftext|>"
    
  2. Create the model in Ollama:

    ollama create phi3.5-custom -f Modelfile
    
  3. Run your custom model:

    ollama run phi3.5-custom
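
If you import several models, writing each Modelfile by hand gets repetitive. The sketch below generates the same Modelfile text as step 1 from a GGUF file name; build_modelfile and write_modelfile are hypothetical helpers, and the template simply mirrors the one shown above rather than an official Ollama template.

```python
from pathlib import Path

# Same chat template as the Modelfile in step 1, one tag per line.
PHI35_TEMPLATE = (
    "<|system|>\n{{ .System }}<|end|>\n"
    "<|user|>\n{{ .Prompt }}<|end|>\n"
    "<|assistant|>\n"
)


def build_modelfile(gguf_name, stops=("<|end|>", "<|endoftext|>")):
    """Return Modelfile text that imports gguf_name from the current directory."""
    lines = [f"FROM ./{gguf_name}", "", f'TEMPLATE """{PHI35_TEMPLATE}"""', ""]
    lines += [f'PARAMETER stop "{s}"' for s in stops]
    return "\n".join(lines) + "\n"


def write_modelfile(model_path):
    """Write a Modelfile next to the GGUF file and return its path."""
    model_path = Path(model_path)
    out = model_path.parent / "Modelfile"
    out.write_text(build_modelfile(model_path.name), encoding="utf-8")
    return out


print(build_modelfile("phi-3.5-mini-instruct-Q4_K_M.gguf"))
```

The generated file can then be passed to ollama create exactly as in step 2.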
    

Example Output:

>>> What is 2+2?
The sum of 2 + 2 is 4. This arithmetic operation follows the basic principles
of addition, where combining two sets or instances of a number results in their
total when added together.

Useful Commands:

# List all models
ollama list

# Delete a model
ollama rm phi3.5-custom

# Show model information
ollama show phi3.5-custom

Multi-GPU Systems:

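If you have multiple AMD GPUs, set the HIP_VISIBLE_DEVICES environment variable to choose which one Ollama uses. The variable takes effect in the Ollama server process, so if a server is already running in the background, stop it and start it again from a shell where the variable is set: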

# Windows PowerShell
$env:HIP_VISIBLE_DEVICES="1"; ollama run phi3.5

# Linux
export HIP_VISIBLE_DEVICES=1
ollama run phi3.5

Available Models:

Visit https://ollama.com/library to browse hundreds of available models, including:

  • phi3.5, phi3, phi3:medium

  • llama3.1, llama3.2, llama2

  • mistral, mixtral

  • codellama, deepseek-coder

  • And many more

Linux

Ollama automatically detects and uses AMD GPUs with ROCm support.

Option A: Use a Model from Ollama Library

Pull and run a pre-configured model:

ollama run phi3.5

Option B: Use Your Custom GGUF Model (from the “Converting the Model” section)

Create a Modelfile to import your converted model:

  1. Create a Modelfile:

    # Create Modelfile in the same directory as your GGUF model
    FROM ./phi-3.5-mini-instruct-Q4_K_M.gguf
    
    TEMPLATE """<|system|>
    {{ .System }}<|end|>
    <|user|>
    {{ .Prompt }}<|end|>
    <|assistant|>
    """
    
    PARAMETER stop "<|end|>"
    PARAMETER stop "<|endoftext|>"
    
  2. Create the model in Ollama:

    ollama create phi3.5-custom -f Modelfile
    
  3. Run your custom model:

    ollama run phi3.5-custom
    

Example Output:

>>> What is 2+2?
The sum of 2 + 2 is 4. This arithmetic operation follows the basic principles
of addition, where combining two sets or instances of a number results in their
total when added together.

Useful Commands:

# List all models
ollama list

# Delete a model
ollama rm phi3.5-custom

# Show model information
ollama show phi3.5-custom
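
Beyond the command line, a running Ollama server can also be called from code over its local HTTP API. The following is a minimal sketch using only the Python standard library. It assumes the server listens on the default local address http://localhost:11434 and that the model name matches one you have pulled or created; build_request and generate are hypothetical helper names.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # assumed default local address


def build_request(model, prompt, url=OLLAMA_URL):
    """Build a non-streaming generate request for the Ollama HTTP API."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(model, prompt, timeout=120):
    """Send the request and return the generated text."""
    with urllib.request.urlopen(build_request(model, prompt), timeout=timeout) as resp:
        return json.loads(resp.read())["response"]


# Example (requires a running Ollama server with the model available):
# print(generate("phi3.5-custom", "What is 2+2?"))
```

Setting "stream" to False asks for a single JSON reply, which keeps the sketch short; interactive applications would usually stream tokens instead.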

Multi-GPU Systems:

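If you have multiple AMD GPUs, set the HIP_VISIBLE_DEVICES environment variable to choose which one Ollama uses. The variable takes effect in the Ollama server process, so if a server is already running in the background, stop it and start it again from a shell where the variable is set: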

# Windows PowerShell
$env:HIP_VISIBLE_DEVICES="1"; ollama run phi3.5

# Linux
export HIP_VISIBLE_DEVICES=1
ollama run phi3.5
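
The same GPU selection can be applied when launching Ollama from a script. The sketch below builds an environment with HIP_VISIBLE_DEVICES set and leaves the actual launch commented out; env_with_gpu is a hypothetical helper, and the commented example assumes the ollama binary is on your PATH and that no server is already running.

```python
import os


def env_with_gpu(device_ids, base_env=None):
    """Return a copy of the environment with HIP_VISIBLE_DEVICES set."""
    env = dict(os.environ if base_env is None else base_env)
    # Multiple device indices are passed as a comma-separated list.
    env["HIP_VISIBLE_DEVICES"] = ",".join(str(i) for i in device_ids)
    return env


print(env_with_gpu([1], base_env={})["HIP_VISIBLE_DEVICES"])

# Example (assumption: starts the server process with the restricted device list):
# import subprocess
# server = subprocess.Popen(["ollama", "serve"], env=env_with_gpu([1]))
```

Passing the environment to the server process, rather than to the client command, is what makes the device selection take effect.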

Available Models:

Visit https://ollama.com/library to browse hundreds of available models, including:

  • phi3.5, phi3, phi3:medium

  • llama3.1, llama3.2, llama2

  • mistral, mixtral

  • codellama, deepseek-coder

  • And many more

Python Bindings#

Install llama-cpp-python with Vulkan Support#

Windows

set CMAKE_ARGS=-DGGML_VULKAN=on
pip install llama-cpp-python

Linux

export CMAKE_ARGS="-DGGML_VULKAN=on"
pip install llama-cpp-python
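
After installation, a quick check can tell you whether the package imports and whether it was built with a GPU backend. The sketch below is written defensively: it assumes the low-level binding llama_supports_gpu_offload is exposed by your llama-cpp-python version and returns None if the package or that function is not available. The helper name gpu_offload_status is hypothetical.

```python
import importlib


def gpu_offload_status():
    """Return True/False if llama_cpp reports GPU offload support, else None."""
    try:
        llama_cpp = importlib.import_module("llama_cpp")
    except (ImportError, OSError, RuntimeError):
        return None  # package missing or its native library failed to load
    # Assumption: this binding exists in the installed version.
    probe = getattr(llama_cpp, "llama_supports_gpu_offload", None)
    return bool(probe()) if callable(probe) else None


status = gpu_offload_status()
messages = {True: "GPU offload available", False: "CPU-only build", None: "could not determine"}
print(messages[status])
```

If the result is a CPU-only build, reinstall the package with CMAKE_ARGS set as shown above so that the Vulkan backend is compiled in.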

Using Phi-3.5 chat format#

The examples below demonstrate the basic usage of the Phi-3.5 model on Windows and Linux using llama.cpp with Python.

Windows#

from llama_cpp import Llama

llm = Llama(
    model_path="phi-3.5-mini-instruct-Q4_K_M.gguf",
    n_gpu_layers=-1,
    n_ctx=4096,
    chat_format=None  # use the chat template embedded in the GGUF (Phi-3.5's tag format is not ChatML)
)

# Chat completion
messages = [
    {"role": "system", "content": "You are a helpful AI assistant."},
    {"role": "user", "content": "Explain quantum computing in simple terms."}
]

response = llm.create_chat_completion(
    messages=messages,
    max_tokens=256,
    temperature=0.7
)

print(response['choices'][0]['message']['content'])

Key Parameters:

  • model_path: Path to your GGUF model

  • n_gpu_layers: Number of layers to offload to GPU (-1 = all layers)

  • n_ctx: Context window size (max 128000 for Phi-3.5, but 4096-8192 recommended)

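  • chat_format: Chat template to apply; leave it unset (None) to use the template embedded in the GGUF file. Phi-3.5 uses <|system|>, <|user|>, <|assistant|>, and <|end|> tags rather than ChatML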

  • temperature: Controls randomness (0.0 = deterministic, 1.0 = creative)

  • max_tokens: Maximum tokens to generate

Linux#

from llama_cpp import Llama

llm = Llama(
    model_path="phi-3.5-mini-instruct-Q4_K_M.gguf",
    n_gpu_layers=-1,
    n_ctx=4096,
    chat_format=None  # use the chat template embedded in the GGUF (Phi-3.5's tag format is not ChatML)
)

# Chat completion
messages = [
    {"role": "system", "content": "You are a helpful AI assistant."},
    {"role": "user", "content": "Explain quantum computing in simple terms."}
]

response = llm.create_chat_completion(
    messages=messages,
    max_tokens=256,
    temperature=0.7
)

print(response['choices'][0]['message']['content'])

Key Parameters:

  • model_path: Path to your GGUF model

  • n_gpu_layers: Number of layers to offload to GPU (-1 = all layers)

  • n_ctx: Context window size (max 128000 for Phi-3.5, but 4096-8192 recommended)

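  • chat_format: Chat template to apply; leave it unset (None) to use the template embedded in the GGUF file. Phi-3.5 uses <|system|>, <|user|>, <|assistant|>, and <|end|> tags rather than ChatML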

  • temperature: Controls randomness (0.0 = deterministic, 1.0 = creative)

  • max_tokens: Maximum tokens to generate
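
The examples above send a single request. For a multi-turn conversation, the usual pattern is to keep a running message list and append each user message and model reply to it. The sketch below shows that pattern under stated assumptions: chat_turn is a hypothetical helper, and EchoModel is a stand-in with the same create_chat_completion call shape so the code runs without a GPU or model file. In practice you would pass the Llama object created earlier.

```python
def chat_turn(llm, history, user_text, max_tokens=256, temperature=0.7):
    """Send one user message, append both sides to history, return the reply."""
    history.append({"role": "user", "content": user_text})
    response = llm.create_chat_completion(
        messages=history, max_tokens=max_tokens, temperature=temperature
    )
    reply = response["choices"][0]["message"]["content"]
    history.append({"role": "assistant", "content": reply})
    return reply


class EchoModel:
    """Hypothetical stand-in for Llama; echoes the last message back."""

    def create_chat_completion(self, messages, max_tokens, temperature):
        text = f"(echo) {messages[-1]['content']}"
        return {"choices": [{"message": {"role": "assistant", "content": text}}]}


history = [{"role": "system", "content": "You are a helpful AI assistant."}]
print(chat_turn(EchoModel(), history, "Explain quantum computing in simple terms."))
```

Because the whole history is resent on every turn, long conversations eventually approach the n_ctx limit, so older turns may need to be trimmed or summarized.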

Summary#

In this blog, you explored how running large language models locally on AMD Radeon GPUs has moved from an experimental niche to a practical, flexible option for developers and enthusiasts. You learned what Lemonade, LM Studio, Ollama, and llama.cpp each bring to the table, how they map to different workflows (from plug-and-play desktop apps to command-line tools and Python bindings), and why that choice matters for performance, privacy, and cost.

As both hardware and software continue to evolve, the AMD AI ecosystem is poised to make local AI even more capable and widely accessible. Now is the perfect time to explore what’s possible with Radeon-powered AI.

Additional Resources#

Disclaimers#

Third-party content is licensed to you directly by the third party that owns the content and is not licensed to you by AMD. ALL LINKED THIRD-PARTY CONTENT IS PROVIDED “AS IS” WITHOUT A WARRANTY OF ANY KIND. USE OF SUCH THIRD-PARTY CONTENT IS DONE AT YOUR SOLE DISCRETION AND UNDER NO CIRCUMSTANCES WILL AMD BE LIABLE TO YOU FOR ANY THIRD-PARTY CONTENT. YOU ASSUME ALL RISK AND ARE SOLELY RESPONSIBLE FOR ANY DAMAGES THAT MAY ARISE FROM YOUR USE OF THIRD-PARTY CONTENT.