Posts by Clint Greene

Enabling FlashInfer on ROCm for Accelerated LLM Serving

01 October 2025

FlashInfer is an innovative framework designed to accelerate inference of large language models (LLMs). Given the explosive growth and adoption of models like DeepSeek R1, Llama 3, and Qwen 3, efficient inference is critical to meet the demands of real-world deployment. However, challenges such as GPU memory bottlenecks, throughput limitations, and latency remain significant hurdles for deploying these models at scale.

Read more ...

Technical Dive into AMD’s MLPerf Inference v5.1 Submission

09 September 2025

In the rapidly evolving landscape of artificial intelligence, the demand for reliable and efficient model inference has never been greater. With advancements in large language models (LLMs) and a growing reliance on real-time applications, benchmarks are critical in evaluating how well AI systems perform under varying conditions. Enter MLPerf Inference: Datacenter v5.1 — a significant update to the well-respected benchmarking suite that assesses inference performance across a wide array of models and use cases, catering especially to data centers.

Read more ...

Slim Down Your Llama: Pruning & Fine-Tuning for Maximum Performance

09 September 2025

In this blog, we demonstrate how quantization, intelligent depth pruning, and supervised fine-tuning can dramatically improve the inference performance of Meta’s Llama 3.1 405B model on AMD Instinct MI355X GPUs. By applying quantization and reducing the number of layers from the original 126, we are able to decrease memory requirements and boost token throughput. Additionally, with carefully applied fine-tuning, we maintain high inference accuracy for both ROUGE-L and Exact Match metrics on MLPerf workloads. To see how these optimizations fit into AMD’s broader MLPerf Inference v5.1 efforts, read Reproducing the AMD Instinct™ GPUs MLPerf Inference v5.1 Submission. For a detailed technical breakdown of other optimizations, check out our Technical Dive into AMD’s MLPerf Inference v5.1 Submission.

Read more ...

Reproducing the AMD Instinct™ GPUs MLPerf Inference v5.1 Submission

09 September 2025

MLPerf Inference v5.1 marks AMD’s third round of submissions and the most ambitious yet. This round features submissions on AMD Instinct MI325X and MI355X systems, including multi-node inference and models in the MXFP4 datatype. Building on the success of MLPerf Inference v5.0, AMD has submitted improved results for Llama 2 70B and SDXL on the MI325X platform in this round using new optimization techniques. For a deeper look at these optimizations, see our Technical Dive into AMD’s MLPerf Inference v5.1 Submission. You can also explore how we optimized Llama 3.1 405B through pruning and fine-tuning in Slim Down Your Llama: Pruning & Fine-Tuning for Maximum Performance. AMD has also made submissions for the following workloads:

Read more ...

Accelerating Video Generation on ROCm with Unified Sequence Parallelism: A Practical Guide

11 July 2025

Video generation models like HunyuanVideo and Wan 2.1 are rapidly improving, producing high-fidelity text-to-video and image-to-video outputs. These models generate content with such realism that distinguishing synthetic videos from real ones is increasingly difficult. At the core of this progress lies diffusion-based generative modeling, which has evolved from traditional U-Net–style convolutional encoder-decoders to more powerful Diffusion Transformers (DiTs). This architectural shift enables better modeling of complex spatial-temporal dependencies across frames, addressing key limitations in earlier designs.

Read more ...

Aligning Mixtral 8x7B with TRL on AMD GPUs

12 June 2025

Building a ChatGPT-like assistant is a multi-step process that starts with pre-training a large language model (LLM) on internet-scale data across clusters of thousands of GPUs, resulting in what is known as a “base model”. This base model is then refined through an instruction-based supervised fine-tuning (SFT) process, which trains it to function as a useful digital assistant capable of understanding and responding accurately to a wide range of queries. Finally, human preference alignment is applied to enhance the model’s friendliness, helpfulness, and safety, ensuring that interactions are not only informative but also pleasant for users. This combination of techniques creates a sophisticated assistant that is both powerful and user-centric, as exemplified by AMD’s new Instella-Long assistant.

Read more ...

Supercharging JAX with Triton Kernels on AMD GPUs

09 October 2024

Ready to supercharge your deep learning applications on AMD GPUs? In this blog, we’ll show you how to develop a custom fused dropout activation kernel for matrices in Triton, seamlessly call it from JAX, and benchmark its performance with ROCm. This powerful combination will take your model’s performance to the next level.

Read more ...

Fine-tuning Llama 3 with Axolotl using ROCm on AMD GPUs

23 September 2024

Large Language Models (LLMs) have revolutionized the field of natural language processing, enabling machines to understand and generate human-like language. However, these models are often trained on vast amounts of general-purpose data, which can make them less effective for specific tasks or domains. Fine-tuning involves training a pre-trained LLM on a specialized dataset to enhance its performance on specific tasks. As Andrej Karpathy analogized, this process is akin to allowing someone to practice a particular skill. Just as a person might need to practice a skill in a specific context to become proficient, an LLM needs to be fine-tuned on a specific dataset to become proficient in a particular task. For instance, an LLM can be fine-tuned for tasks such as financial forecasting, technical support, legal advising, medical diagnosis, or even instruction following. By fine-tuning an LLM, organizations can achieve better results and improve information security by limiting the exposure of sensitive data.

Read more ...

Inferencing and serving with vLLM on AMD GPUs

19 September 2024

09 June 2025

In the rapidly evolving field of artificial intelligence, Large Language Models (LLMs) have emerged as powerful tools for understanding and generating human-like text. However, deploying these models efficiently at scale presents significant challenges. This is where vLLM comes into play. vLLM is an open-source library designed to optimize the serving of LLMs. Central to vLLM is PagedAttention, an algorithm that makes the model’s attention mechanism more efficient by managing the attention key-value cache in fixed-size blocks, much as an operating system pages virtual memory. This approach optimizes GPU memory utilization, facilitating the processing of longer sequences and enabling more efficient handling of large models within existing hardware constraints. Additionally, vLLM incorporates continuous batching to maximize throughput and minimize latency. Together, these techniques improve the performance and scalability of LLM deployment, allowing organizations to serve state-of-the-art AI models more effectively and economically.

Read more ...

Enhancing vLLM Inference on AMD GPUs

19 September 2024

09 June 2025

In this blog, we’ll demonstrate the latest performance enhancements in vLLM inference on AMD Instinct accelerators using ROCm 6.2. In a nutshell, vLLM optimizes GPU memory utilization, allowing more efficient handling of large language models (LLMs) within existing hardware constraints, maximizing throughput and minimizing latency. We start the blog by briefly explaining how causal language models like Llama 3 and ChatGPT generate text, motivating the need to enhance throughput and reduce latency. If you’re new to vLLM, we also recommend reading our introduction to Inferencing and serving with vLLM on AMD GPUs. ROCm 6.2 introduces support for the following vLLM features, which we will use in this blog post.

Read more ...

Accelerating Large Language Models with Flash Attention on AMD GPUs

15 May 2024

Read more ...

Inferencing with Mixtral 8x22B on AMD GPUs

01 May 2024

Read more ...

Speech-to-Text on an AMD GPU with Whisper

16 April 2024

Read more ...

Developing Triton Kernels on AMD GPUs

15 April 2024

19 March 2025

OpenAI has developed a powerful GPU-focused programming language and compiler called Triton that works seamlessly with AMD GPUs. The goal of Triton is to enable AI engineers and scientists to write high-performance GPU code with minimal expertise. Triton kernels are performant because of their blocked program representation, which allows them to be compiled into highly optimized binary code. Triton also leverages Python for kernel development, making it both familiar and accessible. Kernels are compiled simply by placing the triton.jit Python decorator before the kernel definition.

Read more ...

Retrieval Augmented Generation (RAG) using LlamaIndex

04 April 2024

Read more ...

Accelerating XGBoost with Dask using multiple AMD GPUs

26 January 2024

Read more ...