Posts tagged AI/ML

STX-B0T: Real-time AI Robot Assistant Powered by RyzenAI and ROCm

23 October 2025

In this blog post, we introduce STX-B0T, our human-responsive social robot prototype powered by AMD Ryzen AI’s CVMLSDK, the Strix-Halo (STX-Halo) APU, and built with open-source hardware.

Read more ...

Empowering Developers to Build a Robust PyTorch Ecosystem on AMD ROCm™ with Better Insights and Monitoring

21 October 2025

The PyTorch ecosystem is a vibrant and expansive collection of tools, libraries, and community-driven projects that enhance and extend the core PyTorch framework. It empowers researchers and developers to build, train, and deploy deep learning models across a wide range of domains with flexibility and efficiency.

Read more ...

ROCm 7.9 Technology Preview: ROCm Core SDK and TheRock Build System

20 October 2025

ROCm Core SDK 7.9.0 is a technology preview release built by TheRock. The Core SDK is a slimmed down version of the traditional ROCm release, focusing on foundational components for running high performance AI workloads on AMD GPUs. TheRock introduces a streamlined uniform system for building and testing these core components. TheRock manages dependencies between packages, so changes affecting several ROCm components can be coordinated automatically. Python releases built using TheRock can be easily installed in virtual environments, so developers can try new ROCm versions and features quickly without making system changes.

Read more ...

Kimi-K2-Instruct: Enhanced Out-of-the-Box Performance on AMD Instinct MI355 Series GPUs

16 October 2025

Learn how to boost AI inference performance with the Kimi-K2-Instruct model on AMD Instinct MI355 Series GPUs. This blog highlights benchmark results against B200 GPUs, focusing on faster time to first token (TTFT), lower latency, and higher throughput. You’ll also see how MI355X Series GPUs excel in high-concurrency workloads thanks to their larger memory capacity. By the end you’ll know how to evaluate and deploy MI355X GPUs with SGLang to scale demanding applications efficiently.

Read more ...

Gumiho: A New Paradigm for Speculative Decoding — Earlier Tokens in a Draft Sequence Matter More

14 October 2025

Speculative decoding has emerged as a promising approach to accelerate large language model (LLM) inference, yet existing methods face a tradeoff: parallel designs achieve higher speed but lose accuracy, while serial designs gain accuracy at the cost of efficiency. In our recent paper Gumiho: A Hybrid Architecture to Prioritize Early Tokens in Speculative Decoding, we introduce a new paradigm that addresses this bottleneck by prioritizing accuracy on the earliest draft tokens, which matters most for downstream acceptance. In this blog, we will discuss the motivation behind Gumiho, the theoretical foundation showing why early-token accuracy dominates, and the novel hybrid architecture that combines serial and parallel decoding to realize these insights. Our goal is to demonstrate both the scientific contributions and practical benefits of Gumiho, showing how it delivers state-of-the-art performance on AMD GPUs using the ROCm software stack, ensuring that the method is widely accessible and optimized for real-world deployment.

Read more ...

GEMM Tuning within hipBLASLt– Part 2

09 October 2025

This post continues from Part 1 where we introduced GEMM tuning concepts in hipBLASLt and explored the basics of solution search. In Part 2, we focus on offline tuning with the hipblaslt-bench tool. This workflow allows developers to benchmark candidate GEMM kernels for specific problem shapes, capture the best-performing solutions, and reuse them at runtime without rebuilding or modifying the hipBLASLt library.

Read more ...

Medical Imaging on MI300X: Optimized SwinUNETR for Tumor Detection

07 October 2025

This blog is part of a series of walkthroughs of Life Science AI models, stemming from this article.

Read more ...

Announcing MONAI 1.0.0 for AMD ROCm: Breakthrough AI Acceleration for Medical Imaging Models on AMD Instinct™ GPUs

07 October 2025

Today, AMD is thrilled to announce MONAI 1.0.0 for AMD ROCm, now available to the community as part of the ROCm-LS Early Access release. In this blog you will learn how to use MONAI to load an analyze medical images, and see results from more advanced case studies.

Read more ...

Optimizing FP4 Mixed-Precision Inference with Petit on AMD Instinct MI250 and MI300 GPUs: A Developer’s Perspective

06 October 2025

Haohui Mai is affiliated with the company CausalFlow.ai.

Read more ...

From Ingestion to Inference: RAG Pipelines on AMD GPUs

02 October 2025

Retrieval-Augmented Generation (RAG) is a machine learning architecture that enhances Large Language Models (LLMs) by combining generation with information retrieval from external sources. It was introduced to address the limitations of traditional LLMs by allowing them to access and utilize up-to-date information from internal and/or external knowledge bases. When a query is received, RAG first retrieves relevant documents or information from its knowledge bases, then uses this retrieved context alongside the query to generate more accurate and informed responses. This approach helps reduce hallucinations (making up information) common in standard LLMs, while also enabling the model to access current information not present in its original training data. RAG has become particularly valuable in enterprise applications, such as customer support systems, research assistants, and documentation tools, where accuracy and verifiable information are crucial.

Read more ...

GPU Partitioning Made Easy: Pack More AI Workloads Using AMD GPU Operator

01 October 2025

Modern AI workloads often don’t utilize the full capacity of advanced GPU hardware, especially when running smaller models or during development phases. The AMD GPU partitioning feature addresses this challenge by allowing you to divide physical GPUs into multiple virtual GPUs, dramatically improving resource utilization and cost efficiency.

Read more ...

Enabling FlashInfer on ROCm for Accelerated LLM Serving

01 October 2025

FlashInfer is an innovative framework designed to accelerate inference of large language models (LLMs). Given the explosive growth and adoption of models like DeepSeek R1, Llama 3, and Qwen 3, efficient inference is critical to meet the demands of real-world deployment. However, challenges such as GPU memory bottlenecks, throughput limitations, and latency remain significant hurdles for deploying these models at scale.

Read more ...

Matrix Core Programming on AMD CDNA™3 and CDNA™4 architecture

30 September 2025

In this blog post, we walk through how to use Matrix Cores in HIP kernels, with a focus on low-precision data types such as FP16, FP8, and FP4, as well as the new family of Matrix Core instructions with exponent block scaling introduced in the AMD CDNA™4 architecture. Through code examples and illustrations, we provide the necessary knowledge to start programming Matrix Cores, covering modern low-precision floating-point types, the Matrix Core compiler intrinsics, and the data layouts required by the Matrix Core instructions.

Read more ...

Coding Agents on AMD GPUs: Fast LLM Pipelines for Developers

30 September 2025

The rapid rise of AI-assisted development is transforming how software is built, with coding agents emerging as powerful tools for modern developers. In this blog, we will show you how to deploy coding agents on AMD GPUs using frameworks such as SGLang, vLLM, and llama.cpp, and walk through a practical workflow example: creating a Minesweeper game using Aider.

Read more ...

Day-0 Support for the SGLang-Native RL Framework - slime on AMD Instinct™ GPUs

25 September 2025

AMD is excited to provide Day-0 support for the SGLang-native RL framework, slime. In this post, we will provide more details about our support and optimizations, as well as slime’s benefits for large-scale RL training. First, we describe the engineering efforts behind slime—including codebase modification, kernel-level memory management for ROCm™ software, and modifications to third-party dependencies (Megatron-LM, SGLang, and torch_memory_saver)—as well as Docker images that enable efficient execution on AMD Instinct™ GPUs. Architecturally, slime supports two training modes: synchronous and asynchronous. Across these modes, we additionally present system-level optimizations with the corresponding use cases. Specifically, in the synchronous setting, our rollout optimizations deliver a 40% throughput improvement over the one without it on AMD Instinct™ GPUs. In the asynchronous setting, we develop a multi-turn RL agent framework to train the kernel generation model. You can also read more about this support in the MLsys – SGLang official blog.

Read more ...

Accelerating Audio-Driven Video Generation: WAN2.2-S2V on AMD ROCm

24 September 2025

Audio-driven video generation is rapidly evolving, opening new possibilities for creative content and intelligent automation. In this blog, we showcase how AMD Instinct MI300X GPUs and the ROCm software stack empower cutting-edge models like Wan2.2-S2V to deliver high-quality, expressive character animation at scale.

Read more ...

A Simple Design for Serving Video Generation Models with Distributed Inference

24 September 2025

Video generation is entering a new era, powered by diffusion models that deliver photorealistic and temporally consistent results from text prompts. Models like Wan2.2 push the boundaries of what’s possible in AI-generated content, but to make them practical, inference performance needs to scale in real-world terms: handling more simultaneous users, keeping response times reasonable, and efficiently using multiple GPUs or compute nodes.

Read more ...

Optimizing Drug Discovery Tools on AMD MI300X Part 1: Molecular Design with REINVENT

19 September 2025

This blog is part of a series of walkthroughs of Life Science AI models, stemming from this article which was a collaborative effort between AstraZeneca and AMD. The series delves into what was required in order to run drug discovery related AI workloads on AMD MI300X. This blog, in particular, looks at REINVENT4, a powerful molecular design tool that leverages advanced algorithms for de novo design, scaffold hopping, R-group replacement, linker design, and molecule optimization.

Read more ...

An Introduction to Primus-Turbo: A Library for Accelerating Transformer Models on AMD GPUs

19 September 2025

With the rapid growth of large-scale models, acceleration libraries are facing higher demands: they must deliver exceptional performance, offer comprehensive functionality, and remain easy to use. To meet these needs, we introduce Primus-Turbo — part of the Primus product family (see our previous blog for background). Primus-Turbo is designed around three core principles: performance, completeness, and ease of use. It supports training, inference, and a wide range of application scenarios, providing developers with a solid foundation to efficiently build and optimize large models on the ROCm platform. See Figure 1 below for a comprehensive stack coverage of Primus-Turbo.

Read more ...

Running SOTA AI-based Weather Forecasting models on AMD Instinct

18 September 2025

Weather Forecasting is a complex scientific problem where immense progress has been made through the Numerical Weather Prediction (NWP) approach using computational fluid dynamics-based models. Forecasting is usually done in three stages: a data assimilation stage where all available data streams at the time \(t\) (sometimes previous times can be used to improve this estimate) are used to estimate the current 3D state of the atmosphere \( S_{t}\) (surface and atmosphere), as parameterized by a number of variables at the current time \( t\), a forecasting stage where the state \(\hat{S}_{t + \delta t}\) for a later time \( t+ \delta t\) (i.e., all the variables at this later time) are forecasted, and a downstream stage where the forecasted state at time \(t + \delta t\) is used to estimate weather variables at more specific times and locations.

Read more ...

AMD-HybridLM: Towards Extremely Efficient Hybrid Language Models

17 September 2025

The rapid rise of deep learning applications has intensified the demand for language models that offer a balance between accuracy and efficiency—especially in settings constrained by memory, compute, or real-time requirements. While Transformer-based models have revolutionized natural language processing, their quadratic attention complexity and large key–value (KV) cache requirements pose serious challenges for deployment, particularly on edge devices or in latency-sensitive environments.

Read more ...

ROCm 7.0: An AI-Ready Powerhouse for Performance, Efficiency, and Productivity

16 September 2025

Artificial intelligence now defines the performance envelope for modern computation. In this blog, we introduce the AI-centric ROCm 7.0 designed to help our community directly benefit from this dramatic paradigm shift. ROCm 7.0 delivers a platform purpose-built for the era of generative AI, large-scale inference and training, and accelerated discovery, helping you boost the performance, efficiency, and scalability of your workloads.

Read more ...

Efficient LLM Serving with MTP: DeepSeek V3 and SGLang on AMD Instinct GPUs

11 September 2025

Speculative decoding has become a key technique for accelerating large language model inference. Its effectiveness, however, relies heavily on creating the right balance between speed and accuracy in the draft model. Recent advances in Multi-Token Prediction (MTP) integrate seamlessly with speculative decoding, enabling the draft model to be more lightweight and consistent with the base model—ultimately making inference both faster and more effective.

Read more ...

Exploring Use Cases for Scalable AI: Implementing Ray with ROCm Support for Efficient ML Workflows

10 September 2025

In this blog, you will learn how to use Ray to easily scale your AI applications from your laptop to multiple AMD GPUs.

Read more ...

Technical Dive into AMD’s MLPerf Inference v5.1 Submission

09 September 2025

In the rapidly evolving landscape of artificial intelligence, the demand for reliable and efficient model inference has never been greater. With advancements in large language models (LLMs) and a growing reliance on real-time applications, benchmarks are critical in evaluating how well AI systems perform under varying conditions. Enter MLPerf Inference: Datacenter v5.1 — a significant update to the well-respected benchmarking suite that assesses inference performance across a wide array of models and use cases, catering especially to data centers.

Read more ...

Slim Down Your Llama: Pruning & Fine-Tuning for Maximum Performance

09 September 2025

In this blog, we demonstrate how quantization, intelligent depth pruning and supervised fine-tuning can dramatically improve the inference performance of Meta’s Llama 3.1 405B model on AMD Instinct MI355X GPUs. By applying quantization and reducing the number of layers from the original 126, we are able to decrease memory requirements and boost token throughput. Additionally, with carefully applied fine-tuning, we maintain high inference accuracy for both RougeL and Exact Match metrics on MLPerf workloads. To see how these optimizations fit into AMD’s broader MLPerf Inference v5.1 efforts, read Reproducing the AMD Instinct™ GPUs MLPerf Inference v5.1 Submission. For a detailed technical breakdown into other optimizations, check out our Technical Dive into AMD’s MLPerf Inference v5.1 Submission.

Read more ...

Reproducing the AMD Instinct™ GPUs MLPerf Inference v5.1 Submission

09 September 2025

MLPerf Inference v5.1 marks AMD’s third round of submissions and the most ambitious yet. This round features submissions on AMD Instinct MI325X and MI355X systems, including multi-node inference and models in MXFP4 datatype. Building upon the success in MLPerf Inference v5.0, AMD has submitted improved results for Llama 2 70B and SDXL on the MI325X platform in this round using new optimization techniques. For a deeper look at these optimizations, see our Technical Dive into AMD’s MLPerf Inference v5.1 Submission. Additionally, explore how we optimized Llama 3.1 405B through pruning and fine-tuning in Slim Down Your Llama: Pruning & Fine-Tuning for Maximum Performance. In addition, AMD has made submissions for the following workloads:

Read more ...

Llama.cpp Meets Instinct: A New Era of Open-Source AI Acceleration

09 September 2025

Llama.cpp is an open source implementation of a Large Language Model (LLM) inference framework designed to run efficiently on diverse hardware configurations, both locally and in cloud environments. Its plain C/C++ implementation ensures a dependency-free setup, allowing it to seamlessly support various hardware architectures across CPUs and GPUs. The framework offers a range of quantization options, including 1.5-bit to 8-bit integer quantization, to achieve faster inference and reduced memory usage. Llama.cpp is part of an active open-source community within the AI ecosystem, with over 1200 contributors and almost 4000 releases on its official GitHub repository as of early August, 2025. Designed as a CPU-first C++ library, llama.cpp offers simplicity and easy integration with other programming environments - making it widely compatible and rapidly adopted across diverse platforms, especially among consumer devices.

Read more ...

GEMM Tuning within hipBLASLt - Part 1

05 September 2025

When optimizing matrix operations on AMD GPUs using the ROCm platform, tuning specific problem sizes is essential for achieving maximum performance. The hipBLASLt library supports two official tuning mechanisms:

Read more ...

Step-3 Deployment Simplified: A Day 0 Developer’s Guide on AMD Instinct™ GPUs

04 September 2025

Today’s large language models (LLMs) still face high decoding costs for long-context reasoning tasks. Step-3 is a 321B-parameter open-source vision-language model (VLM) designed with hardware-aware model–system co-design to minimize decoding costs. With strong support from the open-source community—especially SGLang and Triton—AMD is excited to bring Step-3 to our Instinct™ GPU accelerators.

Read more ...

Unleashing AMD Instinct™ MI300X GPUs for LLM Serving: Disaggregating Prefill & Decode with SGLang

28 August 2025

LLM inference pipelines are hitting a scalability wall as prefill and decode phases compete for the same compute, causing latency spikes and underutilized resources. DistServe tackles this by disaggregating prefill and decode computation across separate GPUs—eliminating interference, decoupling resource planning, and unlocking new levels of optimization for both time-to-first-token (TTFT) and time-per-output-token (TPOT).

Read more ...

QuickReduce: Up to 3x Faster All-reduce for vLLM and SGLang

26 August 2025

Advancements in large-scale language models (LLMs) have led to significant performance breakthroughs across various domains, especially in natural language processing. LLMs typically consist of billions of parameters, resulting in substantial computational, storage, and deployment challenges. Inter-GPU communication overhead often emerges as a key bottleneck limiting overall system performance. In tensor-parallel setups, every layer requires frequent all-reduce operations—synchronizing large amounts of data across GPUs. This introduces significant latency and strains interconnect bandwidth.

Read more ...

AITER-Enabled MLA Layer Inference on AMD Instinct MI300X GPUs

25 August 2025

For developers pushing LLM inference to its limits, efficiency and speed are non-negotiable. DeepSeek-V3’s Multi-head Latent Attention (MLA) layer rethinks traditional attention to cut memory bandwidth pressure while maintaining accuracy. Combined with the matrix absorbed optimization and AMD’s AI Tensor Engine for ROCm (AITER), this can deliver up to 2X faster inference on AMD Instinct™ MI300X GPUs compared to non-AITER runs.

Read more ...

Primus: A Lightweight, Unified Training Framework for Large Models on AMD GPUs

22 August 2025

Training large language models (LLMs) at scale is inherently complex. Different frameworks expose inconsistent interfaces, multi-GPU and distributed setups require brittle scripting, and backend-specific quirks introduce overhead that slows down training iterations. Primus tackles these challenges with a streamlined, backend-agnostic training framework that helps developers launch, customize, and scale training jobs faster on AMD GPUs.

Read more ...

Introducing AMD EVLM: Efficient Vision-Language Models with Parameter-Space Visual Conditioning

22 August 2025

This blog introduces a novel and computationally efficient paradigm for Vision-Language Models (VLMs), which diverges from the conventional method of prepending visual tokens to textual input. Instead of elongating the input sequence, this approach injects visual information directly into the Large Language Model’s (LLM) parameters. It achieves this by using a vision encoder to extract image features and then employing a perceptual weight generator to transform these features into dynamic, low-rank adapter weights. These weights are temporarily integrated with the LLM’s parameters, effectively conditioning the model on the image without increasing the input length. This mechanism allows the model to achieve performance comparable to traditional VLMs on standard benchmarks while significantly reducing computational costs during inference.

Read more ...

DGL in the Real World: Running GNNs on Real Use Cases

20 August 2025

In our previous blog post, we introduced the Deep Graph Library (DGL) and highlighted how its support on the AMD ROCm platform unlocks scalable, performant graph neural networks (GNNs) on AMD GPUs. That post focused on the why — the growing relevance of graph workloads and what it means to bring that capability to AMD’s accelerated computing ecosystem.

Read more ...

Wan2.2 Fine-Tuning: Tailoring an Advanced Video Generation Model on a Single GPU

19 August 2025

This blog post will guide you through fine-tuning Wan2.2 - a state-of-the-art video generation model - on a single AMD Instinct MI300X GPU. By following this guide, you’ll unlock Wan2.2’s advanced video generation capabilities and customize the output — whether in a unique artistic style or a specialized domain — all while running memory efficiently even on a single GPU. Here are some examples of how you can put this guide into practice:

Read more ...

Running ComfyUI on AMD Instinct

19 August 2025

Building workflows for generative AI tasks can of course be done purely in code. However, as the interest in GenAI has soared together with its use in people’s daily lives, more and more people start to search for and explore tools and software for building GenAI workflows that do not require extensive programming knowledge. One such tool is ComfyUI, which provides users with a simple drag and drop UI for building GenAI workflows. This blog post will briefly cover what ComfyUI is, and how you can get it up and running on your AMD Instinct hardware.

Read more ...

All-in-One Video Editing with VACE on AMD Instinct GPUs

19 August 2025

This blog takes a closer look at recent advances in AI-powered video editing, highlighting how modern diffusion models enable users to accomplish various video editing tasks on AMD Instinct GPUs using Alibaba’s VACE model.

Read more ...

Accelerating FastVideo on AMD GPUs with TeaCache

19 August 2025

Video generation is entering a new era, powered by diffusion models that deliver photorealistic and temporally consistent results from text prompts. Models like Wan2.1 push the boundaries of what’s possible in AI-generated content, but to unlock their full potential, inference performance must scale with both model complexity and hardware capabilities.

Read more ...

Introducing Instella-Math: Fully Open Language Model with Reasoning Capability

09 August 2025

AMD is thrilled to introduce Instella-Math, a reasoning-focused language model that marks a major milestone for AMD: as far as we know, it’s the first language model trained with long chain-of-thought reinforcement learning entirely on AMD GPUs. Starting from Instella-3B-Instruct, we extended the model’s capabilities through a multi-stage training pipeline—featuring two stages of supervised fine-tuning and three stages of reinforcement learning using the VERL framework —executed entirely on AMD Instinct™ MI300X GPUs. This blog offers an inside look at the training process and highlights Instella-Math’s performance on challenging reasoning benchmarks, demonstrating the strength of both the model and the hardware behind it.

Read more ...

Running ComfyUI in Windows with ROCm on WSL

07 August 2025

If you have an AMD Radeon™ graphics card supported by AMD ROCm™ software, you can unlock the full potential of your Windows PC with ROCm using the Windows Subsystem for Linux (WSL). Whether you’re loading large models like Stable Diffusion for local use or exploring creative AI applications, this setup offers unprecedented accessibility and power right at your fingertips. In this blog, we provide a step-by-step guide for configuring a WSL-based ROCm environment to run ComfyUI, including driver installation, dependency management, and PyTorch integration optimized for AMD GPUs (see figure 1).

Read more ...

Day 0 Developer Guide: Running the Latest Open Models from OpenAI on AMD AI Hardware

05 August 2025

OpenAI has officially released its open models: gpt-oss-120b and gpt-oss-20b. AMD now provides out-of-the-box, day 0 support for the latest open models from OpenAI, enabling developers to easily fine-tune and deploy across cloud to client environments using AMD hardware, the AMD ROCm™ and AMD Ryzen™ AI software stack, and seamless open source integrations. At AMD, we’re excited to announce day 0 support across our AI hardware, including our flagship AMD Instinct™ MI355X and MI300X GPUs, AMD Radeon™ AI PRO R9700 GPUs, and AMD Ryzen™ AI processors.

Read more ...

AMD Hummingbird Image to Video: A Lightweight Feedback-Driven Model for Efficient Image-to-Video Generation

03 August 2025

In this blog, we present AMD Hummingbird-I2V, a lightweight and feedback-driven image-to-video generation model designed to deliver high-quality results efficiently on resource-constrained hardware. Image-to-video (I2V) generation has become a significant challenge in computer vision, driven by the increasing demand for automated content creation in areas such as digital media production, animation, and advertising. While recent advancements have improved video quality, deploying I2V models in practical scenarios remains challenging due to their large model sizes and high inference costs. For example, DynamiCrafter [1] employs a 1.4B-parameter U-Net and typically requires 50 denoising steps to synthesize a single video. Step-Video [2], a DiT-based model with 30B parameters, takes approximately 30 minutes to generate one video on an AMD Instinct ™ MI250 GPU, making it impractical for latency-sensitive or resource-constrained environments, such as gaming-oriented desktop GPUs. In this work, we present AMD Hummingbird-I2V, a compact and efficient diffusion-based I2V model designed for high-quality video synthesis under limited computational budgets. Hummingbird-I2V adopts a lightweight U-Net architecture with 0.9B parameters and a novel two-stage training strategy guided by reward-based feedback, resulting in substantial improvements in inference speed, model efficiency, and visual quality. To further improve output resolution with minimal overhead, we introduce a super-resolution module at the end of the pipeline. Additionally, we leverage ReNeg [3], an AMD proposed reward-guided framework for learning negative embeddings via gradient descent, to further boost visual quality. As a result, Hummingbird-I2V can generate high-quality 4K video in just 11 seconds with 16 inference steps on an AMD Radeon™ RX 7900 XTX GPU. Quantitative results on the VBench-I2V [4] benchmark show that Hummingbird-I2V achieves state-of-the-art performance among U-Net-based diffusion models and competitive results compared to significantly larger DiT-based models. We provide a detailed analysis of the model architecture, training methodology, and benchmark performance.

Read more ...

GEAK: Introducing Triton Kernel AI Agent & Evaluation Benchmarks

01 August 2025

At AMD, we are pioneering ways to accelerate AI development using AI itself, by generating accurate and efficient GPU kernels. Specifically, we are starting with the automatic generation of kernels in Triton, an open-source Python-like language for writing parallel programming code for GPUs. Today, AMD is excited to announce (a) Generating Efficient AI-centric Kernels (GEAK) for AMD GPUs, and results on (b) two Triton kernel evaluation benchmarks, where we show how AI agents can perform inference-time scaling with frontier LLMs to generate accurate and efficient kernels for AMD Instinct™ GPUs like MI250X and MI300X.

Read more ...

Graph Neural Networks at Scale: DGL with ROCm on AMD Hardware

31 July 2025

03 October 2025

This blog introduces the Deep Graph Library (DGL) and explores its significance on AMD hardware for enabling scalable, performant graph neural networks.

Read more ...

Accelerating Parallel Programming in Python with Taichi Lang on AMD GPUs

31 July 2025

Taichi Lang is an open-source, imperative, parallel programming language for high-performance numerical computation. It is embedded in Python and uses just-in-time (JIT) compiler frameworks (e.g. LLVM) to offload the compute-intensive Python code to the native GPU or CPU instructions. The language has broad applications spanning real-time physical simulation, numerical computation, augmented reality, artificial intelligence, vision and robotics, visual effects in films and games, general-purpose computing, and much more [1].

Read more ...

Benchmarking Reasoning Models: From Tokens to Answers

24 July 2025

This blog shows you how to benchmark large language models’ reasoning tasks by distinguishing between mere token generation and genuine problem-solving. You will learn the importance of configuring models like Qwen3 with “thinking mode” enabled, how standard benchmarks can produce misleading results, why reasoning requires more than just generating tokens quickly, and how to build evaluations that reflect the model’s true problem-solving capabilities. Sounds interesting? Let’s dive right in!

Read more ...

Introducing ROCm-LS: Accelerating Life Science Workloads with AMD Instinct™ GPUs

18 July 2025

AMD is thrilled to announce the early access release of ROCm-LS (ROCm Life Science), a new cutting-edge software toolkit designed to accelerate life science computational workloads on AMD Instinct™ GPUs. ROCm-LS joins ROCm-DS as a part of AMD’s new family of toolkits aimed at providing powerful solutions to real world problems. Similar to ROCm-DS, ROCm-LS is built upon the established ROCm software ecosystem, offering a collection of components and libraries that address the pressing needs of the life science community. The early access release of ROCm-LS enables you to experiment with accelerating your life science workloads, such as digital pathology, automated medical image analysis, and feature extraction and enhancement in large TIFF files on AMD Instinct GPUs. Join us in exploring this tantalizing glimpse into the future capabilities of ROCm-LS, setting the stage for the next evolution in life science computing.

Read more ...

Announcing hipCIM: A Cutting-Edge Solution for Accelerated Multidimensional Image Processing

18 July 2025

In the rapidly evolving landscape of data science and computational imaging, hipCIM 1.0.0 introduces a powerful, GPU-accelerated open-source library that redefines multidimensional image processing for life sciences, biomedical research, and computational imaging. This open-source, accelerated software library redefines how multidimensional datasets are processed, offering unparalleled capabilities across scientific fields such as biomedical imaging, geospatial analytics, material sciences, life sciences, and remote sensing to name a few. With the initial release of hipCIM 1.0.0, AMD enters the arena, ready to push the boundaries of life science research and stand at the forefront of a new era in multidimensional image processing.

Read more ...

Vibe Coding Pac-Man Inspired Game with DeepSeek-R1 and AMD Instinct MI300X

17 July 2025

AI systems have been constrained by their narrow capabilities and limited contextual understanding. Modern large language models (LLMs), such as GPT-4, Claude, DeepSeek, and CodeLlama, are different from previous approaches to AI. LLMs leverage vast datasets and incorporate natural language and code repositories. This enables them to understand natural language syntax, semantics, and programming logic in multiple programming languages (Python, JavaScript, C++, etc.)

Read more ...

Instella-T2I: Open-Source Text-to-Image with 1D Tokenizer and 32× Token Reduction on AMD GPUs

15 July 2025

In this blog, we introduce Instella T2I, text-to-image models in the AMD open-source Instella model family built from scratch on AMD Instinct™ MI300X GPUs. We’ll walk through the model architecture, training pipeline, tokenizer innovations, and how the system scales efficiently across MI300X GPUs. Instella-T2I v0.1 sets a new baseline for scalable, high-resolution open-source text-to-image generation. You will also explore how AMD is helping advance this space—and how you can get started with the model today. In Instella-T2I, we build upon the rapid advancements in large language models (LLMs) and investigate the use of decoder-only models as text encoders in T2I models as shown in Figure 1.

Read more ...

Fine-tuning Robotics Vision Language Action Models with AMD ROCm and LeRobot

14 July 2025

This blog showcases training and deploying robotics policy models on AMD Instinct™ GPUs using ROCm with Hugging Face’s LeRobot framework. Recent advancements in Vision Language Action Models (VLAs) represent a breakthrough in robotics AI, combining computer vision, language understanding, and robotic control into unified architectures that can process visual observations, understand task descriptions, and generate precise motor commands.

Read more ...

Nitro-T: Training a Text-to-Image Diffusion Model from Scratch in 1 Day

09 July 2025

AMD is excited to release Nitro-T, a family of text-to-image diffusion models focused on highly efficient training. Our models achieve competitive scores on image generation benchmarks compared to previous models focused on efficient training while requiring less than 1 day of training from scratch on 32 AMD Instinct MI300X GPUs.

Read more ...

vLLM V1 Meets AMD Instinct GPUs: A New Era for LLM Inference Performance

07 July 2025

vLLM has been a successful LLM inference and serving engine that excels at providing innovative features to users and developers. Earlier this year, the vLLM community introduced a major upgrade of its core engine and architecture to vLLM V1 (V1), which enhances the flexibility and scalability of the engine while retaining its core features. For simplicity, we’ll refer to vLLM V0 as “v0” and vLLM V1 as “V1” throughout this post. To align with the vLLM community’s continuous innovation, the AMD ROCm™ software team and open-source ROCm developers have enabled the fully optimized vLLM V1 engine on AMD GPUs.

Read more ...

Unlocking GPU-Accelerated Containers with the AMD Container Toolkit

03 July 2025

In the rapidly evolving fields of high-performance computing (HPC), artificial intelligence (AI), and machine learning (ML), containerization has become a cornerstone of modern application deployment. Containers provide a lightweight, portable, and scalable way to package applications and their dependencies. The integration of GPUs into these environments has become imperative. However, leveraging GPU acceleration within containers has historically been a complex and error-prone process, particularly when ensuring seamless access to GPU hardware resources.

Read more ...

Enabling Real-Time Context for LLMs: Model Context Protocol (MCP) on AMD GPUs

20 June 2025

The Model Context Protocol (MCP) is an open protocol introduced by Anthropic that standardizes how applications provide context to large language models (LLMs). It enables AI models to interface with various data sources and tools. MCP enhances the integration of LLMs with data and tools by offering pre-built integrations, flexibility in switching between different LLM providers, and ensuring data security best practices.

Read more ...

Fine-Tuning LLMs with GRPO on AMD MI300X: Scalable RLHF with Hugging Face TRL and ROCm

18 June 2025

In this blog, you will learn how to implement GRPO-based RLHF on AMD MI300X using ROCm and Hugging Face TRL—streamlining alignment training while enhancing model reasoning and inference performance. Reinforcement Learning from Human Feedback (RLHF) constitutes a critical phase in the fine-tuning of large language models (LLMs) and multimodal architectures. Over time, RLHF methodologies have advanced beyond traditional techniques, progressing from Proximal Policy Optimization (PPO) to Direct Preference Optimization (DPO), and more recently, to Group Relative Policy Optimization (GRPO). RLHF aims to make LLMs’ output better aligned with human preferences. Reinforcement Learning (RL) is an important step to enhance LLM’s reasoning capabilities and for better inference/test-time scaling law. Apart from LLM, there is also DPO application in text-to-image generation.

Read more ...

Continued Pretraining: A Practical Playbook for Language-Specific LLM Adaptation

18 June 2025

What if you could make a state-of-the-art LLM fluent in a new language—without training from scratch? In this guide, we show how we did just that with Finnish.

Read more ...

Aligning Mixtral 8x7B with TRL on AMD GPUs

12 June 2025

Building a ChatGPT-like assistant is a multi-step process that starts with pre-training a large language model (LLM) on internet-scale data across clusters of thousands of GPUs, resulting in what is known as a “base model”. This base model is then refined through an instruction based supervised fine-tuning (SFT) process, which trains it to function as a useful digital assistant capable of understanding and responding accurately to a wide range of queries. Finally, human preference alignment is applied to enhance the model’s friendliness, helpfulness, and safety, ensuring that interactions are not only informative but also pleasant for users. This combination of techniques creates a sophisticated assistant that is both powerful and user-centric—exemplified by AMD’s new Instella-Long assistant.

Read more ...

Introducing Instella-Long: A Fully Open Language Model with Long-Context Capability

11 June 2025

AMD is excited to announce Instella-Long, a long-context language model continually trained from Instella-3B-Instruct on AMD Instinct™ MI300X GPUs. To our knowledge, Instella-Long makes Instella series the first fully open language model trained from scratch that supports long-context. Instella-Long can support 128K context length and achieve competitive performance outperforming open-weights models such as Phi-3.5-mini [1], Gemma-3-4B [2], and Qwen2.5-3B [3] on the long-context benchmark.

Read more ...

AMD ROCm: Powering the World’s Fastest Supercomputers

10 June 2025

From breaking the exaFLOP barrier with Frontier to setting new performance records with El Capitan, AMD is transforming what’s possible in high-performance computing (HPC). But the story goes beyond hardware. At the core of these world-class systems is ROCm, AMD’s open, high-performance software platform enabling new levels of scientific discovery and AI advancement.

Read more ...

The ROCm Revisited Series

06 June 2025

The ROCm Revisited series aims to revisit key concepts of the AMD ROCm software platform, tools, and optimizations, tailored for beginner and intermediate developers. This series shares our journey through the evolution of ROCm, highlighting the milestones, innovative technologies, and challenges we’ve overcome to establish leadership in the supercomputing space. Each post explores different aspects of ROCm’s development, focusing on how it has transformed industries, particularly in AI, machine learning, and high-performance computing (HPC). Through these blog posts, we’ll also discuss our commitment to open-source development and the future potential of distributed and energy-efficient computing. Below are the three blogs included in the series:

Read more ...

ROCm Revisited: Evolution of the High-Performance GPU Computing Ecosystem

06 June 2025

09 June 2025

This blog is part of our ROCm Revisited series [1]. The purpose of this series is to share the story of ROCm and our journey through the changes and successes we’ve achieved over the past few years. We’ll explore the key milestones in our development, the innovative technologies that have propelled us forward, and the challenges we’ve overcome to establish our leadership in the world of GPU computing.

Read more ...

Reproduce AMD’s MLPerf Training v5.0 Submission Result with Instinct™ GPUs

04 June 2025

In recent years, large language models (LLMs) have transformed the landscape of natural language processing, enabling breakthroughs in tasks ranging from code generation to answering complex questions. Among these, the Llama 2 model family developed by Meta has emerged as a powerful and versatile set of open weight transformer-based models, known for their competitive performance across diverse NLP benchmarks. With model sizes ranging from 7 billion to 70 billion parameters, Llama 2 has quickly become a popular choice for both research and industry after its release in 2023, striking a balance between scalability and efficiency.

Read more ...

AMD’s MLPerf Training Debut: Optimizing LLM Fine-Tuning with Instinct™ GPUs

04 June 2025

MLPerf Training is one of the most influential benchmarks in the AI community, playing a critical role in measuring and advancing the performance of machine learning training across diverse hardware and software platforms. Established to provide a fair, standardized way to evaluate training speed and efficiency on real-world workloads, MLPerf Training has become the chosen standard for researchers, engineers, and organizations striving to test the boundaries of AI capability. By fostering transparency and innovation, it focuses on progression in both academic research and industry applications, helping the community identify the most effective technologies to power the next generation of intelligent systems.

Read more ...

High-Throughput BERT-L Pre-Training on AMD Instinct™ GPUs: A Practical Guide

03 June 2025

This blog showcases an implementation of the BERT-L model on the AMD Instinct™ GPUs using ROCm with advanced optimization including but not limited to mixed precision training, packed datasets, Flash Attention and MLPerf-compliant techniques. BERT (Bidirectional Encoder Representations from Transformers) is a language representation model developed by researchers at Google in 2018. It is based on the Transformer architecture and processes text bidirectionally, which contrasts with traditional models that read text sequentially.

Read more ...

Scale LLM Inference with Multi-Node Infrastructure

30 May 2025

Horizontal scaling of compute resources has become a critical aspect of modern computing due to the ever-increasing growth in data and computational demands. Unlike vertical scaling, which focuses on enhancing an individual system’s resources, horizontal scaling enables the expansion of a system’s capabilities by adding more instances or nodes working in parallel. In this way, it ensures high availability and low latency of the service, making it essential to handle diverse workloads and ensure optimal user experience.

Read more ...

From Theory to Kernel: Implement FlashAttention-v2 with CK-Tile

21 May 2025

In our previous blog, Hands on with CK Tile we walked through how to build a basic GEMM kernel using CK-Tile. In this blog, we will further explore the implementation of a fused kernel, specifically introducing the FlashAttention (FA)-v2 forward kernel. Figure 1 provides an overview of the FlashAttention kernel executions and data movements that occur during the computation of a single thread block of output matrix. Each of the subsequent sections explains details on how to implement this using CK-Tile.

Read more ...

AMD Integrates llm-d on AMD Instinct MI300X Cluster For Distributed LLM Serving

20 May 2025

AMD has successfully deployed the open-source llm-d framework on AMD Kubernetes infrastructure as part of our efforts for distributed large language model inference at scale. It leverages Kubernetes-native toolkit to streamline LLM serving with features like KV-cache-aware routing, distributed scheduling, and integration with Inference Gateway (IGW). In this blog we showcase initial deployment on an AMD cluster with distributed prefill and decode stages on a Llama model.

Read more ...

Step-Video-T2V Inference with xDiT on AMD Instinct MI300X GPUs

15 May 2025

The Stepfun Step-Video-T2V is a 30B parameter state-of-the-art text-to-video (T2V) model capable of generating high-quality videos of up to 204 frames. As video generation advances toward Artificial General Intelligence (AGI), such models play a key role in automating and democratizing video creation. In this blog, we introduce Step-Video-T2V with xDiT running efficiently out-of-the-box on multi-GPU systems powered by AMD Instinct™ MI300X, leveraging high-bandwidth memory and ROCm ™ for fast, scalable video generation.

Read more ...

DataFrame Acceleration: hipDF and hipDF.pandas on AMD GPUs

07 May 2025

In our previous blog CuPy and hipDF on AMD: The Basics and beyond, we explored the fundamentals of hipDF and demonstrated the significant speed up it provides when compared to Pandas for data manipulation tasks, particularly when AMD GPUs are used.

Read more ...

CuPy and hipDF on AMD: The Basics and Beyond

06 May 2025

This blog introduces you to CuPy and hipDF, two GPU-oriented high-performance computing Python libraries. This blog will show you how to deploy CuPy and hipDF on AMD GPUs using ROCm, and demonstrate the advantages of CuPy and hipDF over their traditional CPU-orientated counterparts, NumPy and Pandas.

Read more ...

Power Up Qwen 3 with AMD Instinct: A Developer’s Day 0 Quickstart

28 April 2025

AMD is excited to announce Day 0 support for Alibaba’s latest Large Language Models Qwen3-235B Qwen3-32B Qwen3-30B on AMD Instinct™ MI300X GPU accelerators using vLLM and SGLang. In this blog we show you how to accelerate Alibaba’s cutting-edge Qwen 3 language models, featuring advanced reasoning, multilingual capabilities, and agent functionality, using AMD Instinct™ MI300X GPUs. You will learn to deploy dense and Mixture-of-Experts models with full support for vLLM and SGLang, leveraging AMD’s advanced GPU architecture for high-throughput, low-latency inference.

Read more ...

Beyond Text: Accelerating Multimodal AI Inference with Speculative Decoding on AMD Instinct™ MI300X GPUs

28 April 2025

In the rapidly evolving landscape of artificial intelligence, multimodal models have emerged as powerful tools capable of processing and generating content across different modalities—text, images, audio, and more. Meta’s recent release of the multimodal Llama 4 models, including Llama 4 Scout and Llama 4 Maverick, exemplifies this advancement. Despite their impressive functionalities, such models face significant computational challenges, particularly in generation speed and resource efficiency due to a much larger context length compared to text-only models. Enter speculative decoding: a promising technique that has revolutionized text generation in large language models and is now finding exciting applications in multimodal contexts. Speculative decoding allows AI models to generate outputs faster by speculating several steps ahead and confirming predictions in fewer passes. In this blog you will learn, step-by-step, how speculative decoding can help you unlock significant inference speedups for multimodal systems while maintaining output quality using ROCm on AMD Instinct MI300X GPUs.

Read more ...

Reinforcement Learning from Human Feedback on AMD GPUs with verl and ROCm Integration

24 April 2025

In this blog post, we provide an overview of Volcano Engine Reinforcement Learning for LLMs (verl) and discuss its benefits in large-scale reinforcement learning from human feedback (RLHF). We also detail the modifications made to the codebase to optimize verl’s performance on AMD Instinct GPUs. Next, we walk through the process of building the Docker image using a Dockerfile on the user side, along with training scripts tailored for both single-node and multi-node setups. Lastly, we present verl’s performance results, focusing on throughput and convergence accuracy achieved on AMD Instinct™ MI300X GPUs. Follow this guide to get started with verl on AMD Instinct GPUs and accelerate your RLHF training with ROCm-optimized performance.

Read more ...

A Step-by-Step Guide On How To Deploy Llama Stack on AMD Instinct™ GPU

22 April 2025

As a leader in high-performance computing, AMD empowers AI innovation by providing open-source tools and hardware acceleration for scalable model deployment. In this blog we will show you how this foundation can be leveraged to deploy Meta’s LLMs efficiently on AMD Instinct™ GPUs. Meta’s Llama series has democratized access to large language models, empowering developers worldwide. The Llama Stack—Meta’s all-in-one deployment framework—extends this vision by enabling seamless transitions from research to production through built-in tools for optimization, API integration, and scalability. This unified platform is ideal for teams requiring robust support to deploy Meta’s models at scale across diverse applications.

Read more ...

ROCm 6.4: Breaking Barriers in AI, HPC, and Modular GPU Software

11 April 2025

In the rapidly evolving landscape of high-performance computing and artificial intelligence, innovation is the currency of progress. AMD’s ROCm 6.4 isn’t just another software update—it’s a leap forward that redefines the boundaries of what is possible for AI, developers, researchers, and enterprise innovators.

Read more ...

Unlock Peak Performance on AMD GPUs with Triton Kernel Optimizations

10 April 2025

Triton is a domain-specific programming language designed to simplify GPU programming for high-performance tasks, particularly in AI applications. It provides an open-source environment that enables users to write high-level Triton code with greater productivity compared to Nvidia CUDA or AMD HIP. The Triton compiler translates Triton code into optimized GPUs instructions, effectively compiling tensor operations into low-level GPU code. It achieves high efficiency through multiple optimizations passes and leverages the underlying architecture of the GPU. To optimize GPU performance, it is important to have a solid understanding of the Triton compiler and the role it plays in kernel performance. In this blog, we will deep dive into the AMD Triton compiler, introduce Triton kernel compilation, and provide insights on how to create an efficient Triton kernel code.

Read more ...

Shrink LLMs, Boost Inference: INT4 Quantization on AMD GPUs with GPTQModel

09 April 2025

GPTQ (Generalized Post Training Quantization) is a technique for compressing Large Language Models (LLMs) after they have been fully trained by reducing their numerical precision. The objective of compressing the model is to reduce its memory footprint and computational requirements, making it easier to deploy it on hardware with limited resources.

Read more ...

Power Up Llama 4 with AMD Instinct: A Developer’s Day 0 Quickstart

06 April 2025

08 April 2025

AMD is excited to announce Day 0 support for Meta’s latest leading multimodal intelligence Models — the Llama 4 Maverick and Scout models — on our AMD Instinct™ MI300X and MI325X GPU accelerators using vLLM. In this blog we will walk you through a step-by-step guide on deploying Meta’s Llama4 model using vLLM, docker setup, dependencies, and inference testing.

Read more ...

Reproducing the AMD Instinct™ GPUs MLPerf Inference v5.0 Submission

02 April 2025

Building upon the success of our MLPerf Inference v4.1 submission, AMD has submitted results for two popular models – Llama 2 70B and Stable Diffusion XL (SDXL) – in the MLPerf Inference v5.0 round. This blog post provides a comprehensive, step-by-step guide on reproducing the results of AMD’s MLPerf submission using ROCm and the AMD Instinct™ MI325X GPUs. Please follow along to independently verify these results and gain hands-on experience with the benchmarking process. If you are interested in learning more about the advanced optimization strategies behind our Llama 2 70B and SDXL inference, from quantization and General Matrix Multiplication (GEMM) tuning to cutting-edge vLLM scheduling and platform enhancements, check out our blog on MLPerf Inference v5.0 optimization strategies.

Read more ...

AMD Instinct™ MI325X GPUs Produce Strong Performance in MLPerf Inference v5.0

02 April 2025

AI transformation and its ever-increasing demands of GenAI, LLMs, reasoning models and new advances in inference and training emphasize the need for innovative GPU architectures and products designed and delivered at an accelerated pace. Understanding the performance of AI models on these GPUs is critical for continuous advances in AI deployments and adoption. However, benchmarking AI models is challenging due to their inherent complexity and variety of possible deployments and tasks. Approaching this problem from a cross-industry perspective is preferable to have a benchmark that is comparable across different platforms and vendors. MLPerf is such a benchmark created by a cross-industry MLCommons consortium of which AMD is a founding member.

Read more ...

Bring FLUX to Life on MI300X: Run and Optimize with Hugging Face Diffusers

28 March 2025

AI based text-to-image generation is pushing the boundaries of creative and visual storytelling, enabling the critical mass to draw like an artist. Stability AI introduced stable diffusion models which was a breakthrough in text to image generation. However, FLUX - a new state-of-the-art open-source model released by Black Forest Labs, is gaining popularity for its flexibility and controllability.

Read more ...

Accelerating LLM Inference: Up to 3x Speedup on MI300X with Speculative Decoding

27 March 2025

In this blog you will learn how speculative decoding boosts LLM inference, providing out-of-the-box speedups in LLM token generation on the AMD Instinct™ MI300X GPU. We start the blog by providing you with a brief overview of Speculative Decoding. We then demonstrate, through extensive benchmarking on a number of LLMs and datasets, as well as on different frameworks viz. vLLM and native PyTorch (gpt-fast), speedups in the range of 1.3x - 3x in the LLM generation throughput (tokens/second) through speculative decoding as compared to running a vanilla LLM for batch size 1. We show you how these speedups vary for batch sizes greater than 1 in vLLM. Finally, we will share a detailed profiling-based case study to identify some high-level differences between these two frameworks, i.e. the type of kernels that are launched and their overall latencies, which are critical differentiators between the performance of these frameworks. Let’s get started!

Read more ...

Introducing ROCprofiler SDK - The Latest Toolkit for Performance Profiling

25 March 2025

Profiling is the backbone of performance optimization in AI and HPC workloads, enabling developers to extract maximum efficiency from AMD Instinct™ GPUs. With ROCm’s rapid evolution, the need for a unified, scalable, and extensible profiling framework has never been more critical. The new ROCprofiler-SDK framework represents a significant step forward in profiling capabilities, offering enhanced features, streamlined integration, and a better user experience while also solving past limitations with former profiler interface versions. This guide aims to help users seamlessly transition from legacy profiling tools to the ROCprofiler-SDK infrastructure. We will explore new features, highlight key differences from previous tools, and provide actionable steps for a smooth migration.

Read more ...

Introducing ROCprofiler SDK - The Latest Toolkit for Performance Profiling

25 March 2025

Profiling is the backbone of performance optimization in AI and HPC workloads, enabling developers to extract maximum efficiency from AMD Instinct™ GPUs. With ROCm’s rapid evolution, the need for a unified, scalable, and extensible profiling framework has never been more critical. The new ROCprofiler-SDK framework represents a significant step forward in profiling capabilities, offering enhanced features, streamlined integration, and a better user experience while also solving past limitations with former profiler interface versions. This guide aims to help users seamlessly transition from legacy profiling tools to the ROCprofiler-SDK infrastructure. We will explore new features, highlight key differences from previous tools, and provide actionable steps for a smooth migration.

Read more ...

Speculative Decoding - Deep Dive

24 March 2025

Nowadays, LLM serving has become an increasingly popular service in the technology industry, with thousands of requests being sent to LLM servers, and responses generated and sent back to clients all over the world. The performance of online serving, as one of the key metrics to evaluate its user experience and service quality, has grabbed attention from both of the industry and academia.

Read more ...

Efficient MoE training on AMD ROCm: How-to use Megablocks on AMD GPUs

23 March 2025

Training massive deep-learning models requires a balance of efficiency and scalability. In the context of the Transformers architecture, Mixture of Experts (MoE) models are massive machine learning architectures characterized for dividing tasks among multiple specialized sub-networks or “experts”. A gating network determines the expert to which a given input should be routed, enabling the model to handle complex tasks more efficiently by using the specialized capabilities of each expert. This dynamic routing mechanism allows MoE models to scale efficiently, activating only a subset of the network for each input, therefore reducing computational load while maintaining high model capacity.

Read more ...

Supercharge DeepSeek-R1 Inference on AMD Instinct MI300X

21 March 2025

Our previous blog post on this topic discussed how DeepSeek-R1 achieves competitive performance on AMD Instinct™ MI300X GPUs. We also included performance comparisons against Nvidia H200 GPUs and a short demo application illustrating real-world usage. In this blog we will delve into how using the SGLang framework, critical kernel optimizations like AI Tensor Engine for ROCm™, and hyperparameter tuning helps to achieve performance boosts.

Read more ...

AITER: AI Tensor Engine For ROCm

21 March 2025

24 March 2025

Performance optimization is critical when working with GPUs, especially for tasks involving artificial intelligence, which can be extremely demanding. To fully leverage the capabilities of advanced hardware, it’s essential to master optimization strategies and ensure every available resource is utilized efficiently. In this blog we will provide an overview of AMD’s AI Tensor Engine for ROCm (AITER) and show you how easy it is to integrate AITER kernels in basic LLM training and inference workload. AITER helps developers to focus on creating operators while allowing customers to seamlessly integrate this operator collection into their own private, public, or any custom framework.

Read more ...

Deploying Google’s Gemma 3 Model with vLLM on AMD Instinct™ MI300X GPUs: A Step-by-Step Guide

14 March 2025

AMD is excited to announce the integration of Google’s Gemma 3 models with AMD Instinct MI300X GPUs, optimized for high-performance inference using the vLLM framework. This collaboration empowers developers to harness advanced AMD AI hardware for scalable, efficient deployment of state-of-the-art language models. In this blog we will walk you through a step-by-step guide on deploying Google’s Gemma 3 model using vLLM on AMD Instinct GPUs, covering Docker setup, dependencies, authentication, and inference testing. Remember, the Gemma 3 model is gated—ensure you request access before beginning deployment.

Read more ...

Analyzing the Impact of Tensor Parallelism Configurations on LLM Inference Performance

14 March 2025

As AI models continue to scale in size and complexity, deploying them efficiently requires strategic resource allocation. Tensor parallelism (TP) is a valuable technique for distributing workloads across multiple GPUs, reducing memory constraints, and enabling inference for large-scale models. However, the choice of TP configuration isn’t one-size-fits-all—it directly impacts performance, networking overhead, and cost efficiency.

Read more ...

Optimized ROCm Docker for Distributed AI Training

13 March 2025

This blog will introduce you to the updated AMD Docker image, specifically built and optimized for distributed training. As you will see, the optimized AMD ROCm Docker image makes training large AI models faster and more efficient. It includes updates such as better fine-tuning tools, improved performance for multi-GPU setups, and support for FP8 precision, which helps speed up training while using less memory, and can provide you with an overall smoother and more efficient training experience on popular models such as Flux and Llama 3.1 running on AMD GPUs.

Read more ...

AI Inference Orchestration with Kubernetes on Instinct MI300X, Part 3

13 March 2025

Welcome back to the final part of our series! So far, we’ve successfully setup up a Kubernetes cluster and installed the AMD GPU Operator to seamlessly integrate AMD hardware with Kubernetes in Part 1. We’ve deployed vLLM on AMD Instinct MI300X GPUs, exposed it using MetalLB, and scaled it efficiently in Part 2.

Read more ...

AMD Advances Enterprise AI Through OPEA Integration

12 March 2025

AMD is excited to support Open Platform for Enterprise AI (OPEA) to simplify and accelerate enterprise AI adoption. With the enablement of OPEA GenAI framework on AMD ROCm™ software stack, businesses and developers can now create scalable, efficient GenAI applications on AMD data center GPUs. Enterprises today face significant challenges when deploying AI at scale, including the complexity of integrating GenAI models, managing GPU resources, ensuring security, and maintaining workflow flexibility. AMD and OPEA aim to address these challenges and streamline AI adoption. This blog will explore the significance of this collaboration, AMD’s contribution to the OPEA project, and demonstrate how to deploy a code translation OPEA GenAI use case on the AMD Instinct™ MI300X GPU.

Read more ...

Instella-VL-1B: First AMD Vision Language Model

07 March 2025

As part of AMD’s newly released Instella family we are thrilled to introduce Instella-VL-1B, the first AMD vision language model for image understanding trained on AMD Instinct™ MI300X GPUs. Our journey with Instella-VL builds upon our previous 1-billion-parameter language models, AMD OLMo SFT. We further extend the language model’s visual understanding abilities by connecting it with a vision encoder (which is initialized from CLIP ViT-L/14-336). During training, we jointly finetune vision encoder and language model with vision-language data in three stages: Alignment, Pretraining and Supervised-Finetuning (SFT).

Read more ...

Introducing Instella: New State-of-the-art Fully Open 3B Language Models

05 March 2025

AMD is excited to announce Instella, a family of fully open state-of-the-art 3-billion-parameter language models (LMs) trained from scratch on AMD Instinct™ MI300X GPUs. Instella models outperform existing fully open models of similar sizes and achieve competitive performance compared to state-of-the-art open-weight models such as Llama-3.2-3B, Gemma-2-2B, and Qwen-2.5-3B, including their instruction-tuned counterparts.

Read more ...

Measuring Max-Achievable FLOPs – Part 2

28 February 2025

In our previous blog post, we explored the conceptual differences between Peak FLOPs and Max-Achievable FLOPs (MAF), explaining why the gap between these metrics has widened with modern ML-optimized hardware. This second installment provides a detailed methodology for measuring MAF on AMD GPUs, including the specific environmental conditions, matrix size optimization techniques, and tools required for accurate measurement. We present the actual MAF results for AMD Instinct MI300X and MI325X GPUs across different precision formats (FP16, BF16, and FP8) along with their corresponding median frequencies. We also explain how software efficiency and frequency management impact MAF, and demonstrate why boost clock capabilities remain important for latency-sensitive workloads such as LLM inference with small batch sizes.

Read more ...

Deploying Serverless AI Inference on AMD GPU Clusters

25 February 2025

Deploying Large Language Models (LLMs) in enterprise environments presents a multitude of challenges that organizations must navigate to harness their full potential. As enterprises expand their AI and HPC workloads, scaling the underlying compute and GPU infrastructure presents numerous challenges, including deployment complexities, resource optimization, and effective management of the compute resource fleet. In this blog, we will walk you through how to spin-up production-grade Serverless AI inference service on Kubernetes clusters by leveraging open source Knative/KServe technologies.

Read more ...

Unlock DeepSeek-R1 Inference Performance on AMD Instinct™ MI300X GPU

21 February 2025

In this blog, we explore how DeepSeek-R1 achieves competitive performance on AMD Instinct™ MI300X GPUs, along with performance comparisons to H200 and a short demo application showcasing real-world usage. By leveraging MI300X, users can deploy DeepSeek-R1 and V3 models on a single node with impressive efficiency. In just two weeks, optimizations using SGLang have unlocked up to a 4X boost in inference speed, ensuring efficient scaling, lower latency, and optimized throughput. The MI300X’s high-bandwidth memory (HBM) and compute power enable execution of complex AI workloads, handling longer sequences and demanding reasoning tasks. With AMD and the SGLang community driving ongoing optimizations—including fused MoE kernels, MLA kernel fusion, and speculative decoding—MI300X is set to deliver an even more powerful AI inference experience.

Read more ...

How to Build a vLLM Container for Inference and Benchmarking

21 February 2025

Welcome back! If you’ve been following along with this series, you’ve already learned about the basics of ROCm containers. Today, we’ll build on that foundation by creating a container for large language model inference with vLLM.

Read more ...

Fine-tuning Phi-3.5-mini LLM at scale: Harnessing Accelerate and Slurm for multinode training

19 February 2025

In this blog you will learn the process of fine-tuning the Phi-3.5-mini-instruct Large Language Model (LLM) from Microsoft, using PyTorch in a multinode environment. The setup leverages the Hugging Face Accelerate library to handle the complexities of multi-GPU and multinode synchronization. Slurm is used to schedule and coordinate the job as a workload manager for high-performance computing environments. A custom Slurm Bash script launches the Docker containers on each node, ensuring the training environment is consistent across all machines. Inside the containers, PyTorch and the Accelerate library split the training data, synchronize the model updates, and optimize performance across the multinode setup. This approach lets you efficiently fine-tune large-scale models and reduce training time while maximizing hardware utilization across the entire cluster.

Read more ...

Understanding Peak, Max-Achievable & Delivered FLOPs, Part 1

14 February 2025

The purpose of this blog post is to provide information on the differences between Peak FLOPs and Max-achievable FLOPs. After reading, users will know how AMD measures maximum delivered performance, and how AMD recommends measured device performance is used.

Read more ...

AI Inference Orchestration with Kubernetes on Instinct MI300X, Part 2

14 February 2025

Welcome to Part 2 of our series on utilizing Kubernetes with the AMD Instinct platform! If you’re just joining us, we recommend checking out Part 1 where we covered setting up your Kubernetes cluster and enabling AMD GPU support.

Read more ...

Navigating vLLM Inference with ROCm and Kubernetes

13 February 2025

Kubernetes (often abbreviated as K8s) is an open-source platform designed for automating the deployment, scaling, and management of containerized applications. Developed by Google and now maintained by the Cloud Native Computing Foundation, Kubernetes enables developers to build, run, and manage applications across any infrastructure.

Read more ...

PyTorch Fully Sharded Data Parallel (FSDP) on AMD GPUs with ROCm

09 February 2025

PyTorch Fully Sharded Data Parallel (FSDP) is a data parallelism technique that enables the training of large-scale models in a memory-efficient manner. FSDP achieves this memory efficiency by sharding model parameters, optimizer states, and/or gradients across GPUs, reducing the memory footprint required by each GPU. This enables the training of large-scale models with lower total GPU memory than DDP (Distributed Data Parallel), in which the model weights and optimizer states are replicated across all processes. To learn more about DDP, refer to Distributed Data Parallel (DDP) training on AMD GPU with ROCm.

Read more ...

AI Inference Orchestration with Kubernetes on Instinct MI300X, Part 1

07 February 2025

As organizations scale their AI inference workloads, they face the challenge of efficiently deploying and managing large language models across GPU infrastructure. This three-part blog series provides a production-ready foundation for orchestrating AI inference workloads on the AMD Instinct platform with Kubernetes.

Read more ...

GEMM Kernel Optimization For AMD GPUs

06 February 2025

Matrix multiplication underlies critical computational pathways in AI, with General Matrix Multiplication (GEMM) operations serving as performance-critical kernels in neural network architectures. From fully connected layers to convolutions and transformer attention mechanisms, GEMMs consume substantial computational and memory resources in large language models (LLMs). This blog explores GEMM optimization techniques for AMD GPUs, demonstrating methodologies to significantly enhance computational efficiency and performance scaling.

Read more ...

Enhancing AI Training with AMD ROCm Software

31 January 2025

ROCm™ has emerged as a premier open software stack designed to address the evolving needs of AI and machine learning workloads. Built for inference and training, ROCm delivers leadership performance, empowering developers and organizations to optimize their workloads for efficiency, scalability, and cost-effectiveness.

Read more ...

Best practices for competitive inference optimization on AMD Instinct™ MI300X GPUs

29 January 2025

Optimizing LLM performance on GPUs is challenging due to diverse model needs, memory constraints, and balancing latency and throughput. This document examines how hardware utilization, memory and communication bandwidth and scaling, contribute to inference performance, detailing optimal configurations for AMD Instinct™ MI300X GPUs.

Read more ...

Distributed fine-tuning of MPT-30B using Composer on AMD GPUs

28 January 2025

Composer, developed by MosaicML, is an open-source deep learning training library built on top of PyTorch, designed to simplify and optimize distributed training workflows. It supports scalable training on multiple nodes and efficiently handles datasets of various sizes. Composer integrates advanced techniques such as PyTorch Fully Sharded Data Parallelism (FSDP), elastic sharded checkpointing, training callbacks, and speed-up algorithms to enhance training performance and flexibility. It closely resembles PyTorch’s torchrun and has demonstrated exceptional efficiency when scaling to hundreds of GPUs.

Read more ...

Vision Mamba on AMD GPU with ROCm

24 January 2025

State Space Models (SSMs), such as Mamba, have emerged as a potential alternative to Transformer models. Vision backbones using only SSMs have yielded promising results. For more information about SSMs and Mamba’s performance on AMD hardware, see Mamba on AMD GPUs with ROCm. This blog explores Vision Mamba (Vim), an innovative and efficient backbone for vision tasks and evaluate its performance on AMD GPUs with ROCm. We’ll start with a brief introduction to Vision Mamba, followed by a step-by-step guide on training and running inference with Vision Mamba on AMD GPUs using ROCm.

Read more ...

Getting started with AMD ROCm containers: from base images to custom solutions

16 January 2025

Having worked in technology for over two decades, I’ve witnessed firsthand how containerization has transformed the way we develop and deploy applications. Containers package applications with their dependencies into standardized units, making software portable and consistent across different environments. When we combine this containerization power with AMD Instinct™ Accelerators, we get a powerful solution for quickly deploying AI and machine learning workloads. In this blog, the first in a series exploring ROCm containerization, I want to share my insights about AMD ROCm™ containers and show you how to build and customize your own GPU-accelerated workloads. You’ll learn how to select appropriate base images, modify containers for your specific needs, and implement best practices for GPU-enabled containerization - all with hands-on examples you can try yourself.

Read more ...

Triton Inference Server with vLLM on AMD GPUs

08 January 2025

Triton Inference Server is an open-source platform designed to streamline AI inferencing. It supports the deployment, scaling, and inference of trained AI models from various machine learning and deep learning frameworks including Tensorflow, PyTorch, and vLLM, making it adaptable for diverse AI workloads. It is designed to work across multiple environments, including cloud, data centers and edge devices.

Read more ...

Training Transformers and Hybrid models on AMD Instinct MI300X Accelerators

10 December 2024

This blog is contributed by Zyphra: a Palo Alto-based AI research lab and AMD Instinct Partner.

Read more ...

Transformer based Encoder-Decoder models for image-captioning on AMD GPUs

03 December 2024

Image captioning, or the GenAI-based automatic generation of concise textual descriptions of images, has immensely important real-world applications. For example, image captioning can provide visually impaired users with textual descriptions of images for improved accessibility, image captioning can add textual descriptions to products in e-commerce applications and help children map images to their textual descriptions in early childhood educational apps. Image captioning can automatically describe objects and events in security camera footage in surveillance applications and can enable robots to auto-generate textual captions for objects and events they encountered in human-robot interaction (HRI) applications, and many more applications. Image captioning is a sequence-to-sequence (seq2seq) machine learning task: a model converting a sequence from one domain (in this case, the image), to another (its textual description). In image captioning the image is partitioned into a sequence of patches. This sequence of image patches is then converted by the model to a corresponding sequence of text tokens.

Read more ...

SGLang: Fast Serving Framework for Large Language and Vision-Language Models on AMD Instinct GPUs

13 November 2024

In the rapidly evolving landscape of artificial intelligence, the ability to deploy large language models (LLMs) and vision-language models (VLMs) efficiently is crucial for real-time applications. SGLang is an open-source framework designed to meet these demands by delivering fast backend runtime, a flexible frontend language, and extensive model support for a variety of LLMs and VLMs.

Read more ...

Quantized 8-bit LLM training and inference using bitsandbytes on AMD GPUs

13 November 2024

In this blog post we will cover the bitsandbytes 8-bit representations. As you will see, the bitsandbytes 8-bit representations significantly help reduce the memory needed for fine-tuning and inferencing LLMs. There are many quantization techniques used in the field to decrease a model size, but bitsandbytes offers quantization to decrease the size of optimizer states as well. This post will help you understand the basic principles underlying the bitsandbytes 8-bit representations, explain the bitsandbytes 8-bit optimizer and LLM.int8 techniques, and show you how to implement these on AMD GPUs using ROCm.

Read more ...

Distributed Data Parallel Training on AMD GPU with ROCm

01 November 2024

With the increase in complexity and size of machine learning models, the demand for computational resources grows. Training on a single GPU can become a bottleneck for deep learning applications, especially with large datasets and models that are slow to train on a single GPU. Parallelized training addresses this challenge. Out of the various forms of parallelized training, this blog focuses on Distributed Data Parallel (DDP), a key feature in PyTorch that accelerates training across multiple GPUs and nodes.

Read more ...

Torchtune on AMD GPUs How-To Guide: Fine-tuning and Scaling LLMs with Multi-GPU Power

24 October 2024

This blog provides a thorough how-to guide on using Torchtune to fine-tune and scale large language models (LLMs) with AMD GPUs. Torchtune is a PyTorch library designed to let you easily fine-tune and experiment with LLMs. Using Torchtune’s flexibility and scalability, we show you how to fine-tune the Llama-3.1-8B model for summarization tasks using the EdinburghNLP/xsum dataset. Using LoRA(Low-Rank Adaptation), a parameter-efficient fine-tuning technique, Torchtune enables efficient training while maintaining performance across a different number of GPUs (2, 4, 6, and 8). This post also highlights how Torchtune’s distributed training capabilities allow users to scale up LLM fine-tuning on multiple GPUs to reduce training time while maintaining the quality of the trained model, demonstrating its potential and usage on modern AMD hardware using ROCm.

Read more ...

CTranslate2: Efficient Inference with Transformer Models on AMD GPUs

24 October 2024

Transformer models have revolutionized natural language processing (NLP) by delivering high-performance results in tasks like machine translation, text summarization, text generation, and speech recognition. However, deploying these models in production can be challenging due to their high computational and memory requirements. CTranslate2 addresses these challenges by providing a custom runtime that implements various optimization techniques to accelerate Transformer models during inference.

Read more ...

Inference with Llama 3.2 Vision LLMs on AMD GPUs Using ROCm

23 October 2024

Meta’s Llama models now support multimodal capabilities, expanding their functionality beyond traditional text-only applications. The Llama 3.2 models are available in a range of sizes, including medium-sized 11B and 90B multimodal models for vision-text reasoning tasks, and lightweight 1B and 3B text-only models designed for edge and mobile devices.

Read more ...

Speed Up Text Generation with Speculative Sampling on AMD GPUs

15 October 2024

24 February 2025

As the size of transformer models grow, so does the cost of conducting inference, impacting latency and throughput. Compression methods such as quantization and distillation, as well as hardware-aware optimizations such as Flash Attention and Triton, have been proposed to cut down the computation cost at different levels. However, these models either compromise on accuracy or require major changes to the model implementation.

Read more ...

Multinode Fine-Tuning of Stable Diffusion XL on AMD GPUs with Hugging Face Accelerate and OCI’s Kubernetes Engine (OKE)

15 October 2024

As the scale and complexity of generative AI and deep learning models grow, multinode training, basically dividing a training job across several processors, has become an essential strategy to speed up training and fine-tuning processes of large generative AI models like SDXL. By distributing the training workload across multiple GPUs on multiple nodes, multinode setups can significantly accelerate the training process. In this blog post we will show you, step-by step, how to set-up and fine-tune a Stable Diffusion XL (SDXL) model in a multinode Oracle Cloud Infrastructure’s (OCI) Kubernetes Engine (OKE) on AMD GPUs using ROCm.

Read more ...

Supercharging JAX with Triton Kernels on AMD GPUs

09 October 2024

Ready to supercharge your deep learning applications on AMD GPUs? In this blog, we’ll show you how to develop a custom fused dropout activation kernel for matrices in Triton, seamlessly call it from JAX, and benchmark its performance with ROCm. This powerful combination will take your model’s performance to the next level.

Read more ...

Leaner LLM Inference with INT8 Quantization on AMD GPUs using PyTorch

03 October 2024

With the scale of large language models (LLMs) reaching hundred of billions of parameters, the ways we represent data within these enormous models dramatically impacts the resources required to train them (e.g. the number of GPUs needed for inference). In our previous blogs (JAX mixed precision training; PyTorch AMP), we already demonstrated how mixed precision training can accelerate LLMs training process. In this blog post we will push things further and show you how quantization into an even lower precision data formats can speed up inference, saving time and memory, without sacrificing the overall performance of the model. Quantization is a technique where the precision of a model’s parameters is reduced from a 32-bit floating point (FP32) or a 16-bit floating point (FP16) to an 8-bit integer (INT8). Standard models typically use 32-bit floating-point (FP32) precision. However, this higher precision is not always necessary for inference tasks. By converting model weights and activations to lower precision formats like INT8 (8-bit integer), we can achieve faster computations and lower memory usage, effectively reducing the model size by three-fourths (from 32-bit) or half (from 16-bit) with only a slight accuracy reduction, which is often outweighed by the speed gains.

Read more ...

Fine-tuning Llama 3 with Axolotl using ROCm on AMD GPUs

23 September 2024

Large Language Models (LLMs) have revolutionized the field of natural language processing, enabling machines to understand and generate human-like language. However, these models are often trained on vast amounts of general-purpose data, which can make them less effective for specific tasks or domains. Fine-tuning involves training a pre-trained LLM on a specialized dataset to enhance its performance on specific tasks. As Andrej Karpathy analogized, this process is akin to allowing someone to practice a particular skill. Just as a person might need to practice a skill in a specific context to become proficient, an LLM needs to be fine-tuned on a specific dataset to become proficient in a particular task. For instance, an LLM can be fine-tuned for tasks such as financial forecasting, technical support, legal advising, medical diagnosis, or even instruction following. By fine-tuning an LLM, organizations can achieve better results and improve information security by limiting the exposure of sensitive data.

Read more ...

Inferencing and serving with vLLM on AMD GPUs

19 September 2024

09 June 2025

In the rapidly evolving field of artificial intelligence, Large Language Models (LLMs) have emerged as powerful tools for understanding and generating human-like text. However, deploying these models efficiently at scale presents significant challenges. This is where vLLM comes into play. vLLM is an innovative open-source library designed to optimize the serving of LLMs using advanced techniques. Central to vLLM is PagedAttention, a novel algorithm that enhances the efficiency of the model’s attention mechanism by managing it as virtual memory. This approach optimizes GPU memory utilization, facilitating the processing of longer sequences and enabling more efficient handling of large models within existing hardware constraints. Additionally, vLLM incorporates continuous batching to maximize throughput and minimize latency. By leveraging these cutting-edge techniques, vLLM significantly improves the performance and scalability of LLM deployment, allowing organizations to harness the power of state-of-the-art AI models more effectively and economically.

Read more ...

Enhancing vLLM Inference on AMD GPUs

19 September 2024

09 June 2025

In this blog, we’ll demonstrate the latest performance enhancements in vLLM inference on AMD Instinct accelerators using ROCm 6.2. In a nutshell, vLLM optimizes GPU memory utilization, allowing more efficient handling of large language models (LLMs) within existing hardware constraints, maximizing throughput and minimizing latency. We start the blog by briefly explaining how causal language models like Llama 3 and ChatGPT generate text, motivating the need to enhance throughput and reduce latency. If you’re new to vLLM, we also recommend reading our introduction to Inferencing and serving with vLLM on AMD GPUs. ROCm 6.2 introduces support for the following vLLM features which we will use in this blog post.

Read more ...

Optimize GPT Training: Enabling Mixed Precision Training in JAX using ROCm on AMD GPUs

06 September 2024

This blog builds on the nanoGPT model we discussed in A Guide to Implementing and Training Generative Pre-trained Transformers (GPT) in JAX on AMD GPUs. Here we will show you how to incorporate mixed precision training to the JAX-implemented nanoGPT model we discussed in our previous blog.

Read more ...

Image Classification with BEiT, MobileNet, and EfficientNet using ROCm on AMD GPUs

03 September 2024

Image classification is a key task in computer vision aiming at “understanding” an entire image. The outcome of an image classifier is a label or a category for the image as a whole, unlike object recognition where the task is to detect and classify multiple objects within an image.

Read more ...

Benchmarking Machine Learning using ROCm and AMD GPUs: Reproducing Our MLPerf Inference Submission

28 August 2024

Measuring the performance of new technologies is as old as human history, and often as intriguing (consider for example that we still compare the performance of new electric vehicle motors using horsepower). In the rapidly advancing field of machine learning (ML) MLPerf was established by MLCommons on May 2nd 2018 and quickly became the golden standard of measuring the accuracy, speed, and efficiency of AI. MLPerf provides benchmarks on training, HPC and Inference performance. Companies across the industry use MLPerf submissions to evaluate the performance of various GPUs and software platforms, and make their technology adoption decisions based on these results.

Read more ...

Performing natural language processing tasks with LLMs on ROCm running on AMD GPUs

21 August 2024

In this blog you will learn how to use ROCm, running on AMD’s Instinct GPUs, for a range of popular and useful natural language processing (NLP) tasks, using different large language models (LLMs). The blog includes a simple to follow hands-on guide that shows you how to implement LLMs for core NLP applications ranging from text generation and sentiment analysis to extractive question answering (QA), and solving a math problem.

Read more ...

Using AMD GPUs for Enhanced Time Series Forecasting with Transformers

19 August 2024

Time series forecasting (TSF) is a key concept in fields such as signal processing, data science, and machine learning (ML). TSF involves predicting future behavior of a system by analyzing its past temporal patterns, using historical data to forecast future data points. Classical approaches to TSF relied on a variety of statistical methods. Recently, machine learning techniques have been increasingly used for TSF, generating discussions within the community about whether these modern approaches outperform the classical statistical ones (see: Are Transformers Effective for Time Series Forecasting? and Yes, Transformers are Effective for Time Series Forecasting (+ Autoformer)).

Read more ...

Inferencing with Grok-1 on AMD GPUs

09 August 2024

We demonstrate that the massive Grok-1 model from xAI can run seamlessly on the AMD MI300X GPU accelerator by leveraging the ROCm software platform.

Read more ...

Optimizing RoBERTa: Fine-Tuning with Mixed Precision on AMD

29 July 2024

In this blog we explore how to fine-tune the Robustly Optimized BERT Pretraining Approach (RoBERTa) large language model, with emphasis on PyTorch’s mixed precision capabilities. Specifically, we explore using AMD GPUs for mixed precision fine-tuning to achieve faster model training without any major impacts on accuracy.

Read more ...

Using statistical methods to reliably compare algorithm performance in large generative AI models with JAX Profiler on AMD GPUs

22 July 2024

This blog provides a comprehensive guide on measuring and comparing the performance of various algorithms in a JAX-implemented generative AI model. Leveraging the JAX Profiler and statistical analysis, this blog demonstrates how to reliably evaluate key steps and compare algorithm performance on AMD GPUs.

Read more ...

DBRX Instruct on AMD GPUs

11 July 2024

In this blog, we showcase DBRX Instruct, a mixture-of-experts large language model developed by Databricks, on a ROCm-capable system with AMD GPUs.

Read more ...

Accelerate PyTorch Models using torch.compile on AMD GPUs with ROCm

11 July 2024

PyTorch 2.0 introduces torch.compile(), a tool to vastly accelerate PyTorch code and models. By converting PyTorch code into highly optimized kernels, torch.compile delivers substantial performance improvements with minimal changes to the existing codebase. This feature allows for precise optimization of individual functions, entire modules, and complex training loops, providing a versatile and powerful tool for enhancing computational efficiency.

Read more ...

A Guide to Implementing and Training Generative Pre-trained Transformers (GPT) in JAX on AMD GPUs

02 July 2024

2 July, 2024 by

.

Read more ...

Fine-tuning and Testing Cutting-Edge Speech Models using ROCm on AMD GPUs

27 June 2024

AI Voice agents, or voice bots, are designed to communicate with people using a spoken language. Voice bots are commonly deployed in customer service and personal assistant applications, and have the potential to enter and revolutionize almost every aspect of people’s interaction with technology that can benefit from the use of voice. Automatic Speech Recognition (ASR), the technology that processes human speech into text, is essential for the creation of AI Voice agents. In this blog post we will provide you with a hands-on introduction to the deployment of three machine learning ASR models, using ROCm on AMD GPUs.

Read more ...

TensorFlow Profiler in practice: Optimizing TensorFlow models on AMD GPUs

18 June 2024

TensorFlow Profiler consists of a set of tools designed to measure resource utilization and performance during the execution of TensorFlow models. It offers insights into how a model interacts with hardware resources, including execution time and memory usage. TensorFlow Profiler helps in pinpointing performance bottlenecks, allowing us to fine-tune the execution of models for improved efficiency and faster outcomes which can be crucial in scenarios where near-real-time predictions are required.

Read more ...

Segment Anything with AMD GPUs

04 June 2024

4 Jun, 2024 by

.

Read more ...

Unveiling performance insights with PyTorch Profiler on an AMD GPU

29 May 2024

29 May, 2024 by

.

Read more ...

Panoptic segmentation and instance segmentation with Detectron2 on AMD GPUs

23 May 2024

23, May 2024 by

.

Read more ...

Accelerating Large Language Models with Flash Attention on AMD GPUs

15 May 2024

15, May 2024 by

.

Read more ...

Step-by-Step Guide to Use OpenLLM on AMD GPUs

01 May 2024

OpenLLM is an open-source platform designed to facilitate the deployment and utilization of large language models (LLMs), supporting a wide range of models for diverse applications, whether in cloud environments or on-premises. In this tutorial, we will guide you through the process of starting an LLM server using OpenLLM, enabling interaction with the server from your local machine, with special emphasis on leveraging the capabilities of AMD GPUs.

Read more ...

Inferencing with Mixtral 8x22B on AMD GPUs

01 May 2024

1, May 2024 by

.

Read more ...

Training a Neural Collaborative Filtering (NCF) Recommender on an AMD GPU

30 April 2024

30, Apr 2024 by

.

Read more ...

Table Question-Answering with TaPas

26 April 2024

26 Apr, 2024 by

.

Read more ...

Multimodal (Visual and Language) understanding with LLaVA-NeXT

26 April 2024

26, Apr 2024 by

.

Read more ...

Unlocking Vision-Text Dual-Encoding: Multi-GPU Training of a CLIP-Like Model

24 April 2024

24 Apr, 2024 by

.

Read more ...

Transforming Words into Motion: A Guide to Video Generation with AMD GPU

24 April 2024

24 Apr, 2024 by

.

Read more ...

Inferencing with AI2’s OLMo model on AMD GPU

17 April 2024

In this blog, we will show you how to generate text using AI2’s OLMo model on AMD GPU.

Read more ...

Text Summarization with FLAN-T5

16 April 2024

In this blog, we showcase the language model FLAN-T5 and how to fine-tune it on a summarization task with HuggingFace in an AMD GPUs + ROCm system.

Read more ...

Speech-to-Text on an AMD GPU with Whisper

16 April 2024

16 Apr, 2024 by

.

Read more ...

PyTorch C++ Extension on AMD GPU

16 April 2024

16, Apr 2024 by

.

Read more ...

Programming AMD GPUs with Julia

16 April 2024

Julia is a high-level, general-purpose dynamic programming language that automatically compiles to efficient native code via LLVM, and supports multiple platforms. With LLVM, comes the support for programming GPUs, including AMD GPUs.

Read more ...

Program Synthesis with CodeGen

16 April 2024

CodeGen is a family of standard transformer-based auto-regressive language models for program synthesis, which as defined by the authors as a method for generating computer programs that solve specified problems, using input-output examples or natural language descriptions.

Read more ...

Interacting with Contrastive Language-Image Pre-Training (CLIP) model on AMD GPU

16 April 2024

16, Apr 2024 by

.

Read more ...

Instruction fine-tuning of StarCoder with PEFT on multiple AMD GPUs

16 April 2024

16 Apr, 2024 by

.

Read more ...

Enhancing LLM Accessibility: A Deep Dive into QLoRA Through Fine-tuning Llama Model on a single AMD GPU

15 April 2024

15, Apr 2024 by

.

Read more ...

Enhancing LLM Accessibility: A Deep Dive into QLoRA Through Fine-tuning Llama 2 on a single AMD GPU

15 April 2024

15, Apr 2024 by

.

Read more ...

Developing Triton Kernels on AMD GPUs

15 April 2024

19 March 2025

OpenAI has developed a powerful GPU focused programming language and compiler called Triton that works seamlessly with AMD GPUs. The goal of Triton is to enable AI engineers and scientists to write high-performant GPU code with minimal expertise. Triton kernels are performant because of their blocked program representation, allowing them to be compiled into highly optimized binary code. Triton also leverages Python for kernel development, making it both familiar and accessible. And the kernels can be easily compiled by simply declaring the triton.jit python decorator before the kernel.

Read more ...

GPU Unleashed: Training Reinforcement Learning Agents with Stable Baselines3 on an AMD GPU in Gymnasium Environment

11 April 2024

11 Apr, 2024 by

.

Read more ...

ResNet for image classification using AMD GPUs

09 April 2024

9 Apr, 2024 by

.

Read more ...

Small language models with Phi-2

08 April 2024

Like many other LLMs, Phi-2 is a transformer-based model with a next-word prediction objective that is trained on billions of tokens. At 2.7 billion parameters, Phi-2 is a relatively small language model, but it achieves outstanding performance on a variety of tasks, including common sense reasoning, language understanding, math, and coding. For reference, GPT 3.5 has 175 billion parameters and the smallest version of LLaMA-2 has 7 billion parameters. According to Microsoft, Phi-2 is capable of matching or outperforming models up to 25 times larger due to more carefully curated training data and model scaling.

Read more ...

Using the ChatGLM-6B bilingual language model with AMD GPUs

04 April 2024

ChatGLM-6B is an open bilingual (Chinese-English) language model with 6.2 billion parameters. It’s optimized for Chinese conversation based on General Language Model (GLM) architecture. GLM is a pretraining framework that seeks to combine the strengths of autoencoder models (like BERT) and autoregressive models (like GPT). The GLM framework randomly blanks out continuous spans of tokens from the input text (autoencoding methodology) and trains the model to sequentially reconstruct the spans (autoregressive pretraining methodology).

Read more ...

Total body segmentation using MONAI Deploy on an AMD GPU

04 April 2024

4, Apr 2024 by

.

Read more ...

Retrieval Augmented Generation (RAG) using LlamaIndex

04 April 2024

4, Apr 2024 by

.

Read more ...

Image classification using Vision Transformer with AMD GPUs

04 April 2024

4 Apr, 2024 by

.

Read more ...

Building semantic search with SentenceTransformers on AMD

04 April 2024

4 Apr, 2024 by

.

Read more ...

Scale AI applications with Ray

01 April 2024

1, Apr 2024 by

Logan Grado, {hoverxref}Eliot Li.

Read more ...

Automatic mixed precision in PyTorch using AMD GPUs

29 March 2024

As models increase in size, the time and memory needed to train them–and consequently, the cost–also increases. Therefore, any measures we take to reduce training time and memory usage can be highly beneficial. This is where Automatic Mixed Precision (AMP) comes in.

Read more ...

Large language model inference optimizations on AMD GPUs

15 March 2024

15, Mar 2024 by

.

Read more ...

Building a decoder transformer model on AMD GPU(s)

12 March 2024

12, Mar 2024 by

.

Read more ...

Question-answering Chatbot with LangChain on an AMD GPU

11 March 2024

11, Mar 2024 by

.

Read more ...

Music Generation With MusicGen on an AMD GPU

08 March 2024

MusicGen is an autoregressive, transformer-based model that predicts the next segment of a piece of music based on previous segments. This is a similar approach to language models predicting the next token.

Read more ...

Efficient image generation with Stable Diffusion models and ONNX Runtime using AMD GPUs

23 February 2024

23 Feb, 2024 by

.

Read more ...

Simplifying deep learning: A guide to PyTorch Lightning

08 February 2024

8, Feb 2024 by

.

Read more ...

Two-dimensional images to three-dimensional scene mapping using NeRF on an AMD GPU

07 February 2024

7, Feb 2024 by

.

Read more ...

Using LoRA for efficient fine-tuning: Fundamental principles

05 February 2024

5, Feb 2024 by

.

Read more ...

Fine-tune Llama model with LoRA: Customizing a large language model for question-answering

01 February 2024

1, Feb 2024 by

.

Read more ...

Fine-tune Llama 2 with LoRA: Customizing a large language model for question-answering

01 February 2024

1, Feb 2024 by

.

Read more ...

Pre-training BERT using Hugging Face & TensorFlow on an AMD GPU

29 January 2024

29, Jan 2024 by

.

Read more ...

Pre-training BERT using Hugging Face & PyTorch on an AMD GPU

26 January 2024

26, Jan 2024 by

.

Read more ...

Accelerating XGBoost with Dask using multiple AMD GPUs

26 January 2024

26 Jan, 2024 by

.

Read more ...

LLM distributed supervised fine-tuning with JAX

25 January 2024

25 Jan, 2024 by

.

Read more ...

Pre-training a large language model with Megatron-DeepSpeed on multiple AMD GPUs

24 January 2024

24 Jan, 2024 by

.

Read more ...

Efficient image generation with Stable Diffusion models and AITemplate using AMD GPUs

24 January 2024

24 Jan, 2024 by

.

Read more ...

Efficient deployment of large language models with Text Generation Inference on AMD GPUs

24 January 2024

24 Jan, 2024 by

.

Read more ...

Creating a PyTorch/TensorFlow code environment on AMD GPUs

11 September 2023

Goal: The machine learning ecosystem is quickly exploding and we aim to make porting to AMD GPUs simple with this series of machine learning blogposts.

Read more ...