Posts by Anshul Gupta
AMD Integrates llm-d on AMD Instinct MI300X Cluster For Distributed LLM Serving
- 20 May 2025
AMD has successfully deployed the open-source llm-d framework on AMD Kubernetes infrastructure as part of our efforts toward distributed large language model inference at scale. llm-d leverages a Kubernetes-native toolkit to streamline LLM serving with features like KV-cache-aware routing, distributed scheduling, and integration with the Inference Gateway (IGW). In this blog we showcase an initial deployment on an AMD cluster with distributed prefill and decode stages for a Llama model.
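The post covers KV-cache-aware routing in depth; conceptually, the router prefers the replica whose cached prompts share the longest token prefix with the incoming request, so decode can reuse cached KV entries instead of recomputing them. A minimal pure-Python sketch of that scoring idea (function and data-structure names are illustrative, not the llm-d API):

```python
def shared_prefix_len(a, b):
    """Length of the common token prefix of two token sequences."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

def route(prompt_tokens, replicas):
    """Pick the replica whose KV cache overlaps most with the prompt.

    `replicas` maps replica name -> list of token sequences it has cached.
    A real scheduler would break ties on load; here ties fall back to
    dict order, standing in for that policy.
    """
    def score(name):
        cached = replicas[name]
        return max((shared_prefix_len(prompt_tokens, c) for c in cached),
                   default=0)
    return max(replicas, key=score)
```

For example, a request starting with a system prompt already cached on one replica is routed there even if another replica holds unrelated entries.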
Unleash Full GPU Potential: Overlap Communication and Computation with Triton-Distributed
- 06 May 2025
In distributed computing, AI workloads demand both massive parallelism and efficient data movement. A primary challenge lies in effectively overlapping computation with communication to maximize performance. GPUs are excellent at crunching numbers, but their full potential often remains untapped because of relatively slow inter-GPU communication: compute units sit idle for long stretches while waiting for data to arrive from other nodes. In this blog, we show how you can use the Triton-Distributed framework to generate kernels that overlap communication and computation, achieving performance that can rival highly optimized libraries.
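The overlap idea itself is independent of any one framework: while the GPU computes on the tile it already has, the next tile is fetched in parallel, hiding transfer latency behind useful work. A conceptual double-buffering sketch in plain Python (threads stand in for transfer engines; `fetch` and `compute` are placeholders, not Triton-Distributed calls):

```python
from concurrent.futures import ThreadPoolExecutor

def fetch(chunk):
    # Stand-in for an inter-GPU transfer (e.g., gathering one remote tile).
    return [x * 2 for x in chunk]

def compute(data):
    # Stand-in for the local kernel (e.g., a GEMM on the received tile).
    return sum(data)

def pipelined(chunks):
    """Overlap the fetch of chunk i+1 with compute on chunk i."""
    results = []
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = pool.submit(fetch, chunks[0])
        for nxt in chunks[1:]:
            data = pending.result()            # wait for the in-flight transfer
            pending = pool.submit(fetch, nxt)  # start the next transfer...
            results.append(compute(data))      # ...while computing on this tile
        results.append(compute(pending.result()))
    return results
```

Triton-Distributed applies the same principle inside generated kernels, where the "threads" are hardware copy/compute resources rather than OS threads.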
AITER: AI Tensor Engine For ROCm
- 21 March 2025
Performance optimization is critical when working with GPUs, especially for tasks involving artificial intelligence, which can be extremely demanding. To fully leverage the capabilities of advanced hardware, it’s essential to master optimization strategies and ensure every available resource is used efficiently. In this blog we provide an overview of AMD’s AI Tensor Engine for ROCm (AITER) and show how easy it is to integrate AITER kernels into basic LLM training and inference workloads. AITER lets developers focus on creating operators while allowing customers to seamlessly integrate this operator collection into their own private, public, or custom frameworks.
Deploying Google’s Gemma 3 Model with vLLM on AMD Instinct™ MI300X GPUs: A Step-by-Step Guide
- 14 March 2025
AMD is excited to announce the integration of Google’s Gemma 3 models with AMD Instinct MI300X GPUs, optimized for high-performance inference using the vLLM framework. This collaboration empowers developers to harness advanced AMD AI hardware for scalable, efficient deployment of state-of-the-art language models. In this blog we walk you through deploying Google’s Gemma 3 model using vLLM on AMD Instinct GPUs step by step, covering Docker setup, dependencies, authentication, and inference testing. Remember, the Gemma 3 model is gated—ensure you request access before beginning deployment.
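Once the container and authentication steps from the post are done, offline inference through vLLM's Python API follows a standard shape. A minimal sketch (the model id and the Gemma chat-turn markers are assumptions taken from the Gemma model card; verify them against the tokenizer's chat template, and note the gated model requires prior Hugging Face access):

```python
def gemma_chat_prompt(user_text):
    """Wrap user text in Gemma's assumed chat turn markers."""
    return (f"<start_of_turn>user\n{user_text}<end_of_turn>\n"
            "<start_of_turn>model\n")

if __name__ == "__main__":
    # Requires a GPU, an installed vLLM, and `huggingface-cli login` with an
    # account that has been granted access to the gated Gemma 3 weights.
    from vllm import LLM, SamplingParams

    llm = LLM(model="google/gemma-3-4b-it")  # model id is an assumption
    params = SamplingParams(temperature=0.7, top_p=0.9, max_tokens=256)
    prompt = gemma_chat_prompt("Explain mixed precision in one sentence.")
    out = llm.generate([prompt], params)
    print(out[0].outputs[0].text)
```

The post's Docker image provides the ROCm-enabled vLLM build, so the same script runs unchanged on MI300X.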
Optimized ROCm Docker for Distributed AI Training
- 13 March 2025
This blog introduces the updated AMD Docker image, built and optimized specifically for distributed training. As you will see, the optimized AMD ROCm Docker image makes training large AI models faster and more efficient. It includes better fine-tuning tools, improved performance for multi-GPU setups, and support for FP8 precision, which speeds up training while using less memory, giving you a smoother and more efficient training experience on popular models such as Flux and Llama 3.1 running on AMD GPUs.
GEMM Kernel Optimization For AMD GPUs
- 06 February 2025
Matrix multiplication underlies critical computational pathways in AI, with General Matrix Multiplication (GEMM) operations serving as performance-critical kernels in neural network architectures. From fully connected layers to convolutions and transformer attention mechanisms, GEMMs consume substantial computational and memory resources in large language models (LLMs). This blog explores GEMM optimization techniques for AMD GPUs, demonstrating methodologies to significantly enhance computational efficiency and performance scaling.
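The central optimization the post develops is tiling: partitioning the matrices into blocks so each block is loaded once and reused many times while it is resident in fast memory (caches on a CPU, LDS/shared memory on an AMD GPU). A pure-Python sketch of blocked GEMM illustrating the loop structure (illustrative only; real kernels express the same tiling in HIP or assembly):

```python
def gemm_blocked(a, b, tile=32):
    """Blocked GEMM computing C = A @ B.

    The three outer loops walk tiles; the inner loops reuse each tile of A
    and B many times before moving on, which is what reduces memory traffic.
    On a GPU, each (i0, j0) tile maps to a workgroup and the k0 loop stages
    sub-tiles through shared memory.
    """
    m, k, n = len(a), len(b), len(b[0])
    c = [[0.0] * n for _ in range(m)]
    for i0 in range(0, m, tile):
        for j0 in range(0, n, tile):
            for k0 in range(0, k, tile):
                for i in range(i0, min(i0 + tile, m)):
                    for kk in range(k0, min(k0 + tile, k)):
                        aik = a[i][kk]
                        for j in range(j0, min(j0 + tile, n)):
                            c[i][j] += aik * b[kk][j]
    return c
```

The tile size trades reuse against fast-memory capacity; tuning it per GPU is one of the knobs the post explores.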