Posts tagged Hardware
ROCm Becomes a First-Class Platform in the vLLM Ecosystem
- 21 January 2026
As the generative AI ecosystem matures, vLLM is embracing a multivendor approach, and the quality of support across hardware platforms has become a defining priority: developers expect consistent, high-performance behavior no matter which GPU they choose. Today, we are proud to announce a major realization of that vision: AMD ROCm™ is now a first-class platform in the vLLM ecosystem.
Quickly Developing Powerful Flash Attention Using TileLang on AMD Instinct MI300X GPU
- 20 January 2026
As the AMD ROCm™ software ecosystem develops rapidly, the high barrier to operator development has long been a bottleneck, and the emergence of TileLang gives developers an efficient way past it. As an emerging AI operator development framework, TileLang encapsulates low-level GPU details behind concise syntax, enabling developers to tap the full computing potential of AMD GPUs without in-depth knowledge of low-level languages such as HIP. The AMD Instinct™ MI300X GPU, a flagship GPU for AI workloads, pairs ultra-high-bandwidth memory with powerful compute units, but it needs well-adapted, high-performance operators to unleash those capabilities. In this blog, we take Flash Attention, a key kernel in both LLM training and inference, as an example to demonstrate the full TileLang-based development process on the MI300X, highlighting the dual benefits of efficiency and performance that TileLang brings to AMD operator development.
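To give a feel for that concise syntax, here is a minimal TileLang tiled-GEMM sketch adapted from TileLang's public matmul example (the post itself builds Flash Attention; this simpler kernel only illustrates the programming model, and API details such as `T.Kernel`, `T.gemm`, and `tilelang.compile` may vary across TileLang versions):

```python
# Minimal TileLang tiled GEMM, adapted from TileLang's public matmul
# example. Tiling, LDS staging, and software pipelining are expressed
# in Python; TileLang lowers T.gemm to matrix-core (MFMA) instructions
# on MI300X.
import tilelang
import tilelang.language as T

def matmul(M, N, K, block_M=128, block_N=128, block_K=32,
           dtype="float16", accum_dtype="float"):
    @T.prim_func
    def main(
        A: T.Tensor((M, K), dtype),
        B: T.Tensor((K, N), dtype),
        C: T.Tensor((M, N), dtype),
    ):
        # One work-group per (block_M x block_N) tile of C.
        with T.Kernel(T.ceildiv(N, block_N), T.ceildiv(M, block_M),
                      threads=128) as (bx, by):
            A_shared = T.alloc_shared((block_M, block_K), dtype)  # LDS tile of A
            B_shared = T.alloc_shared((block_K, block_N), dtype)  # LDS tile of B
            C_local = T.alloc_fragment((block_M, block_N), accum_dtype)
            T.clear(C_local)
            # Pipelined reduction over K: overlap tile copies with compute.
            for ko in T.Pipelined(T.ceildiv(K, block_K), num_stages=3):
                T.copy(A[by * block_M, ko * block_K], A_shared)
                T.copy(B[ko * block_K, bx * block_N], B_shared)
                T.gemm(A_shared, B_shared, C_local)
            T.copy(C_local, C[by * block_M, bx * block_N])

    return main

# Compile to a callable GPU kernel (API name per current TileLang docs).
kernel = tilelang.compile(matmul(4096, 4096, 4096))
```

The Flash Attention kernel developed in the post composes the same tile-level primitives, with the online-softmax recurrence expressed in ordinary Python control flow.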
Applying Compute Partitioning for Workloads on MI300X GPUs
- 14 January 2026
This blog explains how to use AMD GPU compute partitioning to increase throughput and utilization, and to reduce time-to-results, for two different types of workloads.
Matrix Core Programming on AMD CDNA™3 and CDNA™4 Architectures
- 30 September 2025
In this blog post, we walk through how to use Matrix Cores in HIP kernels, with a focus on low-precision data types such as FP16, FP8, and FP4, as well as the new family of Matrix Core instructions with exponent block scaling introduced in the AMD CDNA™4 architecture. Through code examples and illustrations, we provide the necessary knowledge to start programming Matrix Cores, covering modern low-precision floating-point types, the Matrix Core compiler intrinsics, and the data layouts required by the Matrix Core instructions.
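For context on the block-scaling piece, here is the idea in one formula. This is background from the OCP microscaling (MX) specification that these scaled instructions target, not an excerpt from the post: each block of 32 low-precision elements shares a single 8-bit power-of-two scale, so a stored element $m_i$ (e.g., an FP4 E2M1 value) represents

$$
v_i = 2^{\,E - 127} \cdot m_i, \qquad i = 1, \dots, 32,
$$

where $E$ is the block's shared E8M0 exponent byte (bias 127). The CDNA™4 exponent-block-scaling Matrix Core instructions apply these per-block scales as part of the matrix multiply itself, rather than in separate dequantization code.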
Introducing Instella: New State-of-the-art Fully Open 3B Language Models
- 05 March 2025
AMD is excited to announce Instella, a family of fully open state-of-the-art 3-billion-parameter language models (LMs) trained from scratch on AMD Instinct™ MI300X GPUs. Instella models outperform existing fully open models of similar sizes and achieve competitive performance compared to state-of-the-art open-weight models such as Llama-3.2-3B, Gemma-2-2B, and Qwen-2.5-3B, including their instruction-tuned counterparts.
Understanding Peak, Max-Achievable & Delivered FLOPs, Part 1
- 14 February 2025
The purpose of this blog post is to explain the differences between Peak FLOPs and Max-Achievable FLOPs. After reading, users will know how AMD measures maximum delivered performance and how AMD recommends that measured device performance be used.
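As a rough illustration of the "peak" side (the arithmetic below uses AMD's published MI300X FP16 figures of 304 CUs, 2048 FP16 matrix FLOPs per CU per clock, and a 2.1 GHz peak engine clock; see the post for how max-achievable FLOPs is actually measured), peak FLOPs is a derived architectural upper bound:

$$
\text{Peak FLOPs} = N_{\text{CU}} \times \frac{\text{FLOPs}}{\text{CU} \cdot \text{clock}} \times f_{\text{max}}
= 304 \times 2048 \times 2.1\,\text{GHz} \approx 1307\ \text{TFLOPs (FP16)}.
$$

Max-achievable FLOPs, by contrast, is what a well-tuned kernel sustains on a real device under power and thermal limits, and delivered FLOPs is what a given workload actually obtains; both sit below peak.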