Posts by Yao Fu
An Introduction to Primus-Turbo: A Library for Accelerating Transformer Models on AMD GPUs
- 19 September 2025
With the rapid growth of large-scale models, acceleration libraries face higher demands: they must deliver exceptional performance, offer comprehensive functionality, and remain easy to use. To meet these needs, we introduce Primus-Turbo — part of the Primus product family (see our previous blog for background). Primus-Turbo is designed around three core principles: performance, completeness, and ease of use. It supports training, inference, and a wide range of application scenarios, giving developers a solid foundation for efficiently building and optimizing large models on the ROCm platform. See Figure 1 below for a comprehensive view of the Primus-Turbo stack.
Primus: A Lightweight, Unified Training Framework for Large Models on AMD GPUs
- 22 August 2025
Training large language models (LLMs) at scale is inherently complex. Different frameworks expose inconsistent interfaces, multi-GPU and distributed setups require brittle scripting, and backend-specific quirks introduce overhead that slows down training iterations. Primus tackles these challenges with a streamlined, backend-agnostic training framework that helps developers launch, customize, and scale training jobs faster on AMD GPUs.
Day 0 Developer Guide: Running the Latest Open Models from OpenAI on AMD AI Hardware
- 05 August 2025
OpenAI has officially released its open models: gpt-oss-120b and gpt-oss-20b. AMD now provides out-of-the-box, day 0 support for the latest open models from OpenAI, enabling developers to easily fine-tune and deploy them across environments, from cloud to client, using AMD hardware, the AMD ROCm™ and AMD Ryzen™ AI software stacks, and seamless open-source integrations. At AMD, we’re excited to announce day 0 support across our AI hardware, including our flagship AMD Instinct™ MI355X and MI300X GPUs, AMD Radeon™ AI PRO R9700 GPUs, and AMD Ryzen™ AI processors.
Optimized ROCm Docker for Distributed AI Training
- 13 March 2025
This blog introduces the updated AMD Docker image, built and optimized specifically for distributed training. As you will see, the optimized AMD ROCm Docker image makes training large AI models faster and more efficient. It includes updates such as improved fine-tuning tools, better performance for multi-GPU setups, and support for FP8 precision, which speeds up training while using less memory. Together, these changes provide a smoother, more efficient training experience on popular models such as Flux and Llama 3.1 running on AMD GPUs.