Posts by Wen Xie

Deep Dive into Primus: High-Performance Training for Large Language Models

Primus is the AMD unified training framework designed to deliver high-performance, scalable large language model (LLM) training across multiple backends, including TorchTitan and Megatron-LM. It provides a consistent CLI interface, while each backend ships with carefully optimized configurations for popular open-source models. These backend-specific presets ensure the best out-of-the-box performance on AMD Instinct™ GPUs. In this deep dive, we walk through best practices for achieving peak performance when training dense LLMs with Primus.

Read more ...


MoE Training Best Practices on AMD GPUs

This blog covers best practices for training Mixture-of-Experts (MoE) models on AMD Instinct™ MI300/MI355-series GPUs with the ROCm ecosystem. Whether you’re new to distributed MoE architectures or optimizing trillion-parameter models, this guide will help you identify bottlenecks and maximize efficiency on AMD hardware.

Read more ...


Optimizing LLM Workloads: AMD Instinct MI355X GPUs Drive Competitive Performance

AI training workloads are pushing the limits of modern GPU architectures. With the release of AMD ROCm™ 7.0 software, AMD is raising the bar for high-performance training by delivering optimized support for LLM workloads across the JAX and PyTorch frameworks. The latest v25.9 training Docker images demonstrate exceptional scaling efficiency in both single-node and multi-node setups, empowering researchers and developers to push model size and complexity further than ever.

Read more ...


An Introduction to Primus-Turbo: A Library for Accelerating Transformer Models on AMD GPUs

With the rapid growth of large-scale models, acceleration libraries face higher demands: they must deliver exceptional performance, offer comprehensive functionality, and remain easy to use. To meet these needs, we introduce Primus-Turbo, part of the Primus product family (see our previous blog for background). Primus-Turbo is designed around three core principles: performance, completeness, and ease of use. It supports training, inference, and a wide range of application scenarios, giving developers a solid foundation for efficiently building and optimizing large models on the ROCm platform. Figure 1 below shows the full stack coverage of Primus-Turbo.

Read more ...


Primus: A Lightweight, Unified Training Framework for Large Models on AMD GPUs

Training large language models (LLMs) at scale is inherently complex. Different frameworks expose inconsistent interfaces, multi-GPU and distributed setups require brittle scripting, and backend-specific quirks introduce overhead that slows down training iterations. Primus tackles these challenges with a streamlined, backend-agnostic training framework that helps developers launch, customize, and scale training jobs faster on AMD GPUs.

Read more ...