Posts by Xiaoming Peng

MoE Training Best Practices on AMD GPUs

This blog covers best practices for training Mixture-of-Experts (MoE) models on AMD Instinct™ MI300/MI355-series GPUs with the ROCm ecosystem. Whether you’re new to distributed MoE training or already optimizing trillion-parameter models, this guide will help you identify bottlenecks and maximize efficiency on AMD hardware.

Read more ...


Primus: A Lightweight, Unified Training Framework for Large Models on AMD GPUs

Training large language models (LLMs) at scale is inherently complex. Different frameworks expose inconsistent interfaces, multi-GPU and distributed setups require brittle scripting, and backend-specific quirks introduce overhead that slows down training iterations. Primus tackles these challenges as a streamlined, backend-agnostic training framework that helps developers launch, customize, and scale training jobs faster on AMD GPUs.

Read more ...