Posts by Lei Zhang
Resilient Large-Scale Training: Integrating TorchFT with TorchTitan on AMD GPUs
- 08 February 2026
Training large AI models on AMD GPUs demands unwavering stability and robust fault-tolerance capabilities at cluster scale. Yet today’s ROCm-based multi-node GPU deployments often rely on brittle checkpoint-and-restart mechanisms to recover from failures. This approach wastes precious compute cycles and slows down training as model sizes and cluster scales grow. To address these challenges, we integrated PyTorch’s native fault-tolerance framework—TorchFT—with the TorchTitan training framework on AMD’s Primus-SaFE Kubernetes platform, achieving resilient, checkpoint-less training at hundred-GPU scale. This blog builds upon our previous work on the Primus ecosystem—for background on the platform architecture, see our earlier posts on Primus-SaFE, the Primus training framework, and training large models with Primus.
Unleash Full GPU Potential: Overlap Communication and Computation with Triton-Distributed
- 06 May 2025
In distributed computing, AI workloads demand both massive parallelism and efficient data movement, and a primary challenge lies in effectively overlapping computation with communication to maximize performance. GPUs excel at raw computation, but their full potential often goes untapped because inter-GPU communication is comparatively slow, leaving compute units idle for long stretches while they wait for data to arrive from other nodes. In this blog, we will show how you can use the Triton-Distributed framework to generate kernels that overlap communication and computation, achieving performance that can rival highly optimized libraries.
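To build intuition for why overlap helps, here is a minimal, purely illustrative sketch. It simulates "communication" and "computation" with timed sleeps on the host and uses a background thread to prefetch the next tile while the current one is processed; the helper names (`fake_transfer`, `compute_tile`) are hypothetical stand-ins, not Triton-Distributed APIs, which instead fuse this overlap into generated device kernels.

```python
# Illustrative sketch only: simulates overlapping communication with
# computation using Python threads in place of real GPU streams/kernels.
import time
from concurrent.futures import ThreadPoolExecutor

def fake_transfer(tile):
    time.sleep(0.05)  # stand-in for inter-GPU communication latency
    return tile

def compute_tile(tile):
    time.sleep(0.05)  # stand-in for a matmul on one tile
    return [x * 2 for x in tile]

tiles = [[1, 2], [3, 4], [5, 6], [7, 8]]

# Serial: each tile is fully transferred before any compute starts on it.
t0 = time.perf_counter()
serial = [compute_tile(fake_transfer(t)) for t in tiles]
serial_time = time.perf_counter() - t0

# Overlapped: prefetch tile i+1 on a background thread while computing tile i.
t0 = time.perf_counter()
overlapped = []
with ThreadPoolExecutor(max_workers=1) as io:
    pending = io.submit(fake_transfer, tiles[0])
    for i in range(len(tiles)):
        tile = pending.result()          # wait only for the current tile
        if i + 1 < len(tiles):
            pending = io.submit(fake_transfer, tiles[i + 1])  # prefetch next
        overlapped.append(compute_tile(tile))
overlap_time = time.perf_counter() - t0

print(f"serial: {serial_time:.2f}s, overlapped: {overlap_time:.2f}s")
```

The serial loop pays for every transfer and every compute back-to-back, while the overlapped loop hides most transfer latency behind computation, which is exactly the effect Triton-Distributed targets on-device at kernel granularity.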