Zhenyu Gu

Zhenyu leads the training-at-scale team at AMD. He has extensive experience building high-performance AI/ML infrastructure that covers the end-to-end AI/ML stack, particularly large-scale GPU clusters. He has led several 100B+ LLM pre-training, post-training, and inference-serving projects. Zhenyu received his Ph.D. from the EECS Department at Northwestern University.

Posts by Zhenyu Gu

July 13, 2026

Triton-Based Optimization of Video Sparse Attention on ROCm

Optimize video sparse attention on ROCm with GEAK and linear global context for faster, more stable video generation on AMD GPUs.

https://rocm.blogs.amd.com/artificial-intelligence/rocm-vsa/README.html

July 06, 2026

Primus Tuning Agent: Closing the Configuration-Search Loop

Use the Primus Tuning Agent to automatically find optimal LLM training configurations on AMD Instinct GPUs.

https://rocm.blogs.amd.com/software-tools-optimization/primus-tuning-agent/README.html

July 03, 2026

AgentKernelArena: Benchmarking AI Coding Agents for GPU Kernel Optimization on AMD Instinct GPUs

Explore how AI coding agents compare on real GPU kernel optimization with AgentKernelArena, AMD's open benchmarking arena for Instinct™ GPUs.

https://rocm.blogs.amd.com/software-tools-optimization/agent-kernel-arena/README.html

June 10, 2026

Dropless MoE Training in JAX with Primus-Turbo

Learn how to train dropless MoE in JAX/MaxText with Primus-Turbo's grouped GEMM and DeepEP all-to-all for faster, more memory-efficient training.

https://rocm.blogs.amd.com/software-tools-optimization/maxtext-dropless-moe/README.html

May 29, 2026

Enabling Speculative Speculative Decoding on MI300X

An introduction to the speculative speculative decoding method. We enable this method on AMD Instinct MI300X GPUs and report the results.

https://rocm.blogs.amd.com/artificial-intelligence/ssd_mi300x/README.html

April 24, 2026

Primus Projection: Estimate Memory and Performance Before You Train

Learn how to use the Primus projection tool to estimate memory and performance for large-scale LLM training on AMD Instinct™ accelerator platforms.

https://rocm.blogs.amd.com/software-tools-optimization/primus-projection/README.html

March 09, 2026

Agentic Diagnosis for LLM Training at Scale

Explore how AI agents diagnose LLM training incidents — from RCCL hangs to throughput regressions — in one prompt with MaxText-Slurm.

https://rocm.blogs.amd.com/software-tools-optimization/maxtext-slurm-agentic-diagnosis/README.html

March 02, 2026

MaxText-Slurm: Production-Grade LLM Training with Built-In Observability

MaxText-Slurm: A unified launch system for production-grade LLM training with observability on AMD GPU clusters.

https://rocm.blogs.amd.com/software-tools-optimization/maxtext-slurm/README.html

February 23, 2026

Primus-Pipeline: A More Flexible and Scalable Pipeline Parallelism Implementation

Learn how to use our flexible and scalable pipeline parallelism framework with Primus backend and AMD hardware.

https://rocm.blogs.amd.com/software-tools-optimization/primus-pipeline/README.html

February 08, 2026

Resilient Large-Scale Training: Integrating TorchFT with TorchTitan on AMD GPUs

Achieve resilient, checkpoint-less distributed training on AMD GPUs by integrating TorchFT with TorchTitan on Primus-SaFE.

https://rocm.blogs.amd.com/artificial-intelligence/primus-torchft/README.html

December 16, 2025

MoE Training Best Practices on AMD GPUs

Learn how to optimize Mixture-of-Experts (MoE) model training on AMD Instinct GPUs with ROCm. Maximize your AI training performance now!

https://rocm.blogs.amd.com/software-tools-optimization/primus-moe-package/README.html

November 04, 2025

Stability at Scale: AMD’s Full‑Stack Platform for Large‑Model Training

Primus streamlines LLM training on AMD GPUs with unified configs, multi-backend support, preflight validation, and structured logging.

https://rocm.blogs.amd.com/software-tools-optimization/primus-SaFE/README.html

September 19, 2025

An Introduction to Primus-Turbo: A Library for Accelerating Transformer Models on AMD GPUs

Primus streamlines training on AMD ROCm, from fine-tuning to massive pretraining on MI300X GPUs: faster, safer, and easier to debug.

https://rocm.blogs.amd.com/software-tools-optimization/primus-large-models/README.html

August 05, 2025

Day 0 Developer Guide: Running the Latest Open Models from OpenAI on AMD AI Hardware

Day 0 support across our AI hardware ecosystem, from our flagship AMD Instinct™ MI355X and MI300X GPUs to AMD Radeon™ AI PRO R700 GPUs and AMD Ryzen™ AI Processors.

https://rocm.blogs.amd.com/ecosystems-and-partners/openai-day-0/README.html