Posts by Bill He
Unleashing AMD Instinct™ MI300X GPUs for LLM Serving: Disaggregating Prefill & Decode with SGLang
- 28 August 2025
LLM inference pipelines are hitting a scalability wall as prefill and decode phases compete for the same compute, causing latency spikes and underutilized resources. DistServe tackles this by disaggregating prefill and decode computation across separate GPUs—eliminating interference, decoupling resource planning, and unlocking new levels of optimization for both time-to-first-token (TTFT) and time-per-output-token (TPOT).
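The dataflow behind disaggregation can be sketched in a few lines. This is a toy illustration only, with hypothetical names, not the SGLang or DistServe API: a prefill worker processes the whole prompt once and hands off a KV cache, and a separate decode worker extends that cache token by token, so the two phases never contend for the same GPU.

```python
from dataclasses import dataclass

@dataclass
class KVCache:
    # Prompt tokens whose attention state has been computed and cached.
    tokens: list

def prefill_worker(prompt: str) -> KVCache:
    """Compute-bound phase: one pass over the full prompt (drives TTFT)."""
    return KVCache(tokens=prompt.split())

def decode_worker(kv: KVCache, max_new_tokens: int) -> list:
    """Memory-bandwidth-bound phase: one token per step (drives TPOT)."""
    out = []
    for i in range(max_new_tokens):
        tok = f"tok{i}"        # stand-in for sampling a token from the model
        out.append(tok)
        kv.tokens.append(tok)  # decode extends the transferred KV cache
    return out

# In a disaggregated deployment these two calls run on separate GPU pools,
# with the KV cache transferred between them.
kv = prefill_worker("Explain PD disaggregation")
completion = decode_worker(kv, max_new_tokens=3)
print(completion)  # ['tok0', 'tok1', 'tok2']
```

Because each phase gets its own hardware pool, prefill and decode capacity can be scaled independently to hit TTFT and TPOT targets.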
Power Up Qwen 3 with AMD Instinct: A Developer’s Day 0 Quickstart
- 28 April 2025
AMD is excited to announce Day 0 support for Alibaba’s latest large language models, Qwen3-235B, Qwen3-32B, and Qwen3-30B, on AMD Instinct™ MI300X GPU accelerators using vLLM and SGLang. In this blog, we show you how to accelerate Alibaba’s cutting-edge Qwen 3 language models, featuring advanced reasoning, multilingual capabilities, and agent functionality, using AMD Instinct™ MI300X GPUs. You will learn to deploy both dense and Mixture-of-Experts models with full support for vLLM and SGLang, leveraging AMD’s advanced GPU architecture for high-throughput, low-latency inference.
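As a rough sketch of what such a deployment looks like, the commands below launch an OpenAI-compatible server for a Qwen 3 model with vLLM or SGLang. The model ID and parallelism settings are illustrative assumptions, not values from this post; consult the vLLM and SGLang documentation for the flags appropriate to your hardware.

```shell
# Serve Qwen3-32B with vLLM (tensor parallelism across 8 MI300X GPUs
# is an example setting, adjust to your node).
vllm serve Qwen/Qwen3-32B --tensor-parallel-size 8

# Or serve the same model with SGLang.
python -m sglang.launch_server --model-path Qwen/Qwen3-32B --tp 8
```

Either server exposes an OpenAI-compatible HTTP endpoint, so existing client code can point at it with only a base-URL change.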