Posts by Haichen Zhang
Accelerating Large-Scale LLM Inference on AMD Instinct MI350X/MI355X with Eagle3 and AMD Quark
- 03 July 2026
Large language model (LLM) inference is increasingly constrained by autoregressive decoding. Even when prefill is highly optimized, the decode phase still generates tokens one step at a time, and each step typically requires running the full target model. For large mixture-of-experts and attention-heavy models such as Kimi-K2.5 and MiniMax-M2.5, this sequential pattern limits serving throughput and increases latency for real-time applications.