Posts by Chun Fang
Accelerating Large-Scale LLM Inference on AMD Instinct MI350X/MI355X with Eagle3 and AMD Quark
- 03 July 2026
Large language model (LLM) inference is increasingly constrained by autoregressive decoding. Even when prefill is highly optimized, the decode phase still generates tokens one step at a time, and each step typically requires running the full target model. For large mixture-of-experts and attention-heavy models such as Kimi-K2.5 and MiniMax-M2.5, this sequential pattern limits serving throughput and increases latency for real-time applications.
Scaling AI Inference Performance with vLLM on AMD Instinct MI355X GPUs
- 08 December 2025
Today, we are excited to share large language model (LLM) inference performance results with vLLM on AMD Instinct™ MI355X GPUs. Whether you are a startup, an enterprise, or a hyperscaler, the AMD open software ecosystem with Instinct MI355X GPUs delivers consistent, high-performance inference at scale, outperforming Nvidia Blackwell B200 GPUs as concurrency grows. For real-world users, this performance translates directly into better user experience and cost efficiency in production environments.