Posts by Eveline Chen

Serve Kimi-K2.5-MXFP4 on MI355X with ATOM

23 July 2026

In our previous Kimi-K2.5 posts, we moved from fused MoE optimization with FlyDSL [8] to W4A8 and W8A8 quantization with AMD Quark [9]. Those posts focused on kernel and quantization work: how to make the dominant MoE path faster, and how to trade off INT4, INT8, and FP8 formats on AMD Instinct™ MI300X and MI325X GPUs.

Read more ...

When a Faster Kernel Doesn’t Speed Up Serving: Profiling FP8 KV Cache on AMD Instinct MI308X

15 July 2026

This case study starts with a result that didn’t add up. We enabled FP8 KV cache (--kv-cache-dtype fp8_e4m3) on a Kimi-K2.5-W4A8 (MoE + MLA) deployment on 8× AMD Instinct MI308X. At first glance, the trace looked encouraging: the MLA decode kernel took 34% less time per call [1] than the BF16 baseline, dropping from 0.190 ms/call to 0.125 ms/call, and several existing categories, such as GEMM, communication, and elementwise work, moved slightly lower.

Read more ...

Faster Kimi-K2.5-W4A8 Decoding with EAGLE3 on AMD Instinct™ MI325X

23 June 2026

In our previous blog [7], we deployed Kimi-K2.5 [1] in W4A8 (INT4 weights + INT8 activations) on AMD Instinct™ MI325X, replacing the BF16 MFMA path in the fused MoE kernel with the INT8 MFMA implementation from FlyDSL [2]. The remaining bottleneck is the autoregressive nature of decoding itself: even with INT8 MFMA and INT4 weights, the framework still runs one full forward pass per generated token.

Read more ...

Further Accelerating Kimi-K2.5 on AMD Instinct™ MI325X: W4A8 & W8A8 Quantization with AMD Quark

14 May 2026

In our previous blog [7], we demonstrated how to accelerate Kimi-K2.5 [1] inference on AMD Instinct™ GPUs by profiling the model, identifying fused_moe as the dominant bottleneck (consuming 88–90% of GPU time), and replacing the default Triton-based kernel with a mixed-precision (BF16 + W4A16) fused MoE implementation powered by FlyDSL [2].

Read more ...

Accelerating Kimi-K2.5 on AMD Instinct™ MI300X: Optimizing Fused MoE with FlyDSL

24 March 2026

With the recent surge in popularity of OpenClaw [1], its officially recommended model, Kimi-K2.5 [2], has taken the AI community by storm. As developers and researchers flock to this powerful Mixture-of-Experts (MoE) LLM, the need for high-performance inference on cutting-edge hardware has never been more critical.

Read more ...