Posts by Felix Li

Accelerating LLM Inference on AMD GPUs with Low-Latency GEMMs

29 June 2026

Large language model inference is becoming increasingly interactive. Users expect chatbots, coding assistants, agents, and real-time copilots to respond quickly, stream tokens smoothly, and stay responsive under concurrent load. In that setting, decode-time latency is not just a backend metric. It directly affects perceived quality.

Getting Started with FlyDSL Nightly Wheels on ROCm

20 April 2026

In the previous post on FlyDSL, we introduced the motivation behind FlyDSL and how it enables Python-native GPU kernel development using the AMD ROCm™ software stack. FlyDSL combines the flexibility of Python with the performance of MLIR and LLVM-based compilation, allowing developers to write GPU kernels in Python while targeting modern AMD hardware.

Accelerating Kimi-K2.5 on AMD Instinct™ MI300X: Optimizing Fused MoE with FlyDSL

24 March 2026

With the recent surge in popularity of OpenClaw [1], its officially recommended model, Kimi-K2.5 [2], has taken the AI community by storm. As developers and researchers flock to this powerful Mixture-of-Experts (MoE) LLM, the need for high-performance inference on cutting-edge hardware has never been more critical.

FlyDSL: Expert GPU Kernel Development with the Ease of MLIR Python Native DSL on AMD GPUs

20 February 2026

30 March 2026

The AMD ROCm™ software ecosystem continues to grow rapidly as developers build new kernels, compilers, and AI frameworks optimized for AMD GPUs. As workloads become more complex and the demand for both performance and agility increases, a clear need has emerged for a modern, flexible, and open GPU kernel authoring framework.
