Posts by Hattie Wu

Scaling MiniMax-M3 Inference with Distributed Serving and Operator Co-Design on AMD Instinct MI355X GPUs

21 July 2026

This blog walks you through a set of MiniMax-M3 inference optimizations on AMD Instinct™ MI355X GPUs using ATOM, AITER, and ATOMesh. For broader background on the inference engine and distributed serving layers used here, see the ROCm blogs on ATOM and ATOMesh.

Efficient MiniMax-M3 Inference on AMD Instinct GPUs with ATOM and ATOMesh

21 July 2026

MiniMax-M3 is a natively multimodal Mixture-of-Experts (MoE) foundation model with 428 billion total parameters, of which 22 billion are activated per token. Released in June 2026, M3 is the first open-weight model to combine frontier coding and agentic capabilities, a 1-million-token context window, and native text, image, and video understanding within a single architecture. Its key innovation, MiniMax Sparse Attention (MSA), replaces standard quadratic attention with a block-level KV cache selection mechanism, reducing per-token compute at 1M context to 1/20th of that of its predecessor.

SGLang-ATOM: Bring ROCm-Native Acceleration to SGLang Serving

08 July 2026

Large language model serving teams often face two competing goals: keeping the flexibility and developer velocity of an ecosystem serving framework, while also reaching strong throughput, latency, and cost efficiency on production accelerators. In this blog, you will explore how SGLang-ATOM bridges these needs for AMD Instinct GPUs by connecting the SGLang serving experience with ATOM’s ROCm-native execution path.

Accelerating LLM Inference on AMD GPUs with Low-Latency GEMMs

29 June 2026

Large language model inference is becoming increasingly interactive. Users expect chatbots, coding assistants, agents, and real-time copilots to respond quickly, stream tokens smoothly, and stay responsive under concurrent load. In that setting, decode-time latency is not just a backend metric. It directly affects perceived quality.

ATOMesh: Unlocking AMD Hardware for Scalable LLM Serving

16 June 2026

Large language model serving is moving from single-engine optimization to full-stack distributed inference. Production deployments must handle high concurrency, long-context prefill, latency-sensitive decode, KV cache store pressure, and multi-node GPU utilization at the same time. On AMD Instinct GPUs, the key opportunity is to connect ROCm-native kernels, communication libraries, inference engines, and distributed orchestration into one scalable serving stack.

ATOM: Unlocking Extreme AMD Instinct Inference with Software-Hardware Co-Optimization

15 June 2026

As LLM serving enters a phase defined by high concurrency, long-context workloads, sparse MoE activation, and multi-GPU deployment, the challenge is no longer basic functionality but sustaining peak efficiency on AMD GPUs under production-scale load. ATOM (AiTer Optimized Model) is built for that goal, following four core principles: system-level optimization for LLM inference on AMD Instinct™ GPUs, kernel-level acceleration through AITER, distributed inference scaling with MORI, and a rollout-engine path for RL workloads. It builds on earlier ROCm blog coverage of AITER and vLLM-ATOM, moving from kernel and plugin acceleration into the standalone ATOM inference engine. Rather than a generic framework adapted to ROCm™ software, ATOM is an execution engine designed with ROCm-first priorities, AITER-native operators, and deep optimization on the inference-critical path. Aligned with the AMD Instinct roadmap from single-node optimization to multi-node scale-out, ATOM evolves its architecture, kernel strategy, and distributed execution model in lockstep with each hardware generation.

vLLM-ATOM: Unlocking Native AMD Performance in the vLLM Ecosystem

07 May 2026

This blog walks you through vLLM-ATOM, the AMD-optimized plugin that supercharges vLLM on Instinct GPUs.
