Carlus Huang

Carlus is a Principal Member of Technical Staff in AMD’s AI Group, focusing on AI software architecture for the ROCm ecosystem. He has over 10 years of experience in GPU software enablement and performance optimization on AMD platforms.

Posts by Carlus Huang

July 21, 2026

Scaling MiniMax-M3 Inference with Distributed Serving and Operator Co-Design on AMD Instinct MI355X GPUs

Optimize MiniMax-M3 inference on AMD Instinct™ MI355X GPUs with ATOM online quantization, AITER sparse attention, FP8 KV cache, and EAGLE3.

https://rocm.blogs.amd.com/software-tools-optimization/minimax-m3-mi355/README.html

July 08, 2026

SGLang-ATOM: Bring ROCm-Native Acceleration to SGLang Serving

Explore how SGLang-ATOM connects SGLang serving applications with ROCm-native ATOM execution to accelerate LLM inference on AMD Instinct GPUs.

https://rocm.blogs.amd.com/software-tools-optimization/atom-sglang-inference/README.html

June 29, 2026

Accelerating LLM Inference on AMD GPUs with Low-Latency GEMMs

Learn how FlyDSL low-latency GEMMs speed up LLM decode on AMD GPUs with Split-K, K-slice parallelism, and an LDS-based pipeline.

https://rocm.blogs.amd.com/software-tools-optimization/accelerating-llm-inference-on-amd-gpus-with-low-latency-gemms/README.html

June 24, 2026

DP Attention and TBO for DeepSeek-V4 on MI355X

Learn how ATOM improves DeepSeek-V4 inference on AMD Instinct MI355X GPUs with DP Attention scheduling and Two-Batch Overlap.

https://rocm.blogs.amd.com/software-tools-optimization/atom-optimiztion/README.html

June 15, 2026

ATOM: Unlocking Extreme AMD Instinct Inference with Software-Hardware Co-Optimization

A technical walkthrough of ATOM on AMD Instinct GPUs, covering architecture, feature scope, model coverage, and practical benchmark dashboard usage.

https://rocm.blogs.amd.com/software-tools-optimization/atom-inference-engine/README.html

May 07, 2026

vLLM-ATOM: Unlocking Native AMD Performance in the vLLM Ecosystem

Use ATOM as an out-of-tree vLLM plugin to keep vLLM compatibility while enabling AMD-optimized attention, model execution, and multi-model support including Kimi-K2.5.

https://rocm.blogs.amd.com/software-tools-optimization/vllm-atom/README.html

April 20, 2026

Getting Started with FlyDSL Nightly Wheels on ROCm

A practical guide to installing and using FlyDSL nightly wheels on ROCm for fast, Python-native GPU kernel development.

https://rocm.blogs.amd.com/software-tools-optimization/flydsl-nightly-wheel/README.html

February 20, 2026

FlyDSL: Expert GPU Kernel Development with the Ease of MLIR Python Native DSL on AMD GPUs

FlyDSL is a Python-first, MLIR-native DSL for expert GPU kernel development and tuning on AMD GPUs.

https://rocm.blogs.amd.com/software-tools-optimization/flydsl-python-native/README.html

February 17, 2026

Adaptive Top-K Selection: Eliminating Performance Cliffs Across All K Values on AMD GPUs

Explore adaptive Top-K on MI300X! See how auto-selection and hardware optimizations like DPP and double buffering drive peak efficiency.

https://rocm.blogs.amd.com/software-tools-optimization/adaptive-topk/README.html

November 12, 2025

Practical, Fault‑Robust Distributed Inference for DeepSeek on AMD MI300X

Learn how a small-radius expert parallel design with prefill–decode disaggregation enables scalable, fault-isolated LLM inference on AMD Instinct™ MI300X clusters.

https://rocm.blogs.amd.com/software-tools-optimization/wide-ep-deepseek/README.html

September 30, 2025

Matrix Core Programming on AMD CDNA™3 and CDNA™4 architecture

This blog post explains how to use Matrix Cores on the CDNA3 and CDNA4 architectures, with a focus on low-precision data types such as FP16, FP8, and FP4.

https://rocm.blogs.amd.com/software-tools-optimization/matrix-cores-cdna/README.html

March 21, 2025

AITER: AI Tensor Engine For ROCm

We introduce AMD's AI Tensor Engine for ROCm (AITER), our centralized high-performance AI operators repository, designed to significantly accelerate AI workloads on AMD GPUs.

https://rocm.blogs.amd.com/software-tools-optimization/aiter-ai-tensor-engine/README.html