Posts by Ke Wang

Serving NVFP4 Models on AMD Instinct™ MI355 Accelerators

13 July 2026

NVFP4 is an increasingly common deployment format: NVIDIA, AMD, and the open-source community have published NVFP4-quantized checkpoints of frontier models such as moonshotai/Kimi-K2.6, and many users want to deploy these checkpoints directly. AMD Instinct™ MI355 is built on the CDNA4 architecture, which has no native NVFP4 tensor execution path, so these checkpoints could not previously be served on MI355 without an expensive offline conversion to a different format.

QuickReduce INT3 Quantization and Benchmarking on MI355

13 July 2026

Large Language Models (LLMs) typically contain billions — or even tens of billions — of parameters. During inference, tensor parallelism (TP) is a widely used technique that distributes the compute across multiple GPUs. This approach, however, requires frequent, large-scale data synchronization between layers, introducing significant communication latency and placing enormous pressure on interconnect bandwidth.

Accelerating Diffusers and xDiT Image Generation with MXFP4 using AMD Quark on AMD Instinct™ MI350 GPUs

06 July 2026

Diffusion models such as Black Forest Labs’ FLUX.1-dev [1] deliver stunning image quality but demand significant compute and memory bandwidth at inference time. To reduce inference cost without sacrificing image quality, precision-aware quantization techniques have become a critical optimization strategy.

QuickReduce FP4 Quantization and Benchmarking on MI355

20 May 2026

Large Language Models (LLMs) typically contain billions — or even tens of billions — of parameters. During inference, tensor parallelism is commonly employed to distribute the workload across multiple GPUs. This approach demands frequent, large-scale data synchronization between layers, introducing significant communication latency and placing enormous pressure on interconnect bandwidth.

High-Accuracy MXFP4, MXFP6, and Mixed-Precision Models on AMD GPUs

29 October 2025

Low-bit quantization has become increasingly important for large language models (LLMs): as model sizes reach hundreds of billions of parameters, balancing efficiency and accuracy is critical. AMD Quark, the model optimization toolkit from AMD, offers cross-platform optimized models for accurate low-bit deployment. Building on the concepts introduced in our previous blog, this post focuses on MXFP4 and MXFP6 low-precision quantization for LLMs and demonstrates how to use Quark to compress them for accurate and efficient deployment on AMD Instinct™ MI355 GPUs.

QuickReduce: Up to 3x Faster All-reduce for vLLM and SGLang

26 August 2025

Advances in large language models (LLMs) have led to significant performance breakthroughs across various domains, especially in natural language processing. LLMs typically consist of billions of parameters, which creates substantial computational, storage, and deployment challenges. Inter-GPU communication overhead often emerges as a key bottleneck limiting overall system performance. In tensor-parallel setups, every layer requires frequent all-reduce operations that synchronize large amounts of data across GPUs. This introduces significant latency and strains interconnect bandwidth.