Posts by Bowen Bao

Advanced MXFP4 Quantization: Combining Fine-Tuned Rotations with SmoothQuant for Near-Lossless Compression

As language models continue to grow in popularity, reducing the cost of inference and accelerating model serving have become key challenges. Quantization offers a powerful solution: it shrinks model size and leverages inexpensive math operations, for example through low-bitwidth formats such as OCP MXFP4 (4.25 bits per element), supported on AMD Instinct MI350X and MI355X accelerators.
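For context on the 4.25-bit figure: in the OCP MX block layout, every 32 FP4 (E2M1) elements share one 8-bit (E8M0) power-of-two scale, so storage works out to (32 × 4 + 8) / 32 = 4.25 bits per element. The snippet below is a minimal PyTorch sketch of the SmoothQuant transform the title refers to (the helper name `smoothquant_scales` is hypothetical, and this is not Quark's implementation): per-channel scales migrate activation outliers into the weights while leaving the layer's output unchanged.

```python
import torch

def smoothquant_scales(x_absmax, w_absmax, alpha=0.5):
    # SmoothQuant (Xiao et al.): per-input-channel smoothing factors
    # s_j = max|X_j|^alpha / max|W_j|^(1 - alpha)
    return (x_absmax ** alpha) / (w_absmax ** (1 - alpha)).clamp(min=1e-8)

# Toy shapes: X is (tokens, in_features), W is (in_features, out_features).
X = torch.randn(8, 16)
X[:, 3] *= 50.0                 # inject an activation outlier channel
W = torch.randn(16, 4)

s = smoothquant_scales(X.abs().amax(dim=0), W.abs().amax(dim=1))
X_smooth = X / s                # activations become easier to quantize
W_smooth = W * s[:, None]       # outlier magnitude migrates into the weights

# The transform is mathematically lossless before quantization is applied.
assert torch.allclose(X @ W, X_smooth @ W_smooth, atol=1e-3)
```

Quantizing `X_smooth` and `W_smooth` instead of `X` and `W` typically loses less accuracy, because extreme activation channels no longer dominate the quantization range; the fine-tuned rotations in the title play a complementary, outlier-spreading role.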

Read more ...


High-Accuracy MXFP4, MXFP6, and Mixed-Precision Models on AMD GPUs

Low-bit quantization has become increasingly important for large language models (LLMs): with model sizes reaching hundreds of billions of parameters, balancing efficiency and accuracy is critical. AMD Quark, the model optimization toolkit from AMD, offers cross-platform optimized models for accurate low-bit deployment. Building on the concepts introduced in our previous blog, this post focuses on MXFP4 and MXFP6 low-precision quantization techniques for large language models and demonstrates how to use Quark to compress LLMs for accurate, efficient deployment on AMD Instinct™ MI355 GPUs.
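To make the MX formats concrete, here is a minimal fake-quantization sketch of one MXFP4 block (the helper `quantize_mxfp4_block` is a hypothetical name for illustration, not Quark's internal code): 32 elements share a power-of-two scale, and each element is rounded onto the 4-bit E2M1 grid {0, 0.5, 1, 1.5, 2, 3, 4, 6}. MXFP6 uses the same block layout with 6-bit (E2M3 or E3M2) elements.

```python
import torch

# Representable magnitudes of the FP4 E2M1 element format used by MXFP4.
E2M1_GRID = torch.tensor([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def quantize_mxfp4_block(block):
    """Fake-quantize one 32-element block to MXFP4 (illustrative only)."""
    assert block.numel() == 32
    amax = block.abs().max()
    # Shared power-of-two (E8M0) scale; ceil guarantees |block / scale| <= 6,
    # so no clipping is needed (the OCP reference algorithm differs slightly).
    scale = 2.0 ** torch.ceil(torch.log2(amax / 6.0)) if amax > 0 else torch.tensor(1.0)
    scaled = block / scale
    # Round each element to the nearest representable E2M1 magnitude.
    idx = (scaled.abs().unsqueeze(-1) - E2M1_GRID).abs().argmin(dim=-1)
    return torch.sign(scaled) * E2M1_GRID[idx] * scale

x = torch.randn(32)
x_q = quantize_mxfp4_block(x)
print(f"max abs error: {(x - x_q).abs().max().item():.4f}")
```

A production flow like Quark layers calibration and quantization algorithms on top of this basic rounding step to recover accuracy at 4 and 6 bits.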

Read more ...


Technical Dive into AMD’s MLPerf Inference v5.1 Submission

In the rapidly evolving landscape of artificial intelligence, the demand for reliable and efficient model inference has never been greater. With advancements in large language models (LLMs) and a growing reliance on real-time applications, benchmarks are critical for evaluating how well AI systems perform under varying conditions. Enter MLPerf Inference: Datacenter v5.1, a significant update to the well-respected benchmarking suite that assesses inference performance across a wide array of models and use cases, with a particular focus on data centers.

Read more ...