Posts by Zhaofeng Zhang

Advanced MXFP4 Quantization: Combining Fine-Tuned Rotations with SmoothQuant for Near-Lossless Compression

As language models continue to grow in popularity, reducing the cost of inference and accelerating model serving have become key challenges. Quantization addresses both by reducing model size and enabling inexpensive math operations, for example with low-bitwidth formats such as OCP MXFP4 (4.25 bits per value), supported on AMD Instinct MI350X and MI355X accelerators.

Read more ...


High-Accuracy MXFP4, MXFP6, and Mixed-Precision Models on AMD GPUs

Low-bit quantization has become increasingly important for large language models (LLMs): as model sizes reach hundreds of billions of parameters, balancing efficiency and accuracy is critical. AMD Quark, the model optimization toolkit from AMD, offers cross-platform optimized models for accurate low-bit deployment. Building on the concepts introduced in our previous blog, this blog focuses on MXFP4 and MXFP6 low-precision quantization techniques for large language models and demonstrates how to use Quark to compress LLMs for accurate and efficient deployment on AMD Instinct™ MI355 GPUs.

Read more ...