Posts by Ashish Sirasao

Day 0 Developer Guide: hipBLASLt Offline GEMM Tuning Script

This blog post focuses on optimizing the performance of a real model using the QuickTune script, illustrated with an example of offline GEMM tuning for the Qwen model on an AMD MI308 GPU. Developed by the AMD Quark Team, the QuickTune script delivers significant GEMM performance improvements with minimal time overhead. QuickTune is an advanced tool for hipBLASLt offline GEMM tuning. It allows users to complete offline tuning with one click, instead of using hipblaslt-bench to tune the model manually.

Read more ...


High-Accuracy MXFP4, MXFP6, and Mixed-Precision Models on AMD GPUs

Low-bit quantization has become increasingly important for large language models (LLMs), as model sizes reach hundreds of billions of parameters, where balancing efficiency and accuracy is critical. AMD Quark, the model optimization toolkit from AMD, offers cross-platform optimized models for accurate low-bit model deployment. Building on the concepts we introduced in our previous blog, this blog focuses on MXFP4 and MXFP6 low-precision quantization techniques on large language models and demonstrates how to use Quark to compress LLMs for accurate and efficient deployment on AMD Instinct™ MI355 GPUs.

Read more ...


QuickReduce: Up to 3x Faster All-reduce for vLLM and SGLang

Advancements in large-scale language models (LLMs) have led to significant performance breakthroughs across various domains, especially in natural language processing. LLMs typically consist of billions of parameters, resulting in substantial computational, storage, and deployment challenges. Inter-GPU communication overhead often emerges as a key bottleneck limiting overall system performance. In tensor-parallel setups, every layer requires frequent all-reduce operations—synchronizing large amounts of data across GPUs. This introduces significant latency and strains interconnect bandwidth.

Read more ...