Posts by Jiangyong Ren

Advanced MXFP4 Quantization: Combining Fine-Tuned Rotations with SmoothQuant for Near-Lossless Compression

As language models grow in popularity, reducing inference cost and accelerating model serving have become key challenges. Quantization addresses both by shrinking the model and using cheaper math operations, for example with low-bitwidth formats such as OCP MXFP4 (4.25 bits per element: 4-bit values plus an 8-bit scale shared by each block of 32), supported on AMD Instinct MI350X and MI355X accelerators.

Read more ...


QuickReduce: Up to 3x Faster All-reduce for vLLM and SGLang

Advances in large language models (LLMs) have led to significant performance breakthroughs across many domains, especially natural language processing. LLMs typically have billions of parameters, which creates substantial computational, storage, and deployment challenges. Inter-GPU communication overhead often becomes a key bottleneck for overall system performance. In tensor-parallel setups, every layer requires frequent all-reduce operations that synchronize large amounts of data across GPUs, adding significant latency and straining interconnect bandwidth.

Read more ...