Posts tagged Systems

QuickReduce INT3 Quantization and Benchmarking on MI355

13 July 2026

Large Language Models (LLMs) typically contain billions — or even tens of billions — of parameters. During inference, tensor parallelism (TP) is a widely used technique that distributes the compute across multiple GPUs. This approach, however, requires frequent, large-scale data synchronization between layers, introducing significant communication latency and placing enormous pressure on interconnect bandwidth.

QuickReduce FP4 Quantization and Benchmarking on MI355

20 May 2026

Large Language Models (LLMs) typically contain billions — or even tens of billions — of parameters. During inference, tensor parallelism is commonly employed to distribute the workload across multiple GPUs. This approach demands frequent, large-scale data synchronization between layers, introducing significant communication latency and placing enormous pressure on interconnect bandwidth.

ROCm Revisited: Evolution of the High-Performance GPU Computing Ecosystem

06 June 2025

09 June 2025

This blog is part of our ROCm Revisited series [1]. The purpose of this series is to share the story of ROCm and our journey through the changes and successes we’ve achieved over the past few years. We’ll explore the key milestones in our development, the innovative technologies that have propelled us forward, and the challenges we’ve overcome to establish our leadership in the world of GPU computing.

ROCm Runfile Installer Is Here!

22 May 2025

Starting with ROCm 6.4, and after much user demand, we are introducing the ROCm Runfile Installer. It is intended primarily for network-secured environments, for users who wish to bypass the native Linux package management system, and for those who just want to download and run a single file to install ROCm.
