Accelerating LLM Inference: Up to 3x Speedup on MI300X with Speculative Decoding
- 27 March 2025
In this blog you will learn how speculative decoding boosts LLM inference, providing out-of-the-box speedups in LLM token generation on the AMD Instinct™ MI300X GPU. We start with a brief overview of speculative decoding. We then demonstrate, through extensive benchmarking on a number of LLMs and datasets across two frameworks, vLLM and native PyTorch (gpt-fast), that speculative decoding delivers 1.3x - 3x higher generation throughput (tokens/second) than a vanilla LLM at batch size 1. We also show how these speedups vary for batch sizes greater than 1 in vLLM. Finally, we share a detailed profiling-based case study that highlights high-level differences between the two frameworks, namely the types of kernels they launch and those kernels’ overall latencies, which are critical differentiators in the frameworks’ performance. Let’s get started!
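To give a concrete sense of how little setup this takes, here is a minimal sketch of enabling speculative decoding through vLLM’s offline `LLM` API. The model pair (a small Llama draft model proposing tokens for a larger target) and the `num_speculative_tokens` value are illustrative assumptions, not the exact configuration benchmarked in this post; note also that newer vLLM releases move these arguments into a `speculative_config` dictionary.

```python
# Minimal sketch: speculative decoding with vLLM's offline LLM API.
# The model names and num_speculative_tokens below are illustrative
# assumptions, not the configuration benchmarked in this post.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",               # target model
    speculative_model="meta-llama/Llama-3.2-1B-Instruct",   # small draft model
    num_speculative_tokens=5,  # draft tokens proposed per decoding step
)

params = SamplingParams(temperature=0.0, max_tokens=128)
outputs = llm.generate(["The key idea behind speculative decoding is"], params)
print(outputs[0].outputs[0].text)
```

Because the target model verifies the draft tokens in a single forward pass, accepted tokens come essentially for free, which is where the generation-throughput gains discussed below come from.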