Posts by Dominic Widdows
Benchmarking Reasoning Models: From Tokens to Answers
- 24 July 2025
This post shows you how to benchmark large language models on reasoning tasks by distinguishing between mere token generation and genuine problem-solving. You will learn why it matters to configure models like Qwen3 with “thinking mode” enabled, how standard benchmarks can produce misleading results, why reasoning requires more than generating tokens quickly, and how to build evaluations that reflect a model’s true problem-solving capabilities. Sounds interesting? Let’s dive right in!