Posts by Ye Hur Cheong
Accelerating Multimodal Inference in vLLM: The One-Line Optimization for Large Multimodal Models
- 02 January 2026
Deploying multimodal models like Qwen3-VL or InternVL at scale reveals a hidden bottleneck. While Tensor Parallelism (TP) is essential for massive language decoders, it is often overkill for vision encoders. These encoders are typically small, often just 1-5% of total model size, so sharding them yields little compute benefit; yet under TP they still pay an expensive all-reduce after every single layer.
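For context, here is a minimal sketch of what the fix looks like in vLLM's offline API, assuming a recent vLLM build that exposes the `mm_encoder_tp_mode` engine argument; the checkpoint name is illustrative and the post's actual one-liner may differ:

```python
# A sketch, not the post's exact recipe: shard the large language
# decoder with TP while running the small vision encoder data-parallel
# (replicated on every rank), avoiding its per-layer all-reduce.
# Assumes a recent vLLM build exposing `mm_encoder_tp_mode`.
from vllm import LLM

llm = LLM(
    model="Qwen/Qwen3-VL-8B-Instruct",  # illustrative model id
    tensor_parallel_size=8,             # shard the decoder across 8 GPUs
    mm_encoder_tp_mode="data",          # replicate the vision encoder
)
```

Because the encoder is only a few percent of the weights, replicating it costs little extra memory per rank while eliminating its per-layer communication entirely.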
The vLLM MoE Playbook: A Practical Guide to TP, DP, PP and Expert Parallelism
- 24 November 2025
Deploying large Mixture-of-Experts (MoE) models like DeepSeek-R1 efficiently isn’t just about having enough GPUs; it’s about choosing the right parallelism strategy. The wrong choice can leave KV caches duplicated across ranks, consuming 8× the memory, or add communication overhead that cuts throughput in half. The right choice unlocks significantly better performance for your specific workload.
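As a hedged illustration of that trade-off, here are two ways to spend the same eight GPUs with vLLM's offline API. `tensor_parallel_size`, `data_parallel_size`, and `enable_expert_parallel` are real vLLM engine arguments, but their availability and interaction vary by version, so verify against your installed release:

```python
# Two ways to spend 8 GPUs on a MoE model; a sketch assuming recent
# vLLM engine arguments (check your version before relying on them).
from vllm import LLM

MODEL = "deepseek-ai/DeepSeek-R1"
STRATEGY = "dp_ep"  # pick one: "tp" or "dp_ep" (both at once would OOM)

if STRATEGY == "tp":
    # Option A: pure tensor parallelism. Simple, but DeepSeek's MLA
    # KV cache is replicated on every TP rank (the 8x memory cost the
    # teaser mentions), and each layer pays an 8-way all-reduce.
    llm = LLM(model=MODEL, tensor_parallel_size=8)
else:
    # Option B: data-parallel attention plus expert parallelism. Each
    # rank keeps its own KV cache for its own requests, and expert
    # weights are distributed across GPUs rather than sharded,
    # trading all-reduce for all-to-all token routing.
    llm = LLM(
        model=MODEL,
        data_parallel_size=8,
        enable_expert_parallel=True,
    )
```

Which option wins depends on the workload: Option A favors low-latency, small-batch serving, while Option B favors high-throughput serving where KV-cache capacity is the binding constraint.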