Posts by Seungrok Jung
Optimizing DeepseekV3 Inference on SGLang Using ROCm Profiling Tools
- 01 May 2025
As LLMs grow in size and complexity, ensuring proper utilization of compute resources becomes of prime importance. Performance profiling and kernel-level analysis are essential techniques for diagnosing runtime bottlenecks such as excessive GPU time, memory-bound operations, and inefficient device-host memory transfers. Profiling tools like RocmProfileData (RPD) and TorchProfiler (PyTorch Profiler) give developers granular insight into kernel execution timelines, data movement patterns, and computational hotspots. In this blog, we delve into how profiling and kernel diagnostics can expose inefficiencies in components like attention mechanisms and Mixture-of-Experts (MoE) layers, and guide targeted optimizations at the kernel level.
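For a taste of the workflow, here is a minimal PyTorch Profiler sketch; the toy linear layer, shapes, and output path are illustrative stand-ins, not the DeepseekV3 setup from the post:

```python
import torch
from torch.profiler import profile, ProfilerActivity

# Illustrative stand-in for a model layer; the post profiles DeepseekV3
# served via SGLang, not this toy module.
layer = torch.nn.Linear(4096, 4096).cuda()
x = torch.randn(8, 4096, device="cuda")

# On ROCm builds of PyTorch, the CUDA activity captures HIP kernels.
with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    for _ in range(10):
        layer(x)

# Rank kernels by accumulated GPU time to find hotspots.
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))
prof.export_chrome_trace("trace.json")  # view the timeline in chrome://tracing or Perfetto
```

RPD provides a comparable kernel-timeline view through its own tracing hooks; the post covers that side of the tooling.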
Power Up Qwen 3 with AMD Instinct: A Developer’s Day 0 Quickstart
- 28 April 2025
AMD is excited to announce Day 0 support for Alibaba's latest Large Language Models, Qwen3-235B, Qwen3-32B, and Qwen3-30B, on AMD Instinct™ MI300X GPU accelerators using vLLM and SGLang. In this blog, we show you how to accelerate Alibaba's cutting-edge Qwen 3 language models, featuring advanced reasoning, multilingual capabilities, and agent functionality, using AMD Instinct™ MI300X GPUs. You will learn to deploy dense and Mixture-of-Experts models with full support for vLLM and SGLang, leveraging AMD's advanced GPU architecture for high-throughput, low-latency inference.
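As a quick preview, here is a minimal vLLM offline-inference sketch; the checkpoint name and tensor_parallel_size are placeholders, so substitute the Qwen3 variant and parallelism that fit your node:

```python
from vllm import LLM, SamplingParams

# Placeholder checkpoint and parallelism; pick the Qwen3 variant and
# tensor_parallel_size appropriate for your MI300X node.
llm = LLM(model="Qwen/Qwen3-32B", tensor_parallel_size=1)

params = SamplingParams(temperature=0.7, max_tokens=256)
outputs = llm.generate(["Explain mixture-of-experts in one paragraph."], params)
for out in outputs:
    print(out.outputs[0].text)
```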
Boosting Llama 4 Inference Performance with AMD Instinct MI300X GPUs
- 28 April 2025
In our previous blog post, we explored how to deploy Llama 4 using AMD Instinct™ MI300X GPUs with vLLM. We also highlighted that MI300X and MI325X GPUs are capable of running the full 400B-parameter Llama 4 Maverick model in BF16 precision on a single node, significantly reducing infrastructure complexity. Their substantial HBM memory capacity further supports extended context lengths, enabling high throughput and efficient model execution.
Power Up Llama 4 with AMD Instinct: A Developer’s Day 0 Quickstart
- 06 April 2025
AMD is excited to announce Day 0 support for Meta's latest multimodal intelligence models, Llama 4 Maverick and Scout, on our AMD Instinct™ MI300X and MI325X GPU accelerators using vLLM. In this blog, we walk you through a step-by-step guide to deploying Meta's Llama 4 models with vLLM, covering Docker setup, dependencies, and inference testing.
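As a rough sketch of the inference-testing step, the snippet below queries a vLLM OpenAI-compatible server assumed to be running locally; the model id, port, and launch command are illustrative, and the post itself covers the full Docker setup:

```python
# Assumes a vLLM server was launched separately, e.g. (illustrative command):
#   vllm serve meta-llama/Llama-4-Scout-17B-16E-Instruct --tensor-parallel-size 8
from openai import OpenAI

# vLLM's OpenAI-compatible endpoint; the port and api_key are placeholders.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
resp = client.chat.completions.create(
    model="meta-llama/Llama-4-Scout-17B-16E-Instruct",
    messages=[{"role": "user", "content": "Summarize Llama 4 in one sentence."}],
    max_tokens=64,
)
print(resp.choices[0].message.content)
```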
Supercharge DeepSeek-R1 Inference on AMD Instinct MI300X
- 21 March 2025
Our previous blog post on this topic discussed how DeepSeek-R1 achieves competitive performance on AMD Instinct™ MI300X GPUs. We also included performance comparisons against Nvidia H200 GPUs and a short demo application illustrating real-world usage. In this blog, we delve into how the SGLang framework, critical kernel optimizations like the AI Tensor Engine for ROCm™, and hyperparameter tuning help achieve performance boosts.
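For readers who want to try SGLang right away, here is a minimal smoke test against a locally launched server; the launch command, model path, TP degree, and port are illustrative defaults, not the tuned configuration from the post:

```python
# Assumes an SGLang server was started separately, e.g. (illustrative command):
#   python -m sglang.launch_server --model-path deepseek-ai/DeepSeek-R1 --tp 8 --trust-remote-code
import requests

# SGLang's native /generate endpoint; 30000 is its default port.
resp = requests.post(
    "http://localhost:30000/generate",
    json={
        "text": "What is the capital of France?",
        "sampling_params": {"temperature": 0.6, "max_new_tokens": 128},
    },
)
print(resp.json()["text"])
```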