Software tools & optimizations - Page 2
Discover the latest blogs about ROCm software tools, libraries, and performance optimizations to help you get the most out of your AMD hardware.
Reimagining GPU Allocation in Kubernetes: Introducing the AMD GPU DRA Driver
Explore how the AMD GPU DRA Driver brings declarative, attribute-aware GPU scheduling to Kubernetes, and learn how to request and manage GPUs natively.
Introducing the AMD Network Operator v1.0.0: Simplifying High-Performance Networking for AMD Platforms
Introducing the AMD Network Operator for automating high-performance AI NIC networking in Kubernetes for AI and HPC workloads.
Accelerating Multimodal Inference in vLLM: The One-Line Optimization for Large Multimodal Models
Learn how to optimize multimodal model inference with batch-level data parallelism for vision encoders in vLLM, achieving up to 45% throughput gains on AMD MI300X.
GEAK HIP: Expanding GEAK for HIP Code Optimization
Explore the GEAK framework's AI-driven HIP code optimization for improved performance on AMD GPUs, including speedup examples and benefits for AI workloads.
Getting Started with AMD AI Workbench: Deploying and Managing AI Workloads
Learn how to deploy and manage AI workloads with AMD AI Workbench, a low-code interface for developers to manage AI inference deployments.
MoE Training Best Practices on AMD GPUs
Learn how to optimize Mixture-of-Experts (MoE) model training on AMD Instinct GPUs with ROCm. Maximize your AI training performance now!
3D Scene Reconstruction from the Inside: Explore the Mathematics Behind gsplat
The vLLM MoE Playbook: A Practical Guide to TP, DP, PP and Expert Parallelism
Learn how to combine TP, DP, PP, and EP for MoE models. Discover proven strategies to maximize performance on your vLLM deployments.
Accelerating Vector Search: hipVS and hipRAFT on AMD
Learn how hipVS accelerates vector search on AMD Instinct GPUs, with notebook demos for semantic search, RAG, and recommendation systems.
Practical, Fault‑Robust Distributed Inference for DeepSeek on AMD MI300X
Learn how a small-radius expert parallel design with prefill–decode disaggregation enables scalable, fault-isolated LLM inference on AMD Instinct™ MI300X clusters.
Stability at Scale: AMD’s Full‑Stack Platform for Large‑Model Training
Primus streamlines LLM training on AMD GPUs with unified configs, multi-backend support, preflight validation, and structured logging.
High-Accuracy MXFP4, MXFP6, and Mixed-Precision Models on AMD GPUs
Learn to leverage AMD Quark for efficient MXFP4/MXFP6 quantization on AMD Instinct accelerators with high accuracy retention.