Software tools & optimizations - Page 4#
Discover the latest blogs about ROCm software tools, libraries, and performance optimizations to help you get the most out of your AMD hardware.
What's New in the AMD GPU Operator v1.2.0 Release
This blog highlights the new feature enhancements that were released as part of the AMD GPU Operator v1.2.0 release. New features that enhance the use of AMD Instinct GPUs on Kubernetes including Automated Upgrades, Health Checks and Open-sourcing the codebase.
Introducing ROCprofiler SDK - The Latest Toolkit for Performance Profiling
Discover ROCprofiler SDK – ROCm’s next-generation, unified, scalable, and high-performance profiling toolkit for AI and HPC workloads on AMD GPUs.
Introducing ROCprofiler SDK - The Latest Toolkit for Performance Profiling.
Discover ROCprofiler SDK – ROCm’s next-generation, unified, scalable, and high-performance profiling toolkit for AI and HPC workloads on AMD GPUs.
Speculative Decoding - Deep Dive
This blog shows the performance improvement achieved by applying speculative decoding with Llama models on AMD MI300X GPUs, tested across models, input sizes, and datasets.
Supercharge DeepSeek-R1 Inference on AMD Instinct MI300X
Learn how to optimize DeepSeek-R1 on AMD MI300X with SGLang, AITER kernels and hyperparameter tuning for up to 5× throughput and 60% lower latency over Nvidia H200
AITER: AI Tensor Engine For ROCm
We introduce AMD's AI Tensor Engine for ROCm (AITER), our centralized high performance AI operators repository, designed to significantly accelerate AI workloads on AMD GPUs
AI Inference Orchestration with Kubernetes on Instinct MI300X, Part 3
This blog is part 3 of a series aimed at providing a comprehensive, step-by-step guide for deploying and scaling AI inference workloads with Kubernetes and the AMD GPU Operator on the AMD Instinct platform
Optimized ROCm Docker for Distributed AI Training
AMD updated Docker images incorporate torchtune finetuning, FP8 support, single node performance boost, bug fixes & updated benchmarking for stable, efficient distributed training
Understanding RCCL Bandwidth and xGMI Performance on AMD Instinct™ MI300X
The blog explains the reasons behind RCCL bandwidth limitations and xGMI performance constraints, and provides actionable steps to maximize link efficiency on AMD MI300X
Measuring Max-Achievable FLOPs – Part 2
AMD measures Max-Achievable FLOPS through controlled benchmarking: real-world data patterns, thermally stable devices, and cold cache testing—revealing how actual performance differs from theoretical peaks.
How to Build a vLLM Container for Inference and Benchmarking
This post, the second in a series, provides a walkthrough for building a vLLM container that can be used for both inference and benchmarking.
AI Inference Orchestration with Kubernetes on Instinct MI300X, Part 2
This blog is part 2 of a series aimed at providing a comprehensive, step-by-step guide for deploying and scaling AI inference workloads with Kubernetes and the AMD GPU Operator on the AMD Instinct platform