AI - Software Tools & Optimizations - Page 2

Supercharge DeepSeek-R1 Inference on AMD Instinct MI300X
Learn how to optimize DeepSeek-R1 on AMD Instinct MI300X with SGLang, AITER kernels, and hyperparameter tuning for up to 5× the throughput and 60% lower latency compared with the NVIDIA H200

AITER: AI Tensor Engine For ROCm
We introduce AMD's AI Tensor Engine for ROCm (AITER), our centralized repository of high-performance AI operators, designed to significantly accelerate AI workloads on AMD GPUs

AI Inference Orchestration with Kubernetes on Instinct MI300X, Part 3
This blog is part 3 of a series aimed at providing a comprehensive, step-by-step guide for deploying and scaling AI inference workloads with Kubernetes and the AMD GPU Operator on the AMD Instinct platform

Optimized ROCm Docker for Distributed AI Training
AMD's updated Docker images incorporate torchtune fine-tuning, FP8 support, a single-node performance boost, bug fixes, and updated benchmarking for stable, efficient distributed training

Measuring Max-Achievable FLOPs – Part 2
AMD measures Max-Achievable FLOPs through controlled benchmarking: real-world data patterns, thermally stable devices, and cold-cache testing, revealing how actual performance differs from theoretical peaks.

How to Build a vLLM Container for Inference and Benchmarking
This post, the second in a series, provides a walkthrough for building a vLLM container that can be used for both inference and benchmarking.

AI Inference Orchestration with Kubernetes on Instinct MI300X, Part 2
This blog is part 2 of a series aimed at providing a comprehensive, step-by-step guide for deploying and scaling AI inference workloads with Kubernetes and the AMD GPU Operator on the AMD Instinct platform

Understanding Peak, Max-Achievable & Delivered FLOPs, Part 1

AI Inference Orchestration with Kubernetes on Instinct MI300X, Part 1
This blog is part 1 of a series aimed at providing a comprehensive, step-by-step guide for deploying and scaling AI inference workloads with Kubernetes and the AMD GPU Operator on the AMD Instinct platform

Getting started with AMD ROCm containers: from base images to custom solutions

SGLang: Fast Serving Framework for Large Language and Vision-Language Models on AMD Instinct GPUs
Discover SGLang, a fast serving framework for large language and vision-language models on AMD GPUs that offers an efficient runtime and a flexible programming interface.

TensorFlow Profiler in practice: Optimizing TensorFlow models on AMD GPUs
TensorFlow Profiler measures resource use and performance of models, helping identify bottlenecks for optimization. This blog demonstrates the use of the TensorFlow Profiler tool on AMD hardware.