Recent Posts#
Engineering Qwen-VL for Production: Vision Module Architecture and Optimization Practices
Explore how to optimize Qwen-VL for production on AMD Instinct MI308X GPUs with ROCm, from vision module architecture to kernel fusion and deployment.
GROMACS on AMD Instinct GPUs: A Complete Build Guide
Build GROMACS with HIP, UCX, and OpenMPI on AMD MI300X/MI355X — covering bare metal, Apptainer, and Docker deployments.
Accelerating Kimi-K2.5 on AMD Instinct™ MI300X: Optimizing Fused MoE with FlyDSL
Optimize Kimi-K2.5 on AMD MI300X using FlyDSL for fused MoE kernel acceleration. Achieve faster TTFT, TPOT, and throughput with our step-by-step optimization guide.
AMD Device Metrics Exporter v1.4.2: Enhanced Observability, Deeper RAS Insights, and Smarter GPU Telemetry for Modern HPC & AI Clusters
Struggling with GPU bottlenecks? Learn how AMD DME v1.4.2 uncovers power, thermal, and RAS issues with actionable, production-ready telemetry.
Edge-to-Cloud Robotics with AMD ROCm: From Data Collection to Real-Time Inference
This blog demonstrates a comprehensive Edge-to-Cloud robotics AI solution powered by the AMD ecosystem and the Hugging Face LeRobot framework.
Utilizing AMD Instinct GPU Accelerators for Weather and Precipitation Forecasting with NeuralGCM
A showcase of how to run NeuralGCM, a hybrid GCM model, on AMD Instinct hardware, including an introduction, installation, inference, and plotting.
hipBLASLt Online GEMM Tuning
Learn how to improve model performance with hipBLASLt online tuning merged into LLM framework
Multi-Node Distributed Inference for Diffusion Models with xDiT
Follow a tutorial on multi-node video generation with diffusion models, covering scaling considerations and a practical Docker-based example.
GROMACS Performance on AMD Instinct MI355X
Explore GROMACS molecular dynamics performance benchmarks on AMD Instinct MI355X GPUs with HIP acceleration.
FP8 GEMM Optimization on AMD CDNA™4 Architecture
Learn how to build high-performance FP8 GEMM kernels on AMD CDNA™4 GPUs using MFMA, LDS swizzling, and double-buffering.
Agentic Diagnosis for LLM Training at Scale
Explore how AI agents diagnose LLM training incidents — from RCCL hangs to throughput regressions — in one prompt with MaxText-Slurm.
Getting Started with ComfyUI on AMD Radeon™ RX 9000 Series GPUs
Learn how to set up and optimize ComfyUI on AMD Radeon RX 9000 GPUs with ROCm 7.1 — solve common issues and start generating.