Developers - Software Tools & Optimizations
Optimizing MI300X Inter-Chiplet Communication via the RCCL Tuner API
Learn how to build a topology-aware RCCL tuner plugin for MI300X CPX/NPS4 mode and validate it with rccl-tests.
Accelerating LLM Inference on AMD GPUs with Low-Latency GEMMs
Learn how FlyDSL low-latency GEMMs speed up LLM decode on AMD GPUs with Split-K, K-slice parallelism, and an LDS-based pipeline.
OpenXLA and JAX - ROCm Support and the State of CI
Learn how OpenXLA and JAX run on AMD ROCm: what landed this year, how every PR is gated on real Instinct hardware, and how to get started.
Building and Deploying Custom hipBLASLt Libraries on AMD Instinct GPUs
Learn how to manage hipBLASLt environments with custom source builds, RPM/DEB packaging, and version switching on AMD Instinct GPUs.
Dropless MoE Training in JAX with Primus-Turbo
Learn how to train dropless MoE in JAX/MaxText with Primus-Turbo's grouped GEMM and DeepEP all-to-all for faster, more memory-efficient training.
Performance Profiling on AMD GPUs - Part 4: Fortran OpenMP Offload Edition
Guides developers through profiling and optimizing Fortran OpenMP GPU offload applications using ROCm tools.
From Build to Benchmark: ONNX Model Serving with Triton Inference Server on AMD GPUs
Step-by-step guide to building, deploying, and benchmarking ONNX models with Triton Inference Server and MIGraphX on AMD GPUs.
vLLM-ATOM: Unlocking Native AMD Performance in the vLLM Ecosystem
Use ATOM as an out-of-tree vLLM plugin to keep vLLM compatibility while enabling AMD-optimized attention, model execution, and multi-model support including Kimi-K2.5.
Primus Projection: Estimate Memory and Performance Before You Train
Learn how to use the Primus projection tool to estimate memory and performance for large-scale LLM training on AMD Instinct™ accelerator platforms.
Getting Started with FlyDSL Nightly Wheels on ROCm
A practical guide to installing and using FlyDSL nightly wheels on ROCm for fast, Python-native GPU kernel development.
Agentic Diagnosis for LLM Training at Scale
Explore how AI agents diagnose LLM training incidents, from RCCL hangs to throughput regressions, in one prompt with MaxText-Slurm.
MaxText-Slurm: Production-Grade LLM Training with Built-In Observability
MaxText-Slurm: A unified launch system for production-grade LLM training with observability on AMD GPU clusters.