AI Blogs
Agentic Diagnosis for LLM Training at Scale
Explore how AI agents diagnose LLM training incidents — from RCCL hangs to throughput regressions — in one prompt with MaxText-Slurm.
Getting Started with ComfyUI on AMD Radeon™ RX 9000 Series GPUs
Learn how to set up and optimize ComfyUI on AMD Radeon RX 9000 GPUs with ROCm 7.1 — solve common issues and start generating.
HPC Coding Agent - Part 3: MCP Tool for Profiling
Build an AI agent specialized in optimizing HPC workloads by connecting a Cline agent to expert-level AMD profiling tools via a custom MCP server.
Fine-Tuning AI Surrogate Models for Physics Simulations with Walrus on AMD Instinct GPU Accelerators
A showcase of fine-tuning Walrus, a foundation model for physics simulation, on a new physics dataset using AMD Instinct hardware.
Ensemble High-Resolution Weather Forecasting on AMD Instinct GPU Accelerators
A discussion on ensembling in weather forecasting, and a guide on how to run forecasting ensembles on AMD GPUs.
HPC Coding Agent - Part 2: An MCP Tool for Code Optimization with OpenEvolve
Learn how to use OpenEvolve as an MCP tool with an AI agent for agentic code optimization.
MaxText-Slurm: Production-Grade LLM Training with Built-In Observability
A unified launch system for production-grade LLM training with built-in observability on AMD GPU clusters.
Streamlining Recommendation Model Training on AMD Instinct™ GPUs
Explore how the ROCm training Docker image can be used for recommendation model training on AMD Instinct GPUs, along with a guide on configuring the workload.
Exploring Use Cases for Scalable AI: Implementing Ray with ROCm 7 Support for Efficient ML Workflows
Ray with ROCm helps you scale AI applications for training and inference workloads on AMD GPUs.
Getting Started with AMD Resource Manager: Efficient Sharing of AMD Instinct™ GPUs for R&D Teams and AI Practitioners
Learn how to use the AMD Resource Manager with this step-by-step guide to setting up projects, sharing compute resources, and monitoring resource utilization.
JAX-AITER: Bringing AMD’s Optimized AI Kernels to JAX on ROCm™
Use JAX-AITER to run AMD’s AITER-optimized AI kernels from JAX on AMD ROCm, starting with faster multi-head attention and expanding to more ops.
PyTorch Offline Tuning with TunableOp
Learn how to accelerate PyTorch workloads with TunableOp offline tuning—record, tune separately, and deploy faster inference.