AI Blogs - Page 5
Agentic Diagnosis for LLM Training at Scale
Explore how AI agents diagnose LLM training incidents — from RCCL hangs to throughput regressions — in one prompt with MaxText-Slurm.
Getting Started with ComfyUI on AMD Radeon™ RX 9000 Series GPUs
Learn how to set up and optimize ComfyUI on AMD Radeon RX 9000 GPUs with ROCm 7.1 — solve common issues and start generating.
HPC Coding Agent - Part 3: MCP Tool for Profiling
Build an AI agent specialized in optimizing HPC workloads by connecting a Cline agent to expert-level AMD profiling tools via a custom MCP server.
Fine-Tuning AI Surrogate Models for Physics Simulations with Walrus on AMD Instinct GPU Accelerators
A showcase of fine-tuning the foundational physics simulation model Walrus on a new physics dataset using AMD Instinct hardware.
Ensemble High-Resolution Weather Forecasting on AMD Instinct GPU Accelerators
A discussion on ensembling in weather forecasting, and a guide on how to run forecasting ensembles on AMD GPUs.
HPC Coding Agent - Part 2: An MCP Tool for Code Optimization with OpenEvolve
Learn how to use OpenEvolve as an MCP tool with an AI agent for agentic code optimization.
MaxText-Slurm: Production-Grade LLM Training with Built-In Observability
MaxText-Slurm: A unified launch system for production-grade LLM training with observability on AMD GPU clusters.
Streamlining Recommendation Model Training on AMD Instinct™ GPUs
Explore how the ROCm training Docker image can be used for recommendation model training on Instinct GPUs, along with a guide to configuring the workload.
Exploring Use Cases for Scalable AI: Implementing Ray with ROCm 7 Support for Efficient ML Workflows
Ray with ROCm helps you scale AI applications for training and inference workloads on AMD GPUs.
JAX-AITER: Bringing AMD’s Optimized AI Kernels to JAX on ROCm™
Use JAX-AITER to run AMD’s AITER-optimized AI kernels from JAX on AMD ROCm, starting with faster multi-head attention and expanding to more ops.
Getting Started with AMD Resource Manager: Efficient Sharing of AMD Instinct™ GPUs for R&D Teams and AI Practitioners
Learn how to use the AMD Resource Manager with this step-by-step guide to setting up projects, sharing compute resources, and monitoring resource utilization.
PyTorch Offline Tuning with TunableOp
Learn how to accelerate PyTorch workloads with TunableOp offline tuning—record, tune separately, and deploy faster inference.