Recent Posts
Streamlining Recommendation Model Training on AMD Instinct™ GPUs
Explore how to use the ROCm training Docker image for recommendation model training on Instinct GPUs, with a guide to configuring the workload.
MaxText-Slurm: Production-Grade LLM Training with Built-In Observability
MaxText-Slurm: A unified launch system for production-grade LLM training with observability on AMD GPU clusters.
Exploring Use Cases for Scalable AI: Implementing Ray with ROCm 7 Support for Efficient ML Workflows
Ray with ROCm helps you scale AI applications for training and inference workloads on AMD GPUs.
PyTorch Offline Tuning with TunableOp
Learn how to accelerate PyTorch workloads with TunableOp offline tuning: record, tune separately, and deploy for faster inference.
Getting Started with AMD Resource Manager: Efficient Sharing of AMD Instinct™ GPUs for R&D Teams and AI Practitioners
Learn how to use the AMD Resource Manager with this step-by-step guide to setting up projects, sharing compute resources, and monitoring resource utilization.
JAX-AITER: Bringing AMD’s Optimized AI Kernels to JAX on ROCm™
Use JAX-AITER to run AMD’s AITER-optimized AI kernels from JAX on AMD ROCm, starting with faster multi-head attention and expanding to more ops.
Primus-Pipeline: A More Flexible and Scalable Pipeline Parallelism Implementation
Learn how to use our flexible and scalable pipeline parallelism framework with the Primus backend on AMD hardware.
FlyDSL: Expert GPU Kernel Development with the Ease of MLIR Python Native DSL on AMD GPUs
FlyDSL is a Python-first, MLIR-native DSL for expert GPU kernel development and tuning on AMD GPUs.
Introducing hipThreads: A C++-Style Concurrency Library for AMD GPUs
Discover how hipThreads lets you write hip::thread just like std::thread and unlock GPU acceleration with minimal code changes.
Unlocking Sparse Acceleration on AMD GPUs with hipSPARSELt
This blog post introduces the semi-structured sparsity supported on AMD systems and explains how to use the hipSPARSELt library to take advantage of it.
Advanced MXFP4 Quantization: Combining Fine-Tuned Rotations with SmoothQuant for Near-Lossless Compression
See the advanced algorithms available in AMD Quark for efficient, high-accuracy MXFP4 quantization on AMD Instinct accelerators.
Adaptive Top-K Selection: Eliminating Performance Cliffs Across All K Values on AMD GPUs
Explore adaptive Top-K on MI300X! See how auto-selection and hardware optimizations like DPP and double buffering drive peak efficiency.