Featured Posts

Introducing Instella-Long: A Fully Open Language Model with Long-Context Capability
Learn about Instella-Long: AMD’s open 3B language model supporting 128K context, trained on MI300X GPUs, outperforming peers on long-context benchmarks.

The ROCm Revisited Series
We present our ROCm Revisited Series. Discover ROCm's role in leading-edge supercomputing and its growing ecosystem, from HIP to developer tools, powering AI, HPC, and data science across multi-GPU and cluster systems.

AMD’s MLPerf Training Debut: Optimizing LLM Fine-Tuning with Instinct™ GPUs
Explore the techniques we used to improve the training performance on MI300X and MI325X in our MLPerf Training 5.0 submission.

HIP 7.0 Is Coming: What You Need to Know to Stay Ahead
Get ready for HIP 7.0: explore key API changes that boost CUDA compatibility and streamline portable GPU development, and start preparing your code today.

Enabling Real-Time Context for LLMs: Model Context Protocol (MCP) on AMD GPUs
Learn how to leverage Model Context Protocol (MCP) servers to provide real-time context to LLMs, with a chatbot example on AMD GPUs.

Continued Pretraining: A Practical Playbook for Language-Specific LLM Adaptation
A step-by-step guide to adapting LLMs to new languages via continued pretraining, with Poro 2 boosting Finnish performance using Llama 3.1 and AMD GPUs.

Fine-Tuning LLMs with GRPO on AMD MI300X: Scalable RLHF with Hugging Face TRL and ROCm
Fine-tune LLMs with GRPO on AMD MI300X, leveraging ROCm, Hugging Face TRL, and vLLM for efficient reasoning and scalable RLHF.

Aligning Mixtral 8x7B with TRL on AMD GPUs
This blog demonstrates how to fine-tune and align Mixtral 8x7B with TRL using DPO and evaluate it on AMD GPUs.

AMD ROCm: Powering the World's Fastest Supercomputers
Discover how ROCm drives the world's top supercomputers, from El Capitan to Frontier, and why it's shaping the future of scalable, open, and sustainable HPC.

ROCm Revisited: Getting Started with HIP
New to HIP? This blog introduces the HIP runtime API, its key concepts, installation, and practical code examples that showcase its functionality.

ROCm Revisited: Evolution of the High-Performance GPU Computing Ecosystem
Learn how ROCm evolved to support HPC, AI, and containerized workloads with modern tools, libraries, and deployment options.

A Step-by-Step Guide On How To Deploy Llama Stack on AMD Instinct™ GPU
Learn how to use Meta's Llama Stack with AMD ROCm and vLLM to scale inference, integrate APIs, and streamline production-ready AI workflows on AMD Instinct™ GPUs.

LLM Quantization with Quark on AMD GPUs: Accuracy and Performance Evaluation
Learn how to use Quark to apply FP8 quantization to LLMs on AMD GPUs, and evaluate accuracy and performance using vLLM and SGLang on AMD MI300X GPUs.

Reproduce AMD's MLPerf Training v5.0 Submission Result with Instinct™ GPUs
Follow this step-by-step guide to reproduce AMD's MLPerf Training v5.0 submission with Instinct GPUs using ROCm.

High-Throughput BERT-L Pre-Training on AMD Instinct™ GPUs: A Practical Guide
Learn how to optimize BERT-L training with mixed precision and Flash Attention v2 on AMD Instinct GPUs, following our tested, MLPerf-compliant step-by-step guide.

Scale LLM Inference with Multi-Node Infrastructure
Learn how to horizontally scale LLM inference using open-source tools on MI300X, with vLLM, nginx, Prometheus, and Grafana.

ROCm Runfile Installer Is Here!
An overview of the ROCm Runfile Installer, introduced in ROCm 6.4, which delivers the driver and ROCm as a complete single package installable without internet connectivity.

From Theory to Kernel: Implement FlashAttention-v2 with CK-Tile
Learn how to implement FlashAttention-v2 with CK-Tile: minimize memory overhead, maximize compute efficiency, and scale on AMD GPUs.

Introducing ROCm-DS: GPU-Accelerated Data Science for AMD Instinct™ GPUs
Accelerate data science with ROCm-DS, AMD's GPU-optimized toolkit for faster data frames and graph analytics using hipDF and hipGRAPH.

Unleash Full GPU Potential: Overlap Communication and Computation with Triton-Distributed
Unlock the full power of AMD GPUs by writing portable, efficient kernels with Triton-Distributed, overlapping computation and communication with ease and flexibility.

Stay informed
- Subscribe to our RSS feed (requires an RSS reader, available as a browser plugin)
- Sign up for the ROCm newsletter
- View our blog statistics
- View the ROCm Developer Hub
- Report an issue or request a feature
- We are eager to learn from our community! If you would like to contribute to the ROCm Blogs, please submit your technical blog for review on our GitHub; blog creation can be started through our GitHub issues form.