Posts by Phani Vaddadi

Elevating 3D Scene Rendering with GSplat

In this blog, we explore how to use GSplat, a GPU-optimized Python library for training and rendering 3D Gaussian Splatting (3DGS) models, on AMD devices. This tutorial will guide you through training a model of a scene from a set of captured images, which will then allow you to render novel views of the scene. We use a port of the original GSplat code that has been optimized for AMD GPUs. The examples used throughout this blog were trained and rendered using an AMD Instinct™ MI300X GPU.

Read more ...


From Ingestion to Inference: RAG Pipelines on AMD GPUs

Retrieval-Augmented Generation (RAG) is a machine learning architecture that enhances Large Language Models (LLMs) by combining generation with information retrieval from external sources. It was introduced to address the limitations of traditional LLMs by allowing them to access and utilize up-to-date information from internal and/or external knowledge bases. When a query is received, RAG first retrieves relevant documents or information from its knowledge bases, then uses this retrieved context alongside the query to generate more accurate and informed responses. This approach helps reduce hallucinations (making up information) common in standard LLMs, while also enabling the model to access current information not present in its original training data. RAG has become particularly valuable in enterprise applications, such as customer support systems, research assistants, and documentation tools, where accuracy and verifiable information are crucial.
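The retrieve-then-generate flow described above can be sketched in a few lines of Python. This is a toy illustration, not any specific RAG framework: `retrieve` and `build_prompt` are hypothetical names, and the word-overlap scoring is a stand-in for a real embedding-based similarity search.

```python
def retrieve(query: str, knowledge_base: list[str], top_k: int = 2) -> list[str]:
    """Score documents by word overlap with the query (a crude stand-in
    for embedding similarity) and return the top_k matches."""
    q_words = set(query.lower().split())
    scored = sorted(
        knowledge_base,
        key=lambda doc: len(q_words & set(doc.lower().split())),
        reverse=True,
    )
    return scored[:top_k]

def build_prompt(query: str, context: list[str]) -> str:
    """Combine the retrieved context with the user query before
    handing the augmented prompt to an LLM for generation."""
    ctx = "\n".join(f"- {doc}" for doc in context)
    return f"Answer using only this context:\n{ctx}\n\nQuestion: {query}"

kb = [
    "ROCm is AMD's open software stack for GPU computing.",
    "RAG retrieves documents before generating an answer.",
    "Bananas are rich in potassium.",
]
query = "What does RAG retrieve?"
prompt = build_prompt(query, retrieve(query, kb))
```

In a production pipeline, the retrieval step would query a vector database and the prompt would be sent to an LLM; the structure, retrieve then augment then generate, is the same.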

Read more ...


Enabling FlashInfer on ROCm for Accelerated LLM Serving

FlashInfer is an innovative framework designed to accelerate inference of large language models (LLMs). Given the explosive growth and adoption of models like DeepSeek R1, Llama 3, and Qwen 3, efficient inference is critical to meet the demands of real-world deployment. However, challenges such as GPU memory bottlenecks, throughput limitations, and latency remain significant hurdles for deploying these models at scale.

Read more ...


Coding Agents on AMD GPUs: Fast LLM Pipelines for Developers

The rapid rise of AI-assisted development is transforming how software is built, with coding agents emerging as powerful tools for modern developers. In this blog, we will show you how to deploy coding agents on AMD GPUs using frameworks such as SGLang, vLLM, and llama.cpp, and walk through a practical workflow example: creating a Minesweeper game using Aider.

Read more ...


Exploring Use Cases for Scalable AI: Implementing Ray with ROCm Support for Efficient ML Workflows

In this blog, you will learn how to use Ray to easily scale your AI applications from your laptop to multiple AMD GPUs.

Read more ...


Llama.cpp Meets Instinct: A New Era of Open-Source AI Acceleration

Llama.cpp is an open-source implementation of a Large Language Model (LLM) inference framework designed to run efficiently on diverse hardware configurations, both locally and in cloud environments. Its plain C/C++ implementation ensures a dependency-free setup, allowing it to seamlessly support various hardware architectures across CPUs and GPUs. The framework offers a range of quantization options, from 1.5-bit to 8-bit integer quantization, to achieve faster inference and reduced memory usage. Llama.cpp is part of an active open-source community within the AI ecosystem, with over 1200 contributors and almost 4000 releases on its official GitHub repository as of early August 2025. Designed as a CPU-first C++ library, llama.cpp offers simplicity and easy integration with other programming environments, making it widely compatible and rapidly adopted across diverse platforms, especially among consumer devices.
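To give a rough sense of the integer quantization idea mentioned above, the Python sketch below maps float weights to 8-bit integers plus a per-block scale, trading a little precision for much smaller memory use. This mirrors the concept only; llama.cpp's actual block-wise quantized formats are more sophisticated, and `quantize_q8`/`dequantize` are hypothetical names, not its API.

```python
def quantize_q8(weights: list[float]) -> tuple[list[int], float]:
    """Symmetric 8-bit quantization: store ints in [-127, 127]
    plus one float scale instead of full-precision floats."""
    amax = max(abs(w) for w in weights)
    scale = amax / 127 if amax else 1.0
    return [round(w / scale) for w in weights], scale

def dequantize(q: list[int], scale: float) -> list[float]:
    """Recover approximate float values from the ints and the scale."""
    return [v * scale for v in q]

w = [0.5, -1.0, 0.25, 0.0]
q, s = quantize_q8(w)      # 8-bit ints + one scale factor
restored = dequantize(q, s)  # close to w, within quantization error
```

Each weight now occupies one byte instead of four (plus a shared scale), which is why quantization cuts both memory footprint and memory bandwidth during inference.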

Read more ...


DGL in the Real World: Running GNNs on Real Use Cases

In our previous blog post, we introduced the Deep Graph Library (DGL) and highlighted how its support on the AMD ROCm platform unlocks scalable, performant graph neural networks (GNNs) on AMD GPUs. That post focused on the why — the growing relevance of graph workloads and what it means to bring that capability to AMD’s accelerated computing ecosystem.

Read more ...


Graph Neural Networks at Scale: DGL with ROCm on AMD Hardware

This blog introduces the Deep Graph Library (DGL) and explores its significance on AMD hardware for enabling scalable, performant graph neural networks.

Read more ...


Accelerating Parallel Programming in Python with Taichi Lang on AMD GPUs

Taichi Lang is an open-source, imperative, parallel programming language for high-performance numerical computation. It is embedded in Python and uses just-in-time (JIT) compiler frameworks (e.g., LLVM) to offload compute-intensive Python code to native GPU or CPU instructions. The language has broad applications spanning real-time physical simulation, numerical computation, augmented reality, artificial intelligence, vision and robotics, visual effects in films and games, general-purpose computing, and much more [1].

Read more ...


Reinforcement Learning from Human Feedback on AMD GPUs with verl and ROCm Integration

In this blog post, we provide an overview of Volcano Engine Reinforcement Learning for LLMs (verl) and discuss its benefits in large-scale reinforcement learning from human feedback (RLHF). We also detail the modifications made to the codebase to optimize verl’s performance on AMD Instinct GPUs. Next, we walk through the process of building the Docker image using a Dockerfile on the user side, along with training scripts tailored for both single-node and multi-node setups. Lastly, we present verl’s performance results, focusing on throughput and convergence accuracy achieved on AMD Instinct™ MI300X GPUs. Follow this guide to get started with verl on AMD Instinct GPUs and accelerate your RLHF training with ROCm-optimized performance.

Read more ...


Efficient MoE training on AMD ROCm: How to use Megablocks on AMD GPUs

Training massive deep-learning models requires a balance of efficiency and scalability. In the context of the Transformer architecture, Mixture of Experts (MoE) models are massive machine learning architectures characterized by dividing tasks among multiple specialized sub-networks, or “experts”. A gating network determines the expert to which a given input should be routed, enabling the model to handle complex tasks more efficiently by using the specialized capabilities of each expert. This dynamic routing mechanism allows MoE models to scale efficiently, activating only a subset of the network for each input and thereby reducing computational load while maintaining high model capacity.
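The gating mechanism described above can be sketched with toy Python. Everything here is illustrative and not Megablocks' API: the experts are plain functions and the gate scores are supplied directly, so the sketch only shows top-k routing with renormalized gate weights.

```python
import math

def softmax(xs: list[float]) -> list[float]:
    """Numerically stable softmax over the gate scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def moe_forward(x: float, experts, gate_scores: list[float], top_k: int = 2) -> float:
    """Route input x to the top-k experts and combine their outputs,
    weighted by the renormalized gate probabilities."""
    probs = softmax(gate_scores)
    chosen = sorted(range(len(experts)), key=lambda i: probs[i], reverse=True)[:top_k]
    norm = sum(probs[i] for i in chosen)
    # Only the selected experts are evaluated: this sparse activation is
    # what keeps compute low while total model capacity stays high.
    return sum(probs[i] / norm * experts[i](x) for i in chosen)

experts = [lambda x: x + 1, lambda x: 2 * x, lambda x: -x]
out = moe_forward(3.0, experts, gate_scores=[2.0, 1.0, -1.0], top_k=2)
```

With `top_k=2`, the third expert is never called for this input; in a real MoE layer the gate scores come from a learned network and the experts are feed-forward blocks, but the routing logic is the same.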

Read more ...