Recent Posts - Page 11
Optimizing FP4 Mixed-Precision Inference with Petit on AMD Instinct MI250 and MI300 GPUs: A Developer’s Perspective
Learn how FP4 mixed-precision on AMD GPUs boosts inference speed and integrates seamlessly with SGLang.
Elevating 3D Scene Rendering with GSplat
ROCm Port of GSplat - GPU-accelerated rasterization of Gaussian splatting
Optimizing Drug Discovery Tools on AMD MI300X Part 2: 3D Molecular Generation with SemlaFlow
Learn how to set up, run, and optimize SemlaFlow, a molecular generation tool, on AMD MI300X GPUs for faster drug discovery workflows.
From Ingestion to Inference: RAG Pipelines on AMD GPUs
Build a RAG-enhanced GenAI application that improves model responses by incorporating data missing from the model's training data.
GPU Partitioning Made Easy: Pack More AI Workloads Using AMD GPU Operator
What’s New in AMD GPU Operator: Learn About GPU Partitioning and New Kubernetes Features
Enabling FlashInfer on ROCm for Accelerated LLM Serving
FlashInfer is an open-source library for accelerating LLM serving that is now supported by ROCm.
Matrix Core Programming on AMD CDNA™3 and CDNA™4 architecture
This blog post explains how to use Matrix Cores on the CDNA3 and CDNA4 architectures, with a focus on low-precision data types such as FP16, FP8, and FP4.
Coding Agents on AMD GPUs: Fast LLM Pipelines for Developers
Accelerate AI-assisted coding with agentic workflows on AMD GPUs. Deploy DeepSeek-V3.1 via SGLang, vLLM, or llama.cpp to power fast, scalable coding agents.
Day-0 Support for the SGLang-Native RL Framework - slime on AMD Instinct™ GPUs
Learn how to deploy slime on AMD GPUs for high-performance RL training with ROCm optimization.
Accelerating Audio-Driven Video Generation: WAN2.2-S2V on AMD ROCm
This blog highlights AMD ROCm’s ability to power next-generation audio-to-video models with simple, reproducible workflows.
A Simple Design for Serving Video Generation Models with Distributed Inference
Minimalist FastAPI + Redis + Torchrun design for serving video generation models with distributed inference.
An Introduction to Primus-Turbo: A Library for Accelerating Transformer Models on AMD GPUs
Primus streamlines training on AMD ROCm, from fine-tuning to massive pretraining on MI300X GPUs: faster, safer, and easier to debug.