Posts by Debasis Mandal
Accelerating ComfyUI Workflows on AMD Instinct™ MI355X GPUs with ROCm
- 11 May 2026
ComfyUI is an open-source, graphical node-based interface for building generative AI workflows using diffusion models. With over 100,000 stars on GitHub, it has become one of the most widely adopted tools for text-to-image, text-to-video, and image-to-3D generation. You can build workflows by connecting nodes in a drag-and-drop visual interface (no coding required). A large community contributes custom nodes and workflow templates, making ComfyUI a versatile front-end for models ranging from 12B to 27B parameters. For background and setup across AMD platforms, see the earlier ROCm blogs Running ComfyUI on AMD Instinct, Getting Started with ComfyUI on AMD Radeon™ RX 9000 Series GPUs, and Running ComfyUI in Windows with ROCm on WSL.
Serving CTR Recommendation Models with Triton Inference Server using the ONNX Runtime Backend
- 07 April 2026
A previous ROCm blog post, “Triton Inference Server with vLLM on AMD GPUs”, introduced deploying large language models using Triton Inference Server with the vLLM backend on ROCm-enabled AMD GPUs. In this blog, you will explore the ONNX Runtime and Python backends in the ROCm build of Triton Inference Server, along with an upgrade that aligns the build with the latest upstream Triton Inference Server release. You will also see how these enhancements expand AI model deployment capabilities and highlight the performance advantages of AMD Instinct GPUs using a representative recommendation model.
FlashInfer on ROCm: High‑Throughput Prefill Attention via AITER
- 06 April 2026
The explosive growth of large language models (LLMs) like DeepSeek-R1, Llama 3, and Qwen 3 has created an urgent need for efficient inference solutions. As these models scale to billions of parameters and context lengths extend to hundreds of thousands of tokens, the attention mechanism becomes a critical bottleneck, consuming substantial memory for key-value (KV) caches and requiring significant compute for each token generated.
Enabling FlashInfer on ROCm for Accelerated LLM Serving
- 01 October 2025
FlashInfer is a framework designed to accelerate inference of large language models (LLMs). Given the rapid growth and adoption of models like DeepSeek-R1, Llama 3, and Qwen 3, efficient inference is critical to meet the demands of real-world deployment. However, GPU memory bottlenecks, throughput limitations, and latency remain significant hurdles for deploying these models at scale.