Posts by Daniel Huang

Getting Started with ComfyUI on AMD Radeon™ RX 9000 Series GPUs

ComfyUI has become one of the most widely adopted and versatile node-based interfaces for Stable Diffusion and other generative AI models, with significant traction in the AI content creation community. Unlike traditional form-based web interfaces, ComfyUI provides a node-based workflow system that gives users complete control over their image and video generation pipelines. Its modular architecture supports complex workflows involving multiple models, LoRAs, ControlNets, and custom processing steps.

Read more ...


Quickly Developing Powerful Flash Attention Using TileLang on AMD Instinct MI300X GPU

As the AMD ROCm™ software ecosystem evolves rapidly, the high barrier to operator development has long been a bottleneck, and TileLang offers developers an efficient way past it. An emerging AI operator development framework, TileLang encapsulates low-level GPU details behind concise syntax, letting developers tap the full computing potential of AMD GPUs without in-depth knowledge of low-level languages such as HIP. The AMD Instinct™ MI300X, a flagship GPU for AI workloads, pairs ultra-high-bandwidth memory with powerful compute units, but it needs well-tuned, high-performance operators to unleash those capabilities. In this blog, we take Flash Attention, a key kernel in both LLM training and inference, as an example to walk through the full TileLang development process on the MI300X, highlighting the dual benefits of efficiency and performance that TileLang brings to AMD operator development.
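For readers new to the kernel itself, the tiled, online-softmax computation that Flash Attention performs can be sketched in plain NumPy. This is a reference illustration of the algorithm only, not TileLang or HIP code; the function name and block size are illustrative:

```python
import numpy as np

def flash_attention(Q, K, V, block_size=64):
    """Single-head Flash Attention forward pass via online softmax.

    K/V are processed in tiles, so the full (N x N) score matrix is
    never materialized -- the memory-traffic saving that makes the
    fused kernel fast on GPUs. Pure-NumPy reference, not optimized.
    """
    n, d = Q.shape
    scale = 1.0 / np.sqrt(d)
    O = np.zeros_like(Q, dtype=np.float64)   # unnormalized output accumulator
    m = np.full(n, -np.inf)                  # running row-wise max of scores
    l = np.zeros(n)                          # running softmax denominator
    for start in range(0, K.shape[0], block_size):
        Kb = K[start:start + block_size]
        Vb = V[start:start + block_size]
        S = (Q @ Kb.T) * scale                   # scores for this tile
        m_new = np.maximum(m, S.max(axis=1))     # updated running max
        alpha = np.exp(m - m_new)                # rescale factor for old state
        P = np.exp(S - m_new[:, None])           # tile probabilities (unnormalized)
        l = l * alpha + P.sum(axis=1)
        O = O * alpha[:, None] + P @ Vb
        m = m_new
    return O / l[:, None]                        # normalize at the end
```

The result matches ordinary softmax attention exactly; the TileLang version in the post maps each tile of this loop onto MI300X compute units and shared memory.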

Read more ...


AITER-Enabled MLA Layer Inference on AMD Instinct MI300X GPUs

For developers pushing LLM inference to its limits, efficiency and speed are non-negotiable. DeepSeek-V3’s Multi-head Latent Attention (MLA) layer rethinks traditional attention to cut memory bandwidth pressure while maintaining accuracy. Combined with the matrix absorption optimization and AMD’s AI Tensor Engine for ROCm (AITER), this can deliver up to 2X faster inference on AMD Instinct™ MI300X GPUs compared to non-AITER runs.
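The core idea behind MLA, low-rank KV compression plus absorbing the up-projection into the query side, can be illustrated with a toy NumPy sketch. The dimensions here are illustrative (not DeepSeek-V3's actual sizes), and per-head structure and RoPE decoupling are omitted for brevity:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_latent, n_heads, d_head, seq = 512, 64, 8, 64, 256  # toy sizes

# Down-projection to the latent, and up-projections back to keys/values.
W_dkv = rng.standard_normal((d_model, d_latent)) / np.sqrt(d_model)
W_uk = rng.standard_normal((d_latent, n_heads * d_head)) / np.sqrt(d_latent)
W_uv = rng.standard_normal((d_latent, n_heads * d_head)) / np.sqrt(d_latent)

h = rng.standard_normal((seq, d_model))   # token hidden states
c_kv = h @ W_dkv                          # the only thing MLA caches: seq x d_latent
K = c_kv @ W_uk                           # keys/values reconstructed on demand
V = c_kv @ W_uv

# A standard KV cache stores K and V in full; MLA stores only c_kv.
cache_ratio = (K.size + V.size) / c_kv.size
print(cache_ratio)  # → 16.0 in this toy setting

# "Matrix absorption": fold W_uk into the query so attention scores can be
# computed directly against the cached latent, never materializing K.
q = rng.standard_normal((1, n_heads * d_head))
scores_direct = q @ K.T
scores_absorbed = (q @ W_uk.T) @ c_kv.T
assert np.allclose(scores_direct, scores_absorbed)
```

The bandwidth saving comes from reading the small latent cache instead of full keys and values at every decode step; AITER supplies fused kernels that exploit exactly this structure on the MI300X.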

Read more ...