Xuanwu Yin
Xuanwu Yin leads the model optimization team, driving work on model quantization, sparsity, speculative decoding, and efficient training/inference across multiple platforms. His team delivers high-performance, production-ready solutions for large language models, vision-language models, and image/video-generation pipelines, while providing direct support to customers.
Posts by Xuanwu Yin
Athena-PRM: Enhancing Multimodal Reasoning with Data-Efficient Process Reward Models
Learn how to use a data-efficient Process Reward Model to enhance the reasoning ability of large language and multimodal models.
Breaking the Accuracy-Speed Barrier: How MXFP4/6 Quantization Revolutionizes Image and Video Generation
Explore how MXFP4/6, supported by AMD Instinct™ MI350 series GPUs, achieves BF16-comparable image and video generation quality.
SparK: Query-Aware Unstructured Sparsity with Recoverable KV Cache Channel Pruning
In this blog, we discuss SparK, a training-free, plug-and-play method for KV cache compression in large language models (LLMs).
Týr-the-Pruner: Search-based Global Structural Pruning for LLMs
This blog introduces Týr-the-Pruner, a search-based, end-to-end framework for global structural pruning of large language models (LLMs).
Gumiho: A New Paradigm for Speculative Decoding — Earlier Tokens in a Draft Sequence Matter More
Gumiho boosts LLM inference by prioritizing early-token accuracy, blending serial and parallel decoding for fast, accurate, ROCm-optimized deployment.
Technical Dive into AMD's MLPerf Inference v5.1 Submission
In this blog, we share the technical details of how we achieved the results in our MLPerf Inference v5.1 submission.
Slim Down Your Llama: Pruning & Fine-Tuning for Maximum Performance
This blog describes the technical details of how we prune and fine-tune the Llama 3.1 405B model in our MLPerf Inference v5.1 submission.
Reproducing the AMD Instinct™ GPUs MLPerf Inference v5.1 Submission
In this blog, we provide step-by-step instructions on how to reproduce AMD's MLPerf Inference v5.1 submission.
Introducing AMD EVLM: Efficient Vision-Language Models with Parameter-Space Visual Conditioning
A novel approach that replaces visual tokens with perception-conditioned weights, reducing compute while maintaining strong vision-language performance.