Yixing Xu

Yixing Xu is an algorithm engineer with 10 years of experience. He currently focuses on model compression and acceleration techniques.

Posts by Yixing Xu

April 20, 2026

FLy: A New Paradigm for Speculative Decoding — Accepting Semantically Correct Drafts Beyond Exact Match

This blog explores FLy, a new training-free, loosely speculative decoding method that accepts semantically valid mismatches and speeds up the original speculative decoding (SPD) method.

https://rocm.blogs.amd.com/artificial-intelligence/fly/README.html

January 02, 2026

SparK: Query-Aware Unstructured Sparsity with Recoverable KV Cache Channel Pruning

In this blog we will discuss SparK, a training-free, plug-and-play method for KV cache compression in large language models (LLMs).

https://rocm.blogs.amd.com/artificial-intelligence/spark-blog/README.html

December 03, 2025

Týr-the-Pruner: Search-based Global Structural Pruning for LLMs

This blog introduces Týr-the-Pruner, a search-based, end-to-end framework for global structural pruning of large language models (LLMs).

https://rocm.blogs.amd.com/artificial-intelligence/tyr-the-pruner/README.html

October 14, 2025

Gumiho: A New Paradigm for Speculative Decoding — Earlier Tokens in a Draft Sequence Matter More

Gumiho boosts LLM inference with early-token accuracy, blending serial and parallel decoding for speed, accuracy, and ROCm-optimized deployment.

https://rocm.blogs.amd.com/software-tools-optimization/gumiho/README.html

September 09, 2025

Technical Dive into AMD's MLPerf Inference v5.1 Submission

In this blog, we share the technical details of how we achieved the results in our MLPerf Inference v5.1 submission.

https://rocm.blogs.amd.com/artificial-intelligence/mlperf-inference-v5.1/README.html

September 09, 2025

Reproducing the AMD Instinct™ GPUs MLPerf Inference v5.1 Submission

In this blog, we provide step-by-step instructions on how to reproduce AMD's MLPerf Inference v5.1 submission.

https://rocm.blogs.amd.com/artificial-intelligence/mlperf-inference5.1-repro/README.html

September 09, 2025

Slim Down Your Llama: Pruning & Fine-Tuning for Maximum Performance

This blog describes the technical details of how we pruned and fine-tuned the Llama 3.1 405B model in our MLPerf Inference v5.1 submission.

https://rocm.blogs.amd.com/artificial-intelligence/mlperf-llama-pruning/README.html