Yixing Xu
Yixing Xu is an algorithm engineer with 10 years of experience. Currently, he focuses on model compression and acceleration techniques.
Posts by Yixing Xu
FLy: A New Paradigm for Speculative Decoding — Accepting Semantically Correct Drafts Beyond Exact Match
This blog explores a new training-free, loosely speculative decoding method that accepts semantically valid mismatches and speeds up the original speculative decoding (SPD) method.
SparK: Query-Aware Unstructured Sparsity with Recoverable KV Cache Channel Pruning
In this blog, we discuss SparK, a training-free, plug-and-play method for KV cache compression in large language models (LLMs).
Týr-the-Pruner: Search-based Global Structural Pruning for LLMs
This blog introduces Týr-the-Pruner, a search-based, end-to-end framework for global structural pruning of large language models (LLMs).
Gumiho: A New Paradigm for Speculative Decoding — Earlier Tokens in a Draft Sequence Matter More
Gumiho boosts LLM inference with early-token accuracy, blending serial + parallel decoding for speed, accuracy, and ROCm-optimized deployment.
Reproducing the AMD Instinct™ GPUs MLPerf Inference v5.1 Submission
In this blog, we provide step-by-step instructions on how to reproduce AMD's MLPerf Inference v5.1 submission.
Technical Dive into AMD's MLPerf Inference v5.1 Submission
In this blog, we share the technical details of how we achieved the results in our MLPerf Inference v5.1 submission.
Slim Down Your Llama: Pruning & Fine-Tuning for Maximum Performance
This blog describes the technical details of how we pruned and fine-tuned the Llama 3.1 405B model for our MLPerf Inference v5.1 submission.