Posts by Xuanwu Yin

Low Kruskal-Rank Adaptation

11 June 2026

This blog explores how to improve Low-Rank Adaptation (LoRA) by replacing its matrix-rank criterion with the Kruskal rank for efficient training. LoRA is one of the most widely used parameter-efficient fine-tuning (PEFT) methods for adapting pre-trained large language models (LLMs) to downstream tasks. Although LoRA significantly reduces the number of trainable parameters and lowers fine-tuning costs, its performance is often limited by the inherent low-rank assumption. We revisit the notion of rank for LoRA update matrices and show that the standard matrix rank fails to capture duplicated directions and redundancy in the update subspace. Motivated by this analysis, we argue that the Kruskal rank offers a more informative criterion for characterizing update diversity. We therefore propose Low Kruskal-Rank Adaptation (LoKRA), a new PEFT algorithm with provable theoretical guarantees that mitigates the limitations of LoRA. We further introduce LoKRA+, an enhanced variant that provides a tighter theoretical lower bound on the Kruskal rank and yields stronger empirical performance. Experiments on multiple LLMs show that our approach consistently outperforms LoRA and other baselines, establishing state-of-the-art performance across a range of benchmarks. The paper has been accepted to ICML 2026 (paper link), and the code is publicly available on GitHub.
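To make the difference between the two notions of rank concrete, here is a minimal sketch in plain Python. It uses the standard definition of the Kruskal rank (the largest k such that every set of k columns is linearly independent) and a brute-force check over column subsets. The function names and the toy matrices are illustrative assumptions; this is not the LoKRA algorithm, and the summary above does not say which matrix LoKRA constrains.

```python
from fractions import Fraction
from itertools import combinations


def matrix_rank(columns):
    """Rank of a set of vectors, computed with exact Gaussian elimination."""
    # Treat each vector as a row; the rank of the set is unchanged.
    rows = [[Fraction(x) for x in col] for col in columns]
    rank = 0
    n_entries = len(rows[0]) if rows else 0
    for j in range(n_entries):
        # Look for a row at or below position `rank` with a nonzero entry j.
        pivot = next((r for r in range(rank, len(rows)) if rows[r][j] != 0), None)
        if pivot is None:
            continue
        rows[rank], rows[pivot] = rows[pivot], rows[rank]
        # Eliminate entry j from every other row.
        for r in range(len(rows)):
            if r != rank and rows[r][j] != 0:
                factor = rows[r][j] / rows[rank][j]
                rows[r] = [a - factor * b for a, b in zip(rows[r], rows[rank])]
        rank += 1
    return rank


def kruskal_rank(columns):
    """Largest k such that EVERY subset of k columns is linearly independent."""
    k = 0
    for size in range(1, len(columns) + 1):
        subsets = combinations(columns, size)
        if all(matrix_rank(list(s)) == size for s in subsets):
            k = size
        else:
            break  # one dependent subset of this size caps the Kruskal rank
    return k


# Hypothetical toy matrices, given as lists of columns.
duplicated = [(1, 0, 0), (1, 0, 0), (0, 1, 0)]  # first two columns are identical
diverse = [(1, 0, 0), (0, 1, 0), (1, 1, 0)]     # no two columns are parallel

rank_dup, krank_dup = matrix_rank(duplicated), kruskal_rank(duplicated)
rank_div, krank_div = matrix_rank(diverse), kruskal_rank(diverse)
```

Both toy matrices have the same matrix rank, yet the one with a duplicated column has a lower Kruskal rank. This is the sense in which matrix rank alone can hide repeated directions. The subset enumeration grows combinatorially with the number of columns, so the check is only practical for small illustrations.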

Read more ...

Enabling Speculative Speculative Decoding on MI300X

29 May 2026

Speculative speculative decoding (SSD) [1] is a recently proposed speculative decoding (SD) algorithm that further accelerates large language model (LLM) inference beyond conventional SD. In standard SD, a small draft model proposes several future tokens, and a large target model verifies them in parallel. SD already reduces the cost of purely autoregressive decoding, but it still contains a sequential dependency: the next draft step cannot start until the current verification step finishes.
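The draft-then-verify loop of standard SD can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions: decoding is greedy, verification is an exact token match, and the two "models" are hypothetical toy functions that map a token list to the next token. It does not implement SSD itself.

```python
def speculative_step(prefix, draft_next, target_next, num_draft):
    """One draft-then-verify round; returns the tokens to append to prefix."""
    # Draft phase: the small model proposes num_draft tokens one by one.
    drafted = []
    for _ in range(num_draft):
        drafted.append(draft_next(prefix + drafted))

    # Verify phase: the target model scores every drafted position.
    # A real system does this in one batched forward pass; a loop stands in here.
    accepted = []
    for i, token in enumerate(drafted):
        target_token = target_next(prefix + drafted[:i])
        if token == target_token:
            accepted.append(token)
        else:
            # First mismatch: keep the target's token and drop the rest.
            accepted.append(target_token)
            return accepted
    # Every draft matched, so the target contributes one extra token.
    accepted.append(target_next(prefix + drafted))
    return accepted


def generate(prompt, draft_next, target_next, num_draft, max_new_tokens):
    tokens = list(prompt)
    while len(tokens) - len(prompt) < max_new_tokens:
        # The next draft phase cannot begin until this round has been verified.
        tokens += speculative_step(tokens, draft_next, target_next, num_draft)
    return tokens[: len(prompt) + max_new_tokens]


# Hypothetical toy models over the tokens 0..9.
def target_next(seq):
    return (seq[-1] + 1) % 10


def draft_next(seq):
    # Agrees with the target except after token 4, where it guesses wrong.
    return 0 if seq[-1] == 4 else (seq[-1] + 1) % 10


def greedy_reference(prompt, n):
    tokens = list(prompt)
    for _ in range(n):
        tokens.append(target_next(tokens))
    return tokens


spec_output = generate([0], draft_next, target_next, num_draft=3, max_new_tokens=12)
ref_output = greedy_reference([0], 12)
```

With exact-match verification the speculative output is identical to plain greedy decoding with the target model; only the number of target passes changes. The loop in `generate` makes the sequential dependency visible: each round's drafting waits for the previous round's verification.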

Read more ...

FLy: A New Paradigm for Speculative Decoding — Accepting Semantically Correct Drafts Beyond Exact Match

20 April 2026

Speculative decoding has emerged as a highly effective approach to accelerate large language model (LLM) inference, yet existing methods are severely bottlenecked by a rigid exact-match verification rule that discards many semantically valid continuations. Furthermore, existing training-based loose decoding methods often suffer from significant performance degradation on out-of-distribution (OOD) tasks.

Read more ...

Athena-PRM: Enhancing Multimodal Reasoning with Data-Efficient Process Reward Models

12 January 2026

This blog introduces Athena-PRM, a multimodal Process Reward Model (PRM) designed to evaluate the reward score for each step in solving complex reasoning problems. To efficiently generate high-quality process-labeled data, we leverage prediction consistency between weak and strong completers as a criterion for identifying reliable process labels. We also develop two effective strategies to improve the performance of PRMs: ORM initialization and up-sampling for negative data.

Read more ...

Breaking the Accuracy-Speed Barrier: How MXFP4/6 Quantization Revolutionizes Image and Video Generation

07 January 2026

This blog introduces MXFP4 and MXFP6, the newly supported data types on AMD Instinct™ MI350 Series GPUs, and demonstrates their remarkable quality in image and video generation tasks. By reading this blog, you will discover how these low-bit formats can break the accuracy-speed tradeoff, boosting both efficiency and performance in generative AI workflows.

Read more ...

SparK: Query-Aware Unstructured Sparsity with Recoverable KV Cache Channel Pruning

02 January 2026

In this blog we will discuss SparK, a training-free, plug-and-play method for KV cache compression in large language models (LLMs). By addressing the overlooked redundancy in feature channels and employing a “prune-and-recover” strategy, SparK reduces KV cache storage by over 30% compared to traditional methods while maintaining model accuracy. It offers a robust solution for long-context inference, establishing a new perspective on unstructured sparsity.

Read more ...

Týr-the-Pruner: Search-based Global Structural Pruning for LLMs

03 December 2025

This blog introduces Týr-the-Pruner, a search-based, end-to-end framework for global structural pruning of large language models (LLMs). By constructing a supernet of layer-wise pruned candidates with different sparsity levels and searching for the optimal sparsity distribution under a target overall sparsity, Týr-the-Pruner removes up to 50% of parameters while retaining ~97% of dense accuracy on Llama-3.1-70B—establishing a new state of the art among structured pruning methods. Experiments also show tangible inference speedups on AMD Instinct™ GPUs. Read the full paper and try the implementation. This work has been accepted to NeurIPS 2025.

Read more ...

Gumiho: A New Paradigm for Speculative Decoding — Earlier Tokens in a Draft Sequence Matter More

14 October 2025

Speculative decoding has emerged as a promising approach to accelerate large language model (LLM) inference, yet existing methods face a tradeoff: parallel designs achieve higher speed but lose accuracy, while serial designs gain accuracy at the cost of efficiency. In our recent paper Gumiho: A Hybrid Architecture to Prioritize Early Tokens in Speculative Decoding, we introduce a new paradigm that addresses this bottleneck by prioritizing accuracy on the earliest draft tokens, which matters most for downstream acceptance. In this blog, we will discuss the motivation behind Gumiho, the theoretical foundation showing why early-token accuracy dominates, and the novel hybrid architecture that combines serial and parallel decoding to realize these insights. Our goal is to demonstrate both the scientific contributions and practical benefits of Gumiho, showing how it delivers state-of-the-art performance on AMD GPUs using the ROCm software stack, ensuring that the method is widely accessible and optimized for real-world deployment.
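A short calculation shows why accuracy on the earliest draft tokens can dominate. The sketch below assumes a simplified model that is not taken from the paper: the draft token at position i is accepted with probability p_i, independently of the others, and a token only counts if every earlier token was also accepted. Under that assumption the expected number of accepted tokens is the sum over k of p_1 * p_2 * ... * p_k. The probabilities used are made-up illustrative values.

```python
def expected_accepted(probs):
    """Expected number of accepted draft tokens under the independence assumption."""
    total = 0.0
    running = 1.0
    for p in probs:
        running *= p       # probability that positions 1..k are all accepted
        total += running   # position k contributes only if the whole prefix survives
    return total


# Hypothetical per-position acceptance probabilities.
baseline = [0.8, 0.8, 0.8, 0.8]
boost_first = [0.9, 0.8, 0.8, 0.8]  # same improvement applied to the first token
boost_last = [0.8, 0.8, 0.8, 0.9]   # same improvement applied to the last token

e_base = expected_accepted(baseline)
e_first = expected_accepted(boost_first)
e_last = expected_accepted(boost_last)
```

Because p_1 appears in every term of the sum while the last probability appears in only one, the same improvement yields a larger gain when applied to the first position (`e_first > e_last > e_base`).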

Read more ...

Technical Dive into AMD’s MLPerf Inference v5.1 Submission

09 September 2025

In the rapidly evolving landscape of artificial intelligence, the demand for reliable and efficient model inference has never been greater. With advancements in large language models (LLMs) and a growing reliance on real-time applications, benchmarks are critical in evaluating how well AI systems perform under varying conditions. Enter MLPerf Inference: Datacenter v5.1 — a significant update to the well-respected benchmarking suite that assesses inference performance across a wide array of models and use cases, catering especially to data centers.

Read more ...

Slim Down Your Llama: Pruning & Fine-Tuning for Maximum Performance

09 September 2025

In this blog, we demonstrate how quantization, intelligent depth pruning, and supervised fine-tuning can dramatically improve the inference performance of Meta’s Llama 3.1 405B model on AMD Instinct MI355X GPUs. By applying quantization and reducing the number of layers from the original 126, we are able to decrease memory requirements and boost token throughput. Additionally, with carefully applied fine-tuning, we maintain high inference accuracy for both RougeL and Exact Match metrics on MLPerf workloads. To see how these optimizations fit into AMD’s broader MLPerf Inference v5.1 efforts, read Reproducing the AMD Instinct™ GPUs MLPerf Inference v5.1 Submission. For a detailed technical breakdown of the other optimizations, check out our Technical Dive into AMD’s MLPerf Inference v5.1 Submission.

Read more ...

Reproducing the AMD Instinct™ GPUs MLPerf Inference v5.1 Submission

09 September 2025

MLPerf Inference v5.1 marks AMD’s third round of submissions and its most ambitious yet. This round features submissions on AMD Instinct MI325X and MI355X systems, including multi-node inference and models in the MXFP4 data type. Building upon the success in MLPerf Inference v5.0, AMD has submitted improved results for Llama 2 70B and SDXL on the MI325X platform in this round using new optimization techniques. For a deeper look at these optimizations, see our Technical Dive into AMD’s MLPerf Inference v5.1 Submission. To see how we optimized Llama 3.1 405B through pruning and fine-tuning, read Slim Down Your Llama: Pruning & Fine-Tuning for Maximum Performance. AMD has also made submissions for the following workloads:

Read more ...