Posts by Matthias Reso

Chain-of-Thought Guided Visual Reasoning Using Llama 3.2 on a Single AMD Instinct MI300X GPU

In this post, we show you how to fine-tune the Llama 3.2 Vision Instruct models, specifically the 11B and 90B parameter variants, on a synthetic multi-modal dataset using torchtune. The post focuses on chain-of-thought (CoT) guided visual reasoning, a technique in which the model is encouraged to articulate intermediate reasoning steps before arriving at a final answer. Incorporating CoT is intended to improve the model's interpretability and accuracy on tasks that require multi-step understanding of visual inputs. Leveraging the high-bandwidth memory (HBM) of the AMD Instinct™ MI300X GPU, provided by TensorWave, we enhance the model's visual understanding, particularly its ability to interpret charts, all on a single GPU. Our evaluation shows that the fine-tuned 11B parameter model achieves 2.3x higher accuracy than a 90B parameter model. The post walks you through dataset preparation, model configuration, training recipes, and evaluation, all optimized to run on a single GPU.
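To make the idea of CoT-guided visual reasoning concrete before diving into the full post, the sketch below shows how a chart question can be prompted so the model verbalizes its intermediate steps. Note that this is not the post's torchtune fine-tuning recipe: it is a minimal inference-time illustration using the Hugging Face transformers API, and the checkpoint ID, image path, and prompt text are assumptions for the example.

```python
# Minimal sketch (assumed setup): CoT-guided chart question answering with
# Llama 3.2 11B Vision Instruct via Hugging Face transformers.
# The blog post itself fine-tunes with torchtune; this only illustrates the
# step-by-step prompting pattern on a chart image.
import torch
from PIL import Image
from transformers import AutoProcessor, MllamaForConditionalGeneration

model_id = "meta-llama/Llama-3.2-11B-Vision-Instruct"  # assumed checkpoint
model = MllamaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

image = Image.open("chart.png")  # placeholder chart image

# Ask the model to reason step by step before giving its final answer.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {
                "type": "text",
                "text": (
                    "Which category grew the most between 2020 and 2023? "
                    "Think step by step: read the axes and legend, compare "
                    "the values, then state the final answer."
                ),
            },
        ],
    }
]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(image, prompt, add_special_tokens=False,
                   return_tensors="pt").to(model.device)

output = model.generate(**inputs, max_new_tokens=256)
print(processor.decode(output[0], skip_special_tokens=True))
```

The same "reason first, answer last" structure is what the CoT-annotated training data encourages the fine-tuned model to produce on its own.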

Read more ...