Posts tagged Multimodal

Athena-PRM: Enhancing Multimodal Reasoning with Data-Efficient Process Reward Models

This blog introduces Athena-PRM, a multimodal Process Reward Model (PRM) that assigns a reward score to each intermediate step when solving complex reasoning problems. To efficiently generate high-quality process-labeled data, we leverage prediction consistency between weak and strong completers as a criterion for identifying reliable process labels. We also develop two effective strategies to improve the performance of PRMs: ORM initialization and up-sampling of negative data.
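
Below is a minimal sketch (not the released Athena-PRM code) of the weak/strong-completer consistency idea described above: a step label is kept only when rollouts from a weak and a strong completer agree on whether the step leads to the correct answer. The helper names and the substring answer check are illustrative assumptions.

```python
# Illustrative sketch of consistency-filtered process labeling.
from typing import Callable, List, Optional

def rollout_correct_rate(completer: Callable[[str, int], List[str]],
                         prefix: str, answer: str, n: int = 8) -> float:
    """Fraction of n sampled completions from `prefix` that contain the
    reference answer (simplified correctness check)."""
    completions = completer(prefix, n)          # hypothetical sampling API
    return sum(answer in c for c in completions) / n

def label_step(weak: Callable[[str, int], List[str]],
               strong: Callable[[str, int], List[str]],
               prefix: str, answer: str,
               threshold: float = 0.5) -> Optional[int]:
    """Return 1 (good step), 0 (bad step), or None (inconsistent -> discard)."""
    weak_ok = rollout_correct_rate(weak, prefix, answer) >= threshold
    strong_ok = rollout_correct_rate(strong, prefix, answer) >= threshold
    if weak_ok == strong_ok:                    # prediction consistency
        return int(weak_ok)
    return None                                 # disagreement: unreliable label
```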

Read more ...


Accelerating Multimodal Inference in vLLM: The One-Line Optimization for Large Multimodal Models

Deploying multimodal models like Qwen3-VL or InternVL at scale reveals a hidden bottleneck. While Tensor Parallelism (TP) is essential for massive language decoders, it is often overkill for vision encoders. These encoders are typically small, often just 1-5% of total model size, so there is limited compute benefit from sharding them. However, they still incur expensive all-reduce communication costs after every single layer.
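The one-line change is presumably to stop tensor-parallel-sharding the small vision encoder and run it replicated/data-parallel, keeping TP only for the language decoder. A hedged sketch of what that could look like in vLLM; the `mm_encoder_tp_mode` argument and the example checkpoint are assumptions and may differ by vLLM version:

```python
# Hedged sketch: keep TP for the decoder, avoid sharding the vision encoder.
# NOTE: `mm_encoder_tp_mode` is assumed here; check your vLLM version for the
# exact argument name and supported values.
from vllm import LLM

llm = LLM(
    model="Qwen/Qwen2.5-VL-7B-Instruct",   # example multimodal checkpoint
    tensor_parallel_size=4,                # shard the large language decoder
    mm_encoder_tp_mode="data",             # the "one line": replicate the small
                                           # vision encoder, drop its per-layer
                                           # all-reduce communication
)
```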

Read more ...


Introducing AMD EVLM: Efficient Vision-Language Models with Parameter-Space Visual Conditioning

This blog introduces a novel and computationally efficient paradigm for Vision-Language Models (VLMs), which diverges from the conventional method of prepending visual tokens to textual input. Instead of elongating the input sequence, this approach injects visual information directly into the Large Language Model’s (LLM) parameters. It achieves this by using a vision encoder to extract image features and then employing a perceptual weight generator to transform these features into dynamic, low-rank adapter weights. These weights are temporarily integrated with the LLM’s parameters, effectively conditioning the model on the image without increasing the input length. This mechanism allows the model to achieve performance comparable to traditional VLMs on standard benchmarks while significantly reducing computational costs during inference.
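
To make the mechanism concrete, here is a minimal PyTorch sketch (illustrative, not the released EVLM code) of parameter-space visual conditioning: pooled image features are mapped by a weight generator to a low-rank weight delta that is temporarily applied to an LLM linear layer. Module names, dimensions, and the single-layer scope are assumptions.

```python
# Illustrative parameter-space visual conditioning with low-rank deltas.
import torch
import torch.nn as nn

class PerceptualWeightGenerator(nn.Module):
    def __init__(self, vis_dim: int, hidden_dim: int, rank: int = 8):
        super().__init__()
        self.to_a = nn.Linear(vis_dim, rank * hidden_dim)   # generates A
        self.to_b = nn.Linear(vis_dim, hidden_dim * rank)   # generates B
        self.rank, self.hidden_dim = rank, hidden_dim

    def forward(self, img_feat: torch.Tensor) -> torch.Tensor:
        # img_feat: (vis_dim,) pooled embedding from the vision encoder
        a = self.to_a(img_feat).view(self.rank, self.hidden_dim)
        b = self.to_b(img_feat).view(self.hidden_dim, self.rank)
        return b @ a  # (hidden_dim, hidden_dim) low-rank weight delta

def conditioned_forward(layer: nn.Linear, delta: torch.Tensor,
                        x: torch.Tensor) -> torch.Tensor:
    # Apply the image-conditioned weights without permanently modifying the LLM
    # and without adding any visual tokens to the input sequence.
    return nn.functional.linear(x, layer.weight + delta, layer.bias)
```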

Read more ...


Instella-VL-1B: First AMD Vision Language Model

As part of AMD’s newly released Instella family, we are thrilled to introduce Instella-VL-1B, the first AMD vision language model for image understanding trained on AMD Instinct™ MI300X GPUs. Instella-VL builds upon our previous 1-billion-parameter language model, AMD OLMo SFT. We extend the language model’s visual understanding abilities by connecting it with a vision encoder (initialized from CLIP ViT-L/14-336). During training, we jointly finetune the vision encoder and the language model on vision-language data in three stages: Alignment, Pretraining, and Supervised Finetuning (SFT).
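
The connector pattern described above can be sketched as follows: CLIP ViT-L/14-336 patch features are projected into the language model's embedding space before being fed to the decoder alongside text embeddings. This is a hedged illustration only; the projector architecture and the 2048-dim LLM hidden size are assumptions.

```python
# Illustrative vision-encoder-to-LLM connector (not the Instella-VL code).
import torch
import torch.nn as nn
from transformers import CLIPVisionModel

vision = CLIPVisionModel.from_pretrained("openai/clip-vit-large-patch14-336")
projector = nn.Sequential(               # simple 2-layer MLP projector (assumed)
    nn.Linear(vision.config.hidden_size, 2048),
    nn.GELU(),
    nn.Linear(2048, 2048),               # 2048 = LLM hidden size (assumed)
)

def encode_image(pixel_values: torch.Tensor) -> torch.Tensor:
    # pixel_values: (batch, 3, 336, 336) preprocessed images
    patch_feats = vision(pixel_values).last_hidden_state[:, 1:]  # drop CLS token
    return projector(patch_feats)        # (batch, num_patches, llm_hidden)
```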

Read more ...


Multimodal (Visual and Language) understanding with LLaVA-NeXT

26 Apr 2024

Read more ...


Interacting with Contrastive Language-Image Pre-Training (CLIP) model on AMD GPU

16 Apr 2024

Read more ...