
Vision Language Models: Architecture, Inference, and Practical Deployment

Author: Aadit Agrawal

Vision Language Models (VLMs) combine computer vision and natural language processing into unified systems capable of reasoning about images and text together. Unlike earlier approaches that treated vision and language as separate pipelines, modern VLMs learn joint representations that enable tasks like visual question answering, image captioning, document analysis, and multimodal reasoning.

This article covers VLM architecture from the ground up: how vision encoders transform pixels into tokens, how projection layers bridge modalities, and why inference is more complex than text-only LLMs. We also survey popular architectures, quantization strategies, and practical deployment options.


What VLMs Can Do

VLMs accept images (or videos) alongside text prompts and generate text responses. Concrete capabilities include:

  • Visual question answering: “What color is the car in the background?”
  • Document understanding: Extract text, tables, and structure from PDFs, invoices, forms
  • Chart and diagram analysis: Interpret graphs, flowcharts, architectural diagrams
  • OCR and text extraction: Read text embedded in images without separate OCR pipelines
  • Object localization: Identify bounding boxes or regions corresponding to natural language queries
  • Image captioning: Generate descriptions of visual content
  • Multi-image reasoning: Compare multiple images, summarize differences
  • Video understanding: Summarize events, answer questions about temporal sequences

The key shift from earlier vision models: VLMs use the same interface as chat LLMs. You prompt them with natural language, optionally include images, and receive text back. This unified interface simplifies integration into existing LLM-based applications.


How VLMs Work: Layer by Layer

Most VLMs follow a three-component architecture:

Image
  ↓
1. Vision encoder (ViT, SigLIP, CLIP): converts image patches into feature vectors; e.g. a 224x224 input becomes 256 patch tokens.
  ↓ patch embeddings
2. Projection layer (MLP, linear, or adapter): bridges the vision and language spaces; e.g. transforms 1024-dim features into 4096-dim aligned tokens.
  ↓ aligned tokens
3. Language model (LLaMA, Qwen, Gemma): generates text from the combined vision + text context.
  ↓
Text

Let’s examine each component.

Vision Encoder

The vision encoder transforms raw pixels into a sequence of feature vectors. The dominant architecture is the Vision Transformer (ViT), introduced by Dosovitskiy et al. in 2020.

How ViT works:

  1. Patch extraction: The input image is divided into fixed-size patches (typically 14x14 or 16x16 pixels). A 224x224 image with 14x14 patches yields 256 patches.

  2. Linear embedding: Each patch is flattened and projected through a linear layer to create patch embeddings. For a 14x14x3 patch (588 values), this produces a vector of dimension D (commonly 768, 1024, or 1152).

  3. Position encoding: Since transformers have no inherent notion of spatial order, learnable position embeddings are added to each patch embedding. These encode the patch’s location in the original image grid.

  4. CLS token: A learnable classification token is prepended to the sequence. After processing, this token aggregates global image information.

  5. Transformer blocks: The sequence passes through standard transformer encoder layers (self-attention + feedforward). Each patch attends to all other patches, capturing global context.

The output is a sequence of vectors: one per patch plus the CLS token. For a 224x224 image with 14x14 patches, you get 257 vectors (256 patches + 1 CLS).
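To make the patch-embedding step concrete, here is a minimal PyTorch sketch (dimensions are illustrative; the strided-convolution trick is the standard way to implement flatten-then-project per patch):

import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """Split an image into patches, project each patch to dimension D, add CLS + positions."""
    def __init__(self, img_size=224, patch_size=14, in_chans=3, dim=1024):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2   # 16 x 16 = 256
        # A strided conv is equivalent to flattening each patch and applying a linear layer
        self.proj = nn.Conv2d(in_chans, dim, kernel_size=patch_size, stride=patch_size)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, self.num_patches + 1, dim))

    def forward(self, x):                       # x: [B, 3, 224, 224]
        x = self.proj(x)                        # [B, D, 16, 16]
        x = x.flatten(2).transpose(1, 2)        # [B, 256, D]
        cls = self.cls_token.expand(x.shape[0], -1, -1)
        x = torch.cat([cls, x], dim=1)          # [B, 257, D]
        return x + self.pos_embed               # add learned position embeddings

tokens = PatchEmbed()(torch.randn(1, 3, 224, 224))
print(tokens.shape)   # torch.Size([1, 257, 1024])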

Popular vision encoders:

Encoder          Parameters   Patch Size   Output Dim   Notes
CLIP ViT-L/14    428M         14x14        1024         Trained on 400M image-text pairs
SigLIP-So400m    400M         14x14        1152         Sigmoid loss, better small-batch training
InternViT-6B     6B           14x14        3200         Scaled up for InternVL
DaViT            88M          varies       varies       Dual attention (spatial + channel)

CLIP vs SigLIP

CLIP (Contrastive Language-Image Pre-training) trains vision and text encoders jointly using a contrastive softmax loss. Given a batch of N image-text pairs, the model learns to match correct pairs while pushing incorrect pairs apart. The softmax normalization requires computing similarities across the entire batch.

SigLIP (Sigmoid Loss for Language-Image Pre-training) replaces softmax with sigmoid loss that operates on each image-text pair independently. This removes the need for cross-batch normalization, enabling:

  • Better performance with smaller batch sizes
  • Reduced memory usage (4096 batch on 4 TPUs vs 2048 for CLIP)
  • Comparable or better accuracy on downstream tasks

SigLIP 2 (February 2025) adds self-distillation, masked prediction, and captioning objectives for improved localization and dense prediction.
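The difference between the two objectives fits in a few lines. A minimal sketch on a batch of L2-normalized image/text embeddings (temperature and bias handling simplified; SigLIP learns both during training):

import torch
import torch.nn.functional as F

def clip_loss(img, txt, t=0.07):
    # Softmax contrastive loss: each image competes against every text in the batch
    logits = img @ txt.T / t                        # [N, N] pairwise similarities
    labels = torch.arange(img.shape[0])             # correct pair is on the diagonal
    return (F.cross_entropy(logits, labels) +
            F.cross_entropy(logits.T, labels)) / 2

def siglip_loss(img, txt, t=10.0, b=-10.0):
    # Sigmoid loss: every (image, text) pair is an independent binary decision,
    # so no normalization across the batch is needed
    logits = img @ txt.T * t + b
    labels = 2 * torch.eye(img.shape[0]) - 1        # +1 on the diagonal, -1 elsewhere
    return -F.logsigmoid(labels * logits).mean()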

Projection/Adapter Layers

Vision encoders and language models operate in different representation spaces. The projection layer bridges this gap by transforming vision features into the language model’s embedding space.

Simple linear projection (LLaVA 1.0)

The original LLaVA used a single linear layer:

vision_features: [num_patches, 1024]
        ↓
Linear(1024, 4096)
        ↓
projected_features: [num_patches, 4096]

This works but limits how much the vision representation can be transformed.

MLP projection (LLaVA 1.5+)

LLaVA 1.5 switched to a two-layer MLP with GELU activation:

vision_features: [num_patches, 1024]
        ↓
Linear(1024, 4096)
        ↓
GELU
        ↓
Linear(4096, 4096)
        ↓
projected_features: [num_patches, 4096]

The additional nonlinearity allows more complex transformations and improved multimodal alignment.
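In code, the LLaVA-1.5-style projector is nothing more than this (a minimal sketch; the 1024/4096 dimensions match CLIP ViT-L features and a 7B-class LLM):

import torch.nn as nn

class MLPProjector(nn.Module):
    """Two-layer MLP that maps vision features into the LLM embedding space."""
    def __init__(self, vision_dim=1024, llm_dim=4096):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, vision_features):    # [num_patches, 1024]
        return self.net(vision_features)   # [num_patches, 4096]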

Cross-attention adapters (Flamingo-style)

Some architectures use cross-attention layers that let language model tokens attend to vision features rather than concatenating them directly. This approach:

  • Keeps vision and language computations more separate
  • Allows selective attention to relevant image regions
  • Can be more parameter-efficient

Perceiver resampler

Perceiver-based adapters use a fixed number of learnable query tokens that cross-attend to the variable-length vision features, producing a constant number of output tokens regardless of image resolution.
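A minimal sketch of the resampler idea, using a single cross-attention layer and 64 learnable query tokens (real resamplers stack several such blocks; the sizes here are illustrative):

import torch
import torch.nn as nn

class Resampler(nn.Module):
    """Compress a variable number of vision tokens into a fixed number of outputs."""
    def __init__(self, num_queries=64, dim=1024, num_heads=8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, vision_features):            # [B, num_patches, dim], any num_patches
        q = self.queries.unsqueeze(0).expand(vision_features.shape[0], -1, -1)
        out, _ = self.attn(q, vision_features, vision_features)
        return out                                  # [B, 64, dim], fixed length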

Language Model Backbone

The language model processes the projected vision tokens alongside text tokens. Any decoder-only transformer can serve this role:

  • LLaMA family (7B, 13B, 70B)
  • Vicuna (instruction-tuned LLaMA)
  • Qwen (7B, 14B, 72B)
  • Phi-3 (3.8B)
  • Gemma (2B, 7B, 27B)
  • InternLM (7B, 20B)

The vision tokens are typically inserted at the position of special <image> tokens in the input. The language model then processes the combined sequence autoregressively, attending to both vision and text tokens.

Token sequence structure:

[BOS] [system prompt tokens] [image_tok_1] [image_tok_2] ... [image_tok_N] [user prompt tokens] [assistant response]

The number of image tokens depends on the vision encoder’s output and any token compression applied.
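A hedged sketch of how the splice typically happens at the embedding level (the placeholder token id and tensor names are illustrative, not any particular model's API):

import torch

def merge_embeddings(input_ids, text_embeds, image_embeds, image_token_id):
    """Replace <image> placeholder positions with projected vision tokens.

    input_ids:    [seq_len]           token ids containing N placeholder ids
    text_embeds:  [seq_len, hidden]   embeddings from the LLM's embedding table
    image_embeds: [N, hidden]         projected vision tokens, one per placeholder
    """
    merged = text_embeds.clone()
    positions = (input_ids == image_token_id).nonzero(as_tuple=True)[0]
    merged[positions] = image_embeds.to(merged.dtype)
    return merged   # fed to the LLM as inputs_embeds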

Position Encoding for Images

Standard language models use 1D position encoding. Images require 2D spatial information. VLMs handle this in several ways:

Flattened 1D positions: Simply assign positions sequentially to flattened patch tokens. Row-major order: patch (0,0) gets position 0, patch (0,1) gets position 1, etc.

Separate 2D position embeddings in ViT: The vision encoder learns its own 2D-aware position embeddings during pretraining. These spatial relationships are then encoded into the patch representations before projection.

M-RoPE (Multimodal Rotary Position Embedding): Qwen2-VL introduced M-RoPE, which encodes position information across text, images, and videos in a unified way. For images, it captures both spatial positions and temporal positions for video frames.

Variable Visual Position Encoding (V2PE): InternVL3 uses smaller, more flexible position increments for visual tokens, improving long-context understanding.
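The contrast between flattened 1D positions and 2D grid positions is easy to see in code; schemes like M-RoPE and V2PE build on the same (row, column) indices. A minimal sketch:

import torch

def patch_positions(grid_h, grid_w):
    """Position indices for a grid_h x grid_w patch grid."""
    flat = torch.arange(grid_h * grid_w)            # 1D: row-major order, 0 .. H*W-1
    rows = flat // grid_w                           # 2D: row index of each patch
    cols = flat % grid_w                            # 2D: column index of each patch
    return flat, torch.stack([rows, cols], dim=-1)

flat, grid = patch_positions(16, 16)
print(flat[:3], grid[:3])   # tensor([0, 1, 2]) tensor([[0, 0], [0, 1], [0, 2]])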


LLaVA Family

LLaVA (Large Language and Vision Assistant) established the dominant paradigm: frozen CLIP encoder + MLP projector + instruction-tuned LLM.

LLaVA 1.0 (2023)

  • Vision: CLIP ViT-L/14 (frozen)
  • Projector: Linear layer
  • LLM: Vicuna-13B
  • Training: Two-stage (projector pretraining, then full fine-tuning)

LLaVA 1.5 (2023)

  • Projector: Two-layer MLP (significant improvement)
  • Higher resolution support: 336x336
  • Better instruction-following data

LLaVA-NeXT / LLaVA 1.6 (2024)

  • Dynamic high resolution: Up to 672x672
  • AnyRes: Handles varying aspect ratios by tiling
  • Multiple LLM backends (Vicuna, Mistral, Hermes)

Training process:

Stage 1: Projector pretraining

  • Freeze vision encoder and LLM
  • Train only the MLP projector
  • Use image-caption pairs
  • Goal: Align vision features with language embedding space

Stage 2: Full fine-tuning

  • Freeze vision encoder
  • Unfreeze projector and LLM
  • Train on visual instruction data
  • Goal: Learn to follow multimodal instructions
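The two stages above boil down to which parameters receive gradients. A hedged sketch of the freezing pattern (attribute names like vision_tower, projector, and llm are illustrative, not LLaVA's actual module names):

def set_trainable(model, stage):
    # Stage 1: train only the projector; Stage 2: also unfreeze the LLM
    for p in model.vision_tower.parameters():
        p.requires_grad = False                 # vision encoder frozen in both stages
    for p in model.projector.parameters():
        p.requires_grad = True                  # projector trained in both stages
    for p in model.llm.parameters():
        p.requires_grad = (stage == 2)          # LLM unfrozen only in stage 2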

Qwen-VL / Qwen2-VL / Qwen2.5-VL

Alibaba’s Qwen vision models introduced Naive Dynamic Resolution, processing images at their native resolution rather than forcing fixed sizes.

Key innovations:

Dynamic resolution: Images are converted to a variable number of visual tokens based on their actual dimensions. A small icon might use 256 tokens; a high-resolution document might use 4096+.
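As a rough sketch of the bookkeeping: Qwen2-VL uses 14-pixel patches and merges each 2x2 patch group into one visual token, so one token covers roughly a 28x28 pixel area (the model's exact resizing and min/max pixel rules are omitted here):

def estimate_qwen2vl_tokens(width, height, patch=14, merge=2):
    # Round each side up to a whole number of merged patches (28 px by default)
    unit = patch * merge
    grid_w = -(-width // unit)    # ceiling division
    grid_h = -(-height // unit)
    return grid_w * grid_h

print(estimate_qwen2vl_tokens(224, 224))     # 64 tokens for a small image
print(estimate_qwen2vl_tokens(1920, 1080))   # 2691 tokens for a full-HD screenshot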

M-RoPE (Multimodal RoPE): Unified position encoding for text, images, and video. Enables the model to understand:

  • Absolute text positions
  • 2D spatial positions within images
  • Temporal positions across video frames

2D RoPE in ViT: The vision encoder itself uses 2D rotary position embeddings, enabling better adaptation to varying resolutions during inference.

Model sizes: 2B, 7B, and 72B parameters.

Qwen2.5-VL improvements (2025):

  • ViT trained from scratch with native dynamic resolution
  • Window attention for reduced compute
  • Dynamic FPS sampling for video
  • Hours-long video understanding with second-level localization
  • Strong document/chart/table extraction

InternVL

OpenGVLab’s InternVL scales both vision and language components, with InternViT reaching 6B parameters.

Architecture: ViT-MLP-LLM paradigm with a pixel unshuffle operation that reduces visual tokens to 1/4 of the original count.

InternVL 2.5 / 3.0 (2025):

  • Native Multimodal Pre-Training: Interleaves vision-language data with text corpora in a single pretraining stage (rather than adapting a text-only model)
  • Variable Visual Position Encoding (V2PE)
  • Dynamic High Resolution from InternVL 1.5
  • Mixed Preference Optimization for alignment

InternVL 3.5 (August 2025):

  • Cascade Reinforcement Learning for improved reasoning
  • Visual Resolution Router (ViR): Dynamically adjusts visual token resolution
  • Decoupled Vision-Language Deployment: Separate vision and language servers for async pipelining
  • 4.05x inference speedup over InternVL3
  • State-of-the-art among open-source models

PaliGemma

Google’s PaliGemma combines SigLIP with Gemma in a straightforward architecture.

Components:

  • Vision: SigLIP-So400m (400M params)
  • Language: Gemma-2B
  • Projection: Linear layer (1152 to 2048 dimensions)

Resolution and tokens:

  • Pretrained at 224x224, 448x448, or 896x896
  • Patch size: 14
  • Token counts: 256 (224px), 1024 (448px), 4096 (896px)

PaliGemma 2 (2024):

  • Upgraded to Gemma 2
  • Sizes: 3B, 10B, 28B
  • Multiple resolution support per model

Gemma 3 with vision (2025):

  • Custom SigLIP encoder
  • Pan and Scan algorithm for varying aspect ratios
  • 896x896 fixed encoder input

Phi-3 Vision

Microsoft’s Phi-3 Vision packs vision capabilities into a 4.2B parameter model.

Architecture:

  • Image encoder
  • Connector
  • Projector
  • Phi-3 Mini language model

Specifications:

  • 4.2B total parameters
  • 128K context length
  • Inputs: Text and images

Strengths:

  • Chart, graph, and table understanding
  • Document analysis
  • Small footprint suitable for edge deployment

Phi-3.5 Vision:

  • Multi-frame capabilities (image comparison, video summarization)
  • Improved single-image benchmarks
  • Training: 6 days on 256 A100-80G GPUs with 500B tokens

CogVLM

CogVLM introduces a Visual Expert module that adds trainable parameters to the language model specifically for processing vision features.

Architecture difference: Instead of just projecting vision features into the LLM’s input space, CogVLM adds parallel QKV matrices and MLP layers for visual tokens at each transformer layer.

Standard layer:
  text_tokens ──▶ [QKV + MLP] ──▶ output

CogVLM layer:
  text_tokens  ──▶ [QKV_text + MLP_text]  ──┐
                                            ├──▶ output
  image_tokens ──▶ [QKV_vision + MLP_vision]──┘

Benefits:

  • Deep fusion of vision and language (not just early concatenation)
  • Preserves language model performance on text-only tasks
  • Visual expert parameters initialized from pretrained weights

Scale: CogVLM-17B has 10B vision parameters + 7B language parameters.
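A minimal sketch of the routing idea inside one transformer layer, shown here for the MLP expert only (module names and sizes are illustrative; the real implementation also duplicates the QKV projections and fuses the routing into the attention computation):

import torch
import torch.nn as nn

class DualExpertFFN(nn.Module):
    """Route text tokens and image tokens through separate MLP experts."""
    def __init__(self, dim=4096, hidden=11008):
        super().__init__()
        self.text_mlp = nn.Sequential(nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))
        self.vision_mlp = nn.Sequential(nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))

    def forward(self, hidden_states, is_image):     # hidden_states: [seq_len, dim], is_image: [seq_len] bool
        out = torch.empty_like(hidden_states)
        out[~is_image] = self.text_mlp(hidden_states[~is_image])
        out[is_image] = self.vision_mlp(hidden_states[is_image])
        return out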

CogVLM2 (2024):

  • Improved training recipes
  • Up to 1344x1344 input resolution

Florence-2

Microsoft’s Florence-2 takes a different approach: a unified sequence-to-sequence model for all vision tasks.

Architecture:

  • Vision encoder: DaViT (Dual Attention Vision Transformer)
  • Text processing: BART-style encoder-decoder
  • Sizes: 0.2B and 0.7B parameters

Key insight: All vision tasks (detection, segmentation, captioning, grounding) can be formulated as sequence-to-sequence problems. The model outputs text, including coordinate tokens for localization tasks.

Training data: FLD-5B dataset with 5.4B annotations across 126M images (boxes, masks, captions, grounding).

Performance: Despite small size, Florence-2 outperforms much larger models on zero-shot captioning. On COCO, the 232M model (score: 133) and 771M model (score: 135.6) both beat DeepMind’s 80B Flamingo.


Why VLM Inference is More Complex

VLM inference presents challenges beyond text-only LLMs.

Variable Image Resolutions

Text inputs have predictable token counts (roughly 4 chars per token). Images vary wildly:

  • Thumbnail: 224x224 = 256 tokens (with 14x14 patches)
  • Document: 896x896 = 4096 tokens
  • High-res photo with dynamic tiling: 10000+ tokens

Dynamic resolution models like Qwen2-VL convert resolution directly to token count. Memory and compute scale accordingly.

Token Count Variability

A batch of requests might contain:

  • Request 1: 100 text tokens, no images
  • Request 2: 50 text tokens, 256 image tokens
  • Request 3: 200 text tokens, 2048 image tokens (high-res document)

This variability complicates batching and memory allocation.

Image Preprocessing Overhead

Before the vision encoder runs, images require preprocessing:

  1. Decode: Decompress JPEG/PNG to raw pixels
  2. Resize: Scale to encoder’s expected resolution
  3. Normalize: Convert to float and apply per-channel mean/std normalization (e.g. ImageNet statistics: mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]; CLIP-style encoders use their own, similar values)
  4. Patch/tile: For dynamic resolution, split into tiles
  5. Tensor conversion: Move to GPU memory

This preprocessing happens on CPU and can bottleneck throughput when image counts are high.
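A minimal sketch of steps 1-3 and 5 with torchvision (ImageNet statistics as quoted above; the model-specific tiling of step 4 is omitted, and the target device is an assumption):

import torch
from PIL import Image
from torchvision import transforms

preprocess = transforms.Compose([
    transforms.Resize((224, 224)),                     # 2. resize to the encoder's input size
    transforms.ToTensor(),                             # 3a. uint8 HWC -> float CHW in [0, 1]
    transforms.Normalize(mean=[0.485, 0.456, 0.406],   # 3b. per-channel normalization
                         std=[0.229, 0.224, 0.225]),
])

image = Image.open("photo.jpg").convert("RGB")         # 1. decode JPEG/PNG to raw pixels
pixels = preprocess(image).unsqueeze(0).to("cuda")     # 5. [1, 3, 224, 224] moved to GPU memory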

Memory Requirements

High-resolution images consume memory at multiple stages:

  • Raw pixels: 896x896x3 = 2.4MB per image
  • Vision encoder activations: Intermediate states during forward pass
  • Projected tokens: 4096 tokens x 4096 dim x 2 bytes (fp16) = 33MB per high-res image
  • KV cache: Each image token needs KV cache entries for all subsequent generation

For a 72B model processing a high-res document, the image tokens alone can consume several GB of KV cache memory.
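Back-of-the-envelope arithmetic for the KV-cache cost of image tokens (the layer count, KV head count, head dimension, and fp16 dtype below are illustrative of a 72B-class model with grouped-query attention; models without GQA cost several times more):

def kv_cache_bytes(num_tokens, num_layers=80, num_kv_heads=8, head_dim=128, dtype_bytes=2):
    # 2x for keys and values, per layer, per token
    return num_tokens * num_layers * num_kv_heads * head_dim * 2 * dtype_bytes

gb = kv_cache_bytes(4096) / 1e9
print(f"{gb:.2f} GB for 4096 image tokens")   # ~1.34 GB under these assumptions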

Batching Challenges

Text-only continuous batching works because all tokens are homogeneous. With images:

  • Vision encoder runs separately from LLM
  • Different images in a batch may have different resolutions
  • Prefill (processing prompt + images) is much heavier than decode (generating tokens)

vLLM’s solution: Hybrid parallelism where the vision encoder uses data parallelism (each GPU processes different images) while the LLM uses tensor parallelism. This avoids synchronization overhead during vision encoding.

Prefix caching: vLLM V1 supports prefix caching for multimodal inputs, so repeated images don’t require re-encoding.


VLM Quantization

Quantization reduces model size and speeds inference by using lower-precision weights and activations.

Quantizing the Vision Encoder

Vision encoders are often left at higher precision because:

  • They’re smaller than the LLM (typically < 1B params)
  • Visual features are sensitive to quantization artifacts
  • The encoder runs once per image (not autoregressive), so speed gains are less impactful

However, for edge deployment, encoder quantization matters. Research shows:

  • INT8 quantization typically maintains quality
  • INT4 can work with careful calibration
  • The vision modality is generally less sensitive than language

Quantizing the Language Model

Standard LLM quantization techniques apply:

  • GPTQ: Post-training quantization using calibration data
  • AWQ: Activation-aware weight quantization
  • GGUF: llama.cpp’s format with various quantization levels (Q4_K_M, Q5_K_S, etc.)
  • bitsandbytes: 4-bit and 8-bit for training/inference

The language model dominates parameter count (e.g., 72B LLM vs 400M vision encoder), so LLM quantization provides most of the size/speed benefits.
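For example, loading a LLaVA checkpoint in 4-bit with bitsandbytes looks roughly like this (a sketch rather than a tuned recipe; the quantization settings shown are common defaults, and the vision tower is quantized along with the LLM unless explicitly skipped):

import torch
from transformers import AutoProcessor, BitsAndBytesConfig, LlavaForConditionalGeneration

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

model = LlavaForConditionalGeneration.from_pretrained(
    "llava-hf/llava-1.5-7b-hf",
    quantization_config=bnb_config,   # weights stored in 4-bit, compute in fp16
    device_map="auto",
)
processor = AutoProcessor.from_pretrained("llava-hf/llava-1.5-7b-hf")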

Joint Quantization Strategies

Q-VLM (NeurIPS 2024) proposes cross-layer dependency mining for VLM quantization. Rather than quantizing layer-by-layer, it considers how quantization errors propagate through the full vision-language pipeline.

Results: 2.78x memory compression, 1.44x speedup on 13B LLaVA without performance degradation.

MBQ (Modality-Balanced Quantization) (CVPR 2025) observes that language modules are more sensitive to quantization than vision modules. Their approach:

  • 4-bit quantization for vision (ViT modules)
  • 8-bit quantization for language
  • Up to 4% improvement under W3A16 and 11% under W4A8 vs uniform quantization

Quality Tradeoffs

Quantization affects different VLM capabilities differently:

  • OCR accuracy can degrade with aggressive quantization
  • Fine-grained visual details may be lost
  • Reasoning over complex diagrams is sensitive
  • General image description is more robust

Recommendation: Benchmark your specific use case. A model that’s fine for photo captioning might fail on document analysis when heavily quantized.


Practical VLM Inference

Hugging Face Transformers

The most straightforward approach for experimentation.

LLaVA example:

import torch
from transformers import AutoProcessor, LlavaForConditionalGeneration
from PIL import Image
import requests

model_id = "llava-hf/llava-1.5-7b-hf"

# Load model and processor
model = LlavaForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

# Prepare inputs
url = "https://upload.wikimedia.org/wikipedia/commons/thumb/3/3a/Cat03.jpg/1200px-Cat03.jpg"
image = Image.open(requests.get(url, stream=True).raw)

conversation = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "What animal is in this image? Describe it briefly."}
        ]
    }
]

prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)
inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)

# Generate
output = model.generate(**inputs, max_new_tokens=100, do_sample=False)
response = processor.decode(output[0], skip_special_tokens=True)
print(response)

Qwen2-VL example:

from transformers import Qwen2VLForConditionalGeneration, AutoProcessor
from PIL import Image

model = Qwen2VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2-VL-7B-Instruct",
    torch_dtype="auto",
    device_map="auto"
)
processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-7B-Instruct")

# Qwen2-VL supports dynamic resolution
image = Image.open("document.png")  # Any resolution

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": image},
            {"type": "text", "text": "Extract all text from this document."}
        ]
    }
]

text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[text], images=[image], return_tensors="pt").to(model.device)

generated_ids = model.generate(**inputs, max_new_tokens=512)
output = processor.batch_decode(generated_ids, skip_special_tokens=True)

Pipeline API (simpler):

from transformers import pipeline

pipe = pipeline("image-text-to-text", model="llava-hf/llava-1.5-7b-hf")

output = pipe(
    images="https://example.com/image.jpg",
    text="Describe this image in detail.",
    max_new_tokens=200
)

llama.cpp Multimodal

llama.cpp supports vision models via the mmproj (multimodal projector) architecture.

Installation:

# macOS
brew install llama.cpp

# Or build from source
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp && make

Download model and projector:

# Pre-quantized models available from ggml-org
# https://huggingface.co/collections/ggml-org/multimodal-ggufs

# Example: Qwen2.5-VL-7B
wget https://huggingface.co/ggml-org/Qwen2.5-VL-7B-Instruct-GGUF/resolve/main/qwen2.5-vl-7b-instruct-q4_k_m.gguf
wget https://huggingface.co/ggml-org/Qwen2.5-VL-7B-Instruct-GGUF/resolve/main/qwen2.5-vl-7b-instruct-vision.gguf

Run inference:

./llama-mtmd-cli \
    -m qwen2.5-vl-7b-instruct-q4_k_m.gguf \
    --mmproj qwen2.5-vl-7b-instruct-vision.gguf \
    -p "Describe this image:" \
    --image photo.jpg

Supported models:

  • Gemma 3 (4B, 12B, 27B)
  • Qwen 2 VL (2B, 7B)
  • Qwen 2.5 VL (3B, 7B, 32B, 72B)
  • SmolVLM variants
  • Pixtral 12B
  • InternVL 2.5
  • Mistral Small 3.1 24B

Limitations: Currently supports static images only. Video processing is not yet implemented.

vLLM

vLLM provides high-throughput VLM serving with continuous batching.

Installation:

pip install vllm

Offline inference:

from vllm import LLM, SamplingParams
from PIL import Image

llm = LLM(model="llava-hf/llava-1.5-7b-hf")

prompt = "USER: <image>\nWhat is in this image?\nASSISTANT:"
sampling_params = SamplingParams(temperature=0.0, max_tokens=256)

# Single image (vLLM expects a PIL image, not a file path)
image = Image.open("path/to/image.jpg")
outputs = llm.generate(
    {
        "prompt": prompt,
        "multi_modal_data": {"image": image}
    },
    sampling_params
)
print(outputs[0].outputs[0].text)

Server deployment:

# Start server
python -m vllm.entrypoints.openai.api_server \
    --model llava-hf/llava-1.5-7b-hf \
    --chat-template template_llava.jinja   # example template shipped in the vLLM repo

# Query via OpenAI-compatible API
curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "llava-hf/llava-1.5-7b-hf",
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": "What is in this image?"},
                    {"type": "image_url", "image_url": {"url": "https://example.com/image.jpg"}}
                ]
            }
        ]
    }'

Performance tuning:

# Use data parallelism for vision encoder
python -m vllm.entrypoints.openai.api_server \
    --model Qwen/Qwen2-VL-7B-Instruct \
    --mm-encoder-tp-mode data

# Cap the number of images accepted per prompt (the flag counts items, not tokens)
--limit-mm-per-prompt '{"image": 4}'

Supported models (partial list):

  • LLaVA 1.5, LLaVA-NeXT
  • Qwen-VL, Qwen2-VL
  • InternVL, InternVL2
  • PaliGemma
  • Phi-3 Vision
  • BLIP-2
  • Chameleon

Ollama

Ollama provides the simplest local VLM experience.

Installation:

# macOS/Linux
curl -fsSL https://ollama.com/install.sh | sh

# Or download from ollama.com

Run vision models:

# Pull a vision model
ollama pull llava

# Interactive mode
ollama run llava
>>> [paste image path or URL]
>>> What's in this image?

# Or with llava-phi3 (smaller, faster)
ollama pull llava-phi3
ollama run llava-phi3

API usage:

import ollama

response = ollama.chat(
    model='llava',
    messages=[{
        'role': 'user',
        'content': 'Describe this image',
        'images': ['./photo.jpg']
    }]
)
print(response['message']['content'])

Available vision models on Ollama:

  • llava (7B and 13B variants)
  • llava-phi3 (3.8B, faster)
  • bakllava
  • moondream (smaller, 1.8B)

Code Examples for Common Tasks

Document OCR

from transformers import Qwen2VLForConditionalGeneration, AutoProcessor
from PIL import Image

model = Qwen2VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2-VL-7B-Instruct",
    torch_dtype="auto",
    device_map="auto"
)
processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-7B-Instruct")

# Use high resolution for documents
image = Image.open("invoice.png")

messages = [{
    "role": "user",
    "content": [
        {"type": "image", "image": image},
        {"type": "text", "text": "Extract all text from this invoice. Format as structured data."}
    ]
}]

inputs = processor(
    text=processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True),
    images=[image],
    return_tensors="pt"
).to(model.device)

output = model.generate(**inputs, max_new_tokens=1024)
print(processor.decode(output[0], skip_special_tokens=True))

Multi-Image Comparison

from transformers import LlavaNextProcessor, LlavaNextForConditionalGeneration
import torch
from PIL import Image

model = LlavaNextForConditionalGeneration.from_pretrained(
    "llava-hf/llava-v1.6-mistral-7b-hf",
    torch_dtype=torch.float16,
    device_map="auto"
)
processor = LlavaNextProcessor.from_pretrained("llava-hf/llava-v1.6-mistral-7b-hf")

image1 = Image.open("product_v1.jpg")
image2 = Image.open("product_v2.jpg")

prompt = "[INST] <image>\n<image>\nCompare these two product images. What are the differences? [/INST]"

inputs = processor(images=[image1, image2], text=prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=300)
print(processor.decode(output[0], skip_special_tokens=True))

Chart Analysis

# Using Florence-2 for detailed chart understanding
from transformers import AutoProcessor, AutoModelForCausalLM
from PIL import Image

model = AutoModelForCausalLM.from_pretrained(
    "microsoft/Florence-2-large",
    torch_dtype="auto",
    trust_remote_code=True
).to("cuda")
processor = AutoProcessor.from_pretrained("microsoft/Florence-2-large", trust_remote_code=True)

image = Image.open("sales_chart.png")

# Florence-2 uses task-specific prompts
prompt = "<MORE_DETAILED_CAPTION>"

inputs = processor(text=prompt, images=image, return_tensors="pt").to("cuda")
generated_ids = model.generate(
    input_ids=inputs["input_ids"],
    pixel_values=inputs["pixel_values"],
    max_new_tokens=512,
    num_beams=3
)
result = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(result)

Batch Processing with vLLM

from vllm import LLM, SamplingParams
from PIL import Image
import os

llm = LLM(
    model="Qwen/Qwen2-VL-7B-Instruct",
    limit_mm_per_prompt={"image": 1}
)

sampling_params = SamplingParams(temperature=0.0, max_tokens=256)

# Qwen2-VL expects its ChatML prompt format with vision placeholder tokens
prompt = (
    "<|im_start|>user\n<|vision_start|><|image_pad|><|vision_end|>"
    "Describe this image briefly.<|im_end|>\n<|im_start|>assistant\n"
)

# Prepare batch of requests (vLLM expects PIL images, not file paths)
image_dir = "./images"
requests = []
for img_file in os.listdir(image_dir):
    if img_file.endswith(('.jpg', '.png')):
        requests.append({
            "prompt": prompt,
            "multi_modal_data": {"image": Image.open(os.path.join(image_dir, img_file))}
        })

# Process batch
outputs = llm.generate(requests, sampling_params)

for i, output in enumerate(outputs):
    print(f"Image {i}: {output.outputs[0].text}")

References

Papers

  • Dosovitskiy et al. “An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale” (ViT, 2020)
  • Radford et al. “Learning Transferable Visual Models From Natural Language Supervision” (CLIP, 2021)
  • Zhai et al. “Sigmoid Loss for Language Image Pre-Training” (SigLIP, ICCV 2023) - https://arxiv.org/abs/2303.15343
  • Liu et al. “Visual Instruction Tuning” (LLaVA, NeurIPS 2023)
  • Wang et al. “Qwen2-VL: Enhancing Vision-Language Model’s Perception of the World at Any Resolution” (2024) - https://arxiv.org/abs/2409.12191
  • Qwen Team. “Qwen2.5-VL Technical Report” (2025) - https://arxiv.org/abs/2502.13923
  • Chen et al. “InternVL: Scaling up Vision Foundation Models” (CVPR 2024)
  • OpenGVLab. “InternVL3” (2025) - https://internvl.github.io/blog/2025-04-11-InternVL-3.0/
  • Wang et al. “CogVLM: Visual Expert for Pretrained Language Models” (NeurIPS 2024) - https://arxiv.org/abs/2311.03079
  • Xiao et al. “Florence-2: Advancing a Unified Representation for a Variety of Vision Tasks” (2024)
  • Shao et al. “Q-VLM: Post-training Quantization for Large Vision-Language Models” (NeurIPS 2024) - https://arxiv.org/abs/2410.08119
  • Li et al. “MBQ: Modality-Balanced Quantization for Large Vision-Language Models” (CVPR 2025)
  • Vasu et al. “FastVLM: Efficient Vision Encoding for Vision Language Models” (CVPR 2025)
