
Vision Language Models: Architecture, Inference, and Practical Deployment

Author: Aadit Agrawal

Vision Language Models (VLMs) combine computer vision and natural language processing into unified systems capable of reasoning about images and text together. Unlike earlier approaches that treated vision and language as separate pipelines, modern VLMs learn joint representations that enable tasks like visual question answering, image captioning, document analysis, and multimodal reasoning.

This article covers VLM architecture from the ground up: how vision encoders transform pixels into tokens, how projection layers bridge modalities, and why inference is more complex than text-only LLMs. We also survey popular architectures, quantization strategies, and practical deployment options.


What VLMs Can Do

VLMs accept images (or videos) alongside text prompts and generate text responses. Concrete capabilities include:

  • Visual question answering: “What color is the car in the background?”
  • Document understanding: Extract text, tables, and structure from PDFs, invoices, forms
  • Chart and diagram analysis: Interpret graphs, flowcharts, architectural diagrams
  • OCR and text extraction: Read text embedded in images without separate OCR pipelines
  • Object localization: Identify bounding boxes or regions corresponding to natural language queries
  • Image captioning: Generate descriptions of visual content
  • Multi-image reasoning: Compare multiple images, summarize differences
  • Video understanding: Summarize events, answer questions about temporal sequences

The key shift from earlier vision models: VLMs use the same interface as chat LLMs. You prompt them with natural language, optionally include images, and receive text back. This unified interface simplifies integration into existing LLM-based applications.


How VLMs Work: Layer by Layer

Most VLMs follow a three-component architecture:

Image
  ↓
1. Vision encoder (ViT, SigLIP, CLIP): converts image patches into feature vectors; e.g. a 224x224 input becomes 256 patch tokens.
  ↓ patch embeddings
2. Projection layer (MLP, linear, or adapter): bridges the vision and language spaces; e.g. transforms 1024-dim features into 4096-dim aligned tokens.
  ↓ aligned tokens
3. Language model (LLaMA, Qwen, Gemma): generates text from the combined vision + text context.
  ↓
Text

Let’s examine each component.

Vision Encoder

The vision encoder transforms raw pixels into a sequence of feature vectors. The dominant architecture is the Vision Transformer (ViT), introduced by Dosovitskiy et al. in 2020.

How ViT works:

  1. Patch extraction: The input image is divided into fixed-size patches (typically 14x14 or 16x16 pixels). A 224x224 image with 14x14 patches yields 256 patches.

  2. Linear embedding: Each patch is flattened and projected through a linear layer to create patch embeddings. For a 14x14x3 patch (588 values), this produces a vector of dimension D (commonly 768, 1024, or 1152).

  3. Position encoding: Since transformers have no inherent notion of spatial order, learnable position embeddings are added to each patch embedding. These encode the patch’s location in the original image grid.

  4. CLS token: A learnable classification token is prepended to the sequence. After processing, this token aggregates global image information.

  5. Transformer blocks: The sequence passes through standard transformer encoder layers (self-attention + feedforward). Each patch attends to all other patches, capturing global context.

The output is a sequence of vectors: one per patch plus the CLS token. For a 224x224 image with 14x14 patches, you get 257 vectors (256 patches + 1 CLS).
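To make the patch-embedding step concrete, here is a minimal PyTorch sketch (dimensions are illustrative; the strided-convolution trick is the standard way to implement flatten-then-project per patch):

import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """Split an image into patches, project each patch to dimension D, add CLS + positions."""
    def __init__(self, img_size=224, patch_size=14, in_chans=3, dim=1024):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2   # 16 x 16 = 256
        # A strided conv is equivalent to flattening each patch and applying a linear layer
        self.proj = nn.Conv2d(in_chans, dim, kernel_size=patch_size, stride=patch_size)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, self.num_patches + 1, dim))

    def forward(self, x):                       # x: [B, 3, 224, 224]
        x = self.proj(x)                        # [B, D, 16, 16]
        x = x.flatten(2).transpose(1, 2)        # [B, 256, D]
        cls = self.cls_token.expand(x.shape[0], -1, -1)
        x = torch.cat([cls, x], dim=1)          # [B, 257, D]
        return x + self.pos_embed               # add learned position embeddings

tokens = PatchEmbed()(torch.randn(1, 3, 224, 224))
print(tokens.shape)   # torch.Size([1, 257, 1024])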

Popular vision encoders:

Encoder          Parameters   Patch Size   Output Dim   Notes
CLIP ViT-L/14    428M         14x14        1024         Trained on 400M image-text pairs
SigLIP-So400m    400M         14x14        1152         Sigmoid loss, better small-batch training
InternViT-6B     6B           14x14        3200         Scaled up for InternVL
DaViT            88M          varies       varies       Dual attention (spatial + channel)

CLIP vs SigLIP

CLIP (Contrastive Language-Image Pre-training) trains vision and text encoders jointly using a contrastive softmax loss. Given a batch of N image-text pairs, the model learns to match correct pairs while pushing incorrect pairs apart. The softmax normalization requires computing similarities across the entire batch.

SigLIP (Sigmoid Loss for Language-Image Pre-training) replaces softmax with sigmoid loss that operates on each image-text pair independently. This removes the need for cross-batch normalization, enabling:

  • Better performance with smaller batch sizes
  • Reduced memory usage (4096 batch on 4 TPUs vs 2048 for CLIP)
  • Comparable or better accuracy on downstream tasks

SigLIP 2 (February 2025) adds self-distillation, masked prediction, and captioning objectives for improved localization and dense prediction.
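The difference between the two objectives fits in a few lines. A minimal sketch on a batch of L2-normalized image/text embeddings (temperature and bias handling simplified; SigLIP learns both during training):

import torch
import torch.nn.functional as F

def clip_loss(img, txt, t=0.07):
    # Softmax contrastive loss: each image competes against every text in the batch
    logits = img @ txt.T / t                        # [N, N] pairwise similarities
    labels = torch.arange(img.shape[0])             # correct pair is on the diagonal
    return (F.cross_entropy(logits, labels) +
            F.cross_entropy(logits.T, labels)) / 2

def siglip_loss(img, txt, t=10.0, b=-10.0):
    # Sigmoid loss: every (image, text) pair is an independent binary decision,
    # so no normalization across the batch is needed
    logits = img @ txt.T * t + b
    labels = 2 * torch.eye(img.shape[0]) - 1        # +1 on the diagonal, -1 elsewhere
    return -F.logsigmoid(labels * logits).mean()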

Projection/Adapter Layers

Vision encoders and language models operate in different representation spaces. The projection layer bridges this gap by transforming vision features into the language model’s embedding space.

Simple linear projection (LLaVA 1.0)

The original LLaVA used a single linear layer:

vision_features: [num_patches, 1024]
        ↓
Linear(1024, 4096)
        ↓
projected_features: [num_patches, 4096]

This works but limits how much the vision representation can be transformed.

MLP projection (LLaVA 1.5+)

LLaVA 1.5 switched to a two-layer MLP with GELU activation:

vision_features: [num_patches, 1024]
        ↓
Linear(1024, 4096)
        ↓
GELU
        ↓
Linear(4096, 4096)
        ↓
projected_features: [num_patches, 4096]

The additional nonlinearity allows more complex transformations and improved multimodal alignment.
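In code, the LLaVA-1.5-style projector is nothing more than this (a minimal sketch; the 1024/4096 dimensions match CLIP ViT-L features and a 7B-class LLM):

import torch.nn as nn

class MLPProjector(nn.Module):
    """Two-layer MLP that maps vision features into the LLM embedding space."""
    def __init__(self, vision_dim=1024, llm_dim=4096):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, vision_features):    # [num_patches, 1024]
        return self.net(vision_features)   # [num_patches, 4096]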

Cross-attention adapters (Flamingo-style)

Some architectures use cross-attention layers that let language model tokens attend to vision features rather than concatenating them directly. This approach:

  • Keeps vision and language computations more separate
  • Allows selective attention to relevant image regions
  • Can be more parameter-efficient

Perceiver resampler

Perceiver-based adapters use a fixed number of learnable query tokens that cross-attend to the variable-length vision features, producing a constant number of output tokens regardless of image resolution.
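A minimal sketch of the resampler idea, using a single cross-attention layer and 64 learnable query tokens (real resamplers stack several such blocks; the sizes here are illustrative):

import torch
import torch.nn as nn

class Resampler(nn.Module):
    """Compress a variable number of vision tokens into a fixed number of outputs."""
    def __init__(self, num_queries=64, dim=1024, num_heads=8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, vision_features):            # [B, num_patches, dim], any num_patches
        q = self.queries.unsqueeze(0).expand(vision_features.shape[0], -1, -1)
        out, _ = self.attn(q, vision_features, vision_features)
        return out                                  # [B, 64, dim], fixed length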

Language Model Backbone

The language model processes the projected vision tokens alongside text tokens. Any decoder-only transformer can serve this role:

  • LLaMA family (7B, 13B, 70B)
  • Vicuna (instruction-tuned LLaMA)
  • Qwen (7B, 14B, 72B)
  • Phi-3 (3.8B)
  • Gemma (2B, 7B, 27B)
  • InternLM (7B, 20B)

The vision tokens are typically inserted at the position of special <image> tokens in the input. The language model then processes the combined sequence autoregressively, attending to both vision and text tokens.

Token sequence structure:

[BOS] [system prompt tokens] [image_tok_1] [image_tok_2] ... [image_tok_N] [user prompt tokens] [assistant response]

The number of image tokens depends on the vision encoder’s output and any token compression applied.
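A hedged sketch of how the splice typically happens at the embedding level (the placeholder token id and tensor names are illustrative, not any particular model's API):

import torch

def merge_embeddings(input_ids, text_embeds, image_embeds, image_token_id):
    """Replace <image> placeholder positions with projected vision tokens.

    input_ids:    [seq_len]           token ids containing N placeholder ids
    text_embeds:  [seq_len, hidden]   embeddings from the LLM's embedding table
    image_embeds: [N, hidden]         projected vision tokens, one per placeholder
    """
    merged = text_embeds.clone()
    positions = (input_ids == image_token_id).nonzero(as_tuple=True)[0]
    merged[positions] = image_embeds.to(merged.dtype)
    return merged   # fed to the LLM as inputs_embeds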

Position Encoding for Images

Standard language models use 1D position encoding. Images require 2D spatial information. VLMs handle this in several ways:

Flattened 1D positions: Simply assign positions sequentially to flattened patch tokens. Row-major order: patch (0,0) gets position 0, patch (0,1) gets position 1, etc.

Separate 2D position embeddings in ViT: The vision encoder learns its own 2D-aware position embeddings during pretraining. These spatial relationships are then encoded into the patch representations before projection.

M-RoPE (Multimodal Rotary Position Embedding): Qwen2-VL introduced M-RoPE, which encodes position information across text, images, and videos in a unified way. For images, it captures both spatial positions and temporal positions for video frames.

Variable Visual Position Encoding (V2PE): InternVL3 uses smaller, more flexible position increments for visual tokens, improving long-context understanding.
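The contrast between flattened 1D positions and 2D grid positions is easy to see in code; schemes like M-RoPE and V2PE build on the same (row, column) indices. A minimal sketch:

import torch

def patch_positions(grid_h, grid_w):
    """Position indices for a grid_h x grid_w patch grid."""
    flat = torch.arange(grid_h * grid_w)            # 1D: row-major order, 0 .. H*W-1
    rows = flat // grid_w                           # 2D: row index of each patch
    cols = flat % grid_w                            # 2D: column index of each patch
    return flat, torch.stack([rows, cols], dim=-1)

flat, grid = patch_positions(16, 16)
print(flat[:3], grid[:3])   # tensor([0, 1, 2]) tensor([[0, 0], [0, 1], [0, 2]])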


LLaVA Family

LLaVA (Large Language and Vision Assistant) established the dominant paradigm: frozen CLIP encoder + MLP projector + instruction-tuned LLM.

LLaVA 1.0 (2023)

  • Vision: CLIP ViT-L/14 (frozen)
  • Projector: Linear layer
  • LLM: Vicuna-13B
  • Training: Two-stage (projector pretraining, then full fine-tuning)

LLaVA 1.5 (2023)

  • Projector: Two-layer MLP (significant improvement)
  • Higher resolution support: 336x336
  • Better instruction-following data

LLaVA-NeXT / LLaVA 1.6 (2024)

  • Dynamic high resolution: Up to 672x672
  • AnyRes: Handles varying aspect ratios by tiling
  • Multiple LLM backends (Vicuna, Mistral, Hermes)

Training process:

Stage 1: Projector pretraining

  • Freeze vision encoder and LLM
  • Train only the MLP projector
  • Use image-caption pairs
  • Goal: Align vision features with language embedding space

Stage 2: Full fine-tuning

  • Freeze vision encoder
  • Unfreeze projector and LLM
  • Train on visual instruction data
  • Goal: Learn to follow multimodal instructions
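The two stages above boil down to which parameters receive gradients. A hedged sketch of the freezing pattern (attribute names like vision_tower, projector, and llm are illustrative, not LLaVA's actual module names):

def set_trainable(model, stage):
    # Stage 1: train only the projector; Stage 2: also unfreeze the LLM
    for p in model.vision_tower.parameters():
        p.requires_grad = False                 # vision encoder frozen in both stages
    for p in model.projector.parameters():
        p.requires_grad = True                  # projector trained in both stages
    for p in model.llm.parameters():
        p.requires_grad = (stage == 2)          # LLM unfrozen only in stage 2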

Qwen-VL / Qwen2-VL / Qwen2.5-VL

Alibaba’s Qwen vision models introduced Naive Dynamic Resolution, processing images at their native resolution rather than forcing fixed sizes.

Key innovations:

Dynamic resolution: Images are converted to a variable number of visual tokens based on their actual dimensions. A small icon might use 256 tokens; a high-resolution document might use 4096+.
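As a rough sketch of the bookkeeping: Qwen2-VL uses 14-pixel patches and merges each 2x2 patch group into one visual token, so one token covers roughly a 28x28 pixel area (the model's exact resizing and min/max pixel rules are omitted here):

def estimate_qwen2vl_tokens(width, height, patch=14, merge=2):
    # Round each side up to a whole number of merged patches (28 px by default)
    unit = patch * merge
    grid_w = -(-width // unit)    # ceiling division
    grid_h = -(-height // unit)
    return grid_w * grid_h

print(estimate_qwen2vl_tokens(224, 224))     # 64 tokens for a small image
print(estimate_qwen2vl_tokens(1920, 1080))   # 2691 tokens for a full-HD screenshot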

M-RoPE (Multimodal RoPE): Unified position encoding for text, images, and video. Enables the model to understand:

  • Absolute text positions
  • 2D spatial positions within images
  • Temporal positions across video frames

2D RoPE in ViT: The vision encoder itself uses 2D rotary position embeddings, enabling better adaptation to varying resolutions during inference.

Model sizes: 2B, 7B, and 72B parameters.

Qwen2.5-VL improvements (2025):

  • ViT trained from scratch with native dynamic resolution
  • Window attention for reduced compute
  • Dynamic FPS sampling for video
  • Hours-long video understanding with second-level localization
  • Strong document/chart/table extraction

InternVL

OpenGVLab’s InternVL scales both vision and language components, with InternViT reaching 6B parameters.

Architecture: ViT-MLP-LLM paradigm with a pixel unshuffle operation that reduces visual tokens to 1/4 of the original count.

InternVL 2.5 / 3.0 (2025):

  • Native Multimodal Pre-Training: Interleaves vision-language data with text corpora in a single pretraining stage (rather than adapting a text-only model)
  • Variable Visual Position Encoding (V2PE)
  • Dynamic High Resolution from InternVL 1.5
  • Mixed Preference Optimization for alignment

InternVL 3.5 (August 2025):

  • Cascade Reinforcement Learning for improved reasoning
  • Visual Resolution Router (ViR): Dynamically adjusts visual token resolution
  • Decoupled Vision-Language Deployment: Separate vision and language servers for async pipelining
  • 4.05x inference speedup over InternVL3
  • State-of-the-art among open-source models

PaliGemma

Google’s PaliGemma combines SigLIP with Gemma in a straightforward architecture.

Components:

  • Vision: SigLIP-So400m (400M params)
  • Language: Gemma-2B
  • Projection: Linear layer (1152 to 2048 dimensions)

Resolution and tokens:

  • Pretrained at 224x224, 448x448, or 896x896
  • Patch size: 14
  • Token counts: 256 (224px), 1024 (448px), 4096 (896px)

PaliGemma 2 (2024):

  • Upgraded to Gemma 2
  • Sizes: 3B, 10B, 28B
  • Multiple resolution support per model

Gemma 3 with vision (2025):

  • Custom SigLIP encoder
  • Pan and Scan algorithm for varying aspect ratios
  • 896x896 fixed encoder input

Phi-3 Vision

Microsoft’s Phi-3 Vision packs vision capabilities into a 4.2B parameter model.

Architecture:

  • Image encoder
  • Connector
  • Projector
  • Phi-3 Mini language model

Specifications:

  • 4.2B total parameters
  • 128K context length
  • Inputs: Text and images

Strengths:

  • Chart, graph, and table understanding
  • Document analysis
  • Small footprint suitable for edge deployment

Phi-3.5 Vision:

  • Multi-frame capabilities (image comparison, video summarization)
  • Improved single-image benchmarks
  • Training: 6 days on 256 A100-80G GPUs with 500B tokens

CogVLM

CogVLM introduces a Visual Expert module that adds trainable parameters to the language model specifically for processing vision features.

Architecture difference: Instead of just projecting vision features into the LLM’s input space, CogVLM adds parallel QKV matrices and MLP layers for visual tokens at each transformer layer.

Standard layer:
  text_tokens ──▶ [QKV + MLP] ──▶ output

CogVLM layer:
  text_tokens  ──▶ [QKV_text + MLP_text]  ──┐
                                            ├──▶ output
  image_tokens ──▶ [QKV_vision + MLP_vision]──┘

Benefits:

  • Deep fusion of vision and language (not just early concatenation)
  • Preserves language model performance on text-only tasks
  • Visual expert parameters initialized from pretrained weights

Scale: CogVLM-17B has 10B vision parameters + 7B language parameters.
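A minimal sketch of the routing idea inside one transformer layer, shown here for the MLP expert only (module names and sizes are illustrative; the real implementation also duplicates the QKV projections and fuses the routing into the attention computation):

import torch
import torch.nn as nn

class DualExpertFFN(nn.Module):
    """Route text tokens and image tokens through separate MLP experts."""
    def __init__(self, dim=4096, hidden=11008):
        super().__init__()
        self.text_mlp = nn.Sequential(nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))
        self.vision_mlp = nn.Sequential(nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))

    def forward(self, hidden_states, is_image):     # hidden_states: [seq_len, dim], is_image: [seq_len] bool
        out = torch.empty_like(hidden_states)
        out[~is_image] = self.text_mlp(hidden_states[~is_image])
        out[is_image] = self.vision_mlp(hidden_states[is_image])
        return out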

CogVLM2 (2024):

  • Improved training recipes
  • Up to 1344x1344 input resolution

Florence-2

Microsoft’s Florence-2 takes a different approach: a unified sequence-to-sequence model for all vision tasks.

Architecture:

  • Vision encoder: DaViT (Dual Attention Vision Transformer)
  • Text processing: BART-style encoder-decoder
  • Sizes: 0.2B and 0.7B parameters

Key insight: All vision tasks (detection, segmentation, captioning, grounding) can be formulated as sequence-to-sequence problems. The model outputs text, including coordinate tokens for localization tasks.

Training data: FLD-5B dataset with 5.4B annotations across 126M images (boxes, masks, captions, grounding).

Performance: Despite small size, Florence-2 outperforms much larger models on zero-shot captioning. On COCO, the 232M model (score: 133) and 771M model (score: 135.6) both beat DeepMind’s 80B Flamingo.


Why VLM Inference is More Complex

VLM inference presents challenges beyond text-only LLMs.

Variable Image Resolutions

Text inputs have predictable token counts (roughly 4 chars per token). Images vary wildly:

  • Thumbnail: 224x224 = 256 tokens (with 14x14 patches)
  • Document: 896x896 = 4096 tokens
  • High-res photo with dynamic tiling: 10000+ tokens

Dynamic resolution models like Qwen2-VL convert resolution directly to token count. Memory and compute scale accordingly.

Token Count Variability

A batch of requests might contain:

  • Request 1: 100 text tokens, no images
  • Request 2: 50 text tokens, 256 image tokens
  • Request 3: 200 text tokens, 2048 image tokens (high-res document)

This variability complicates batching and memory allocation.

Image Preprocessing Overhead

Before the vision encoder runs, images require preprocessing:

  1. Decode: Decompress JPEG/PNG to raw pixels
  2. Resize: Scale to encoder’s expected resolution
  3. Normalize: Convert to float and apply per-channel mean/std normalization (e.g. ImageNet statistics: mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]; CLIP-style encoders use their own, similar values)
  4. Patch/tile: For dynamic resolution, split into tiles
  5. Tensor conversion: Move to GPU memory

This preprocessing happens on CPU and can bottleneck throughput when image counts are high.
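A minimal sketch of steps 1-3 and 5 with torchvision (ImageNet statistics as quoted above; the model-specific tiling of step 4 is omitted, and the target device is an assumption):

import torch
from PIL import Image
from torchvision import transforms

preprocess = transforms.Compose([
    transforms.Resize((224, 224)),                     # 2. resize to the encoder's input size
    transforms.ToTensor(),                             # 3a. uint8 HWC -> float CHW in [0, 1]
    transforms.Normalize(mean=[0.485, 0.456, 0.406],   # 3b. per-channel normalization
                         std=[0.229, 0.224, 0.225]),
])

image = Image.open("photo.jpg").convert("RGB")         # 1. decode JPEG/PNG to raw pixels
pixels = preprocess(image).unsqueeze(0).to("cuda")     # 5. [1, 3, 224, 224] moved to GPU memory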

Memory Requirements

High-resolution images consume memory at multiple stages:

  • Raw pixels: 896x896x3 = 2.4MB per image
  • Vision encoder activations: Intermediate states during forward pass
  • Projected tokens: 4096 tokens x 4096 dim x 2 bytes (fp16) = 33MB per high-res image
  • KV cache: Each image token needs KV cache entries for all subsequent generation

For a 72B model processing a high-res document, the image tokens alone can consume several GB of KV cache memory.
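Back-of-the-envelope arithmetic for the KV-cache cost of image tokens (the layer count, KV head count, head dimension, and fp16 dtype below are illustrative of a 72B-class model with grouped-query attention; models without GQA cost several times more):

def kv_cache_bytes(num_tokens, num_layers=80, num_kv_heads=8, head_dim=128, dtype_bytes=2):
    # 2x for keys and values, per layer, per token
    return num_tokens * num_layers * num_kv_heads * head_dim * 2 * dtype_bytes

gb = kv_cache_bytes(4096) / 1e9
print(f"{gb:.2f} GB for 4096 image tokens")   # ~1.34 GB under these assumptions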

Batching Challenges

Text-only continuous batching works because all tokens are homogeneous. With images:

  • Vision encoder runs separately from LLM
  • Different images in a batch may have different resolutions
  • Prefill (processing prompt + images) is much heavier than decode (generating tokens)

vLLM’s solution: Hybrid parallelism where the vision encoder uses data parallelism (each GPU processes different images) while the LLM uses tensor parallelism. This avoids synchronization overhead during vision encoding.

Prefix caching: vLLM V1 supports prefix caching for multimodal inputs, so repeated images don’t require re-encoding.


VLM Quantization

Quantization reduces model size and speeds inference by using lower-precision weights and activations.

Quantizing the Vision Encoder

Vision encoders are often left at higher precision because:

  • They’re smaller than the LLM (typically < 1B params)
  • Visual features are sensitive to quantization artifacts
  • The encoder runs once per image (not autoregressive), so speed gains are less impactful

However, for edge deployment, encoder quantization matters. Research shows:

  • INT8 quantization typically maintains quality
  • INT4 can work with careful calibration
  • The vision modality is generally less sensitive than language

Quantizing the Language Model

Standard LLM quantization techniques apply:

  • GPTQ: Post-training quantization using calibration data
  • AWQ: Activation-aware weight quantization
  • GGUF: llama.cpp’s format with various quantization levels (Q4_K_M, Q5_K_S, etc.)
  • bitsandbytes: 4-bit and 8-bit for training/inference

The language model dominates parameter count (e.g., 72B LLM vs 400M vision encoder), so LLM quantization provides most of the size/speed benefits.
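For example, loading a LLaVA checkpoint in 4-bit with bitsandbytes looks roughly like this (a sketch rather than a tuned recipe; the quantization settings shown are common defaults, and the vision tower is quantized along with the LLM unless explicitly skipped):

import torch
from transformers import AutoProcessor, BitsAndBytesConfig, LlavaForConditionalGeneration

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

model = LlavaForConditionalGeneration.from_pretrained(
    "llava-hf/llava-1.5-7b-hf",
    quantization_config=bnb_config,   # weights stored in 4-bit, compute in fp16
    device_map="auto",
)
processor = AutoProcessor.from_pretrained("llava-hf/llava-1.5-7b-hf")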

Joint Quantization Strategies

Q-VLM (NeurIPS 2024) proposes cross-layer dependency mining for VLM quantization. Rather than quantizing layer-by-layer, it considers how quantization errors propagate through the full vision-language pipeline.

Results: 2.78x memory compression, 1.44x speedup on 13B LLaVA without performance degradation.

MBQ (Modality-Balanced Quantization) (CVPR 2025) observes that language modules are more sensitive to quantization than vision modules. Their approach:

  • 4-bit quantization for vision (ViT modules)
  • 8-bit quantization for language
  • Up to 4% improvement under W3A16 and 11% under W4A8 vs uniform quantization

Quality Tradeoffs

Quantization affects different VLM capabilities differently:

  • OCR accuracy can degrade with aggressive quantization
  • Fine-grained visual details may be lost
  • Reasoning over complex diagrams is sensitive
  • General image description is more robust

Recommendation: Benchmark your specific use case. A model that’s fine for photo captioning might fail on document analysis when heavily quantized.


Practical VLM Inference

Hugging Face Transformers

The most straightforward approach for experimentation.

LLaVA example:

import torch
from transformers import AutoProcessor, LlavaForConditionalGeneration
from PIL import Image
import requests

model_id = "llava-hf/llava-1.5-7b-hf"

# Load model and processor
model = LlavaForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

# Prepare inputs
url = "https://upload.wikimedia.org/wikipedia/commons/thumb/3/3a/Cat03.jpg/1200px-Cat03.jpg"
image = Image.open(requests.get(url, stream=True).raw)

conversation = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "What animal is in this image? Describe it briefly."}
        ]
    }
]

prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)
inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)

# Generate
output = model.generate(**inputs, max_new_tokens=100, do_sample=False)
response = processor.decode(output[0], skip_special_tokens=True)
print(response)

Qwen2-VL example:

from transformers import Qwen2VLForConditionalGeneration, AutoProcessor
from PIL import Image

model = Qwen2VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2-VL-7B-Instruct",
    torch_dtype="auto",
    device_map="auto"
)
processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-7B-Instruct")

# Qwen2-VL supports dynamic resolution
image = Image.open("document.png")  # Any resolution

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": image},
            {"type": "text", "text": "Extract all text from this document."}
        ]
    }
]

text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[text], images=[image], return_tensors="pt").to(model.device)

generated_ids = model.generate(**inputs, max_new_tokens=512)
output = processor.batch_decode(generated_ids, skip_special_tokens=True)

Pipeline API (simpler):

from transformers import pipeline

pipe = pipeline("image-text-to-text", model="llava-hf/llava-1.5-7b-hf")

output = pipe(
    images="https://example.com/image.jpg",
    text="Describe this image in detail.",
    max_new_tokens=200
)

llama.cpp Multimodal

llama.cpp supports vision models via the mmproj (multimodal projector) architecture.

Installation:

# macOS
brew install llama.cpp

# Or build from source
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp && make

Download model and projector:

# Pre-quantized models available from ggml-org
# https://huggingface.co/collections/ggml-org/multimodal-ggufs

# Example: Qwen2.5-VL-7B
wget https://huggingface.co/ggml-org/Qwen2.5-VL-7B-Instruct-GGUF/resolve/main/qwen2.5-vl-7b-instruct-q4_k_m.gguf
wget https://huggingface.co/ggml-org/Qwen2.5-VL-7B-Instruct-GGUF/resolve/main/qwen2.5-vl-7b-instruct-vision.gguf

Run inference:

./llama-mtmd-cli \
    -m qwen2.5-vl-7b-instruct-q4_k_m.gguf \
    --mmproj qwen2.5-vl-7b-instruct-vision.gguf \
    -p "Describe this image:" \
    --image photo.jpg

Supported models:

  • Gemma 3 (4B, 12B, 27B)
  • Qwen 2 VL (2B, 7B)
  • Qwen 2.5 VL (3B, 7B, 32B, 72B)
  • SmolVLM variants
  • Pixtral 12B
  • InternVL 2.5
  • Mistral Small 3.1 24B

Limitations: Currently supports static images only. Video processing is not yet implemented.

vLLM

vLLM provides high-throughput VLM serving with continuous batching.

Installation:

pip install vllm

Offline inference:

from vllm import LLM, SamplingParams
from PIL import Image

llm = LLM(model="llava-hf/llava-1.5-7b-hf")

prompt = "USER: <image>\nWhat is in this image?\nASSISTANT:"
sampling_params = SamplingParams(temperature=0.0, max_tokens=256)

# Single image (vLLM expects a PIL image, not a file path)
image = Image.open("path/to/image.jpg")
outputs = llm.generate(
    {
        "prompt": prompt,
        "multi_modal_data": {"image": image}
    },
    sampling_params
)
print(outputs[0].outputs[0].text)

Server deployment:

# Start server
python -m vllm.entrypoints.openai.api_server \
    --model llava-hf/llava-1.5-7b-hf \
    --chat-template template_llava.jinja   # example template shipped in the vLLM repo

# Query via OpenAI-compatible API
curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "llava-hf/llava-1.5-7b-hf",
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": "What is in this image?"},
                    {"type": "image_url", "image_url": {"url": "https://example.com/image.jpg"}}
                ]
            }
        ]
    }'

Performance tuning:

# Use data parallelism for vision encoder
python -m vllm.entrypoints.openai.api_server \
    --model Qwen/Qwen2-VL-7B-Instruct \
    --mm-encoder-tp-mode data

# Cap the number of images accepted per prompt (the flag counts items, not tokens)
--limit-mm-per-prompt '{"image": 4}'

Supported models (partial list):

  • LLaVA 1.5, LLaVA-NeXT
  • Qwen-VL, Qwen2-VL
  • InternVL, InternVL2
  • PaliGemma
  • Phi-3 Vision
  • BLIP-2
  • Chameleon

Ollama

Ollama provides the simplest local VLM experience.

Installation:

# macOS/Linux
curl -fsSL https://ollama.com/install.sh | sh

# Or download from ollama.com

Run vision models:

# Pull a vision model
ollama pull llava

# Interactive mode
ollama run llava
>>> [paste image path or URL]
>>> What's in this image?

# Or with llava-phi3 (smaller, faster)
ollama pull llava-phi3
ollama run llava-phi3

API usage:

import ollama

response = ollama.chat(
    model='llava',
    messages=[{
        'role': 'user',
        'content': 'Describe this image',
        'images': ['./photo.jpg']
    }]
)
print(response['message']['content'])

Available vision models on Ollama:

  • llava (7B and 13B variants)
  • llava-phi3 (3.8B, faster)
  • bakllava
  • moondream (smaller, 1.8B)

Code Examples for Common Tasks

Document OCR

from transformers import Qwen2VLForConditionalGeneration, AutoProcessor
from PIL import Image

model = Qwen2VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2-VL-7B-Instruct",
    torch_dtype="auto",
    device_map="auto"
)
processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-7B-Instruct")

# Use high resolution for documents
image = Image.open("invoice.png")

messages = [{
    "role": "user",
    "content": [
        {"type": "image", "image": image},
        {"type": "text", "text": "Extract all text from this invoice. Format as structured data."}
    ]
}]

inputs = processor(
    text=processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True),
    images=[image],
    return_tensors="pt"
).to(model.device)

output = model.generate(**inputs, max_new_tokens=1024)
print(processor.decode(output[0], skip_special_tokens=True))

Multi-Image Comparison

from transformers import LlavaNextProcessor, LlavaNextForConditionalGeneration
import torch
from PIL import Image

model = LlavaNextForConditionalGeneration.from_pretrained(
    "llava-hf/llava-v1.6-mistral-7b-hf",
    torch_dtype=torch.float16,
    device_map="auto"
)
processor = LlavaNextProcessor.from_pretrained("llava-hf/llava-v1.6-mistral-7b-hf")

image1 = Image.open("product_v1.jpg")
image2 = Image.open("product_v2.jpg")

prompt = "[INST] <image>\n<image>\nCompare these two product images. What are the differences? [/INST]"

inputs = processor(images=[image1, image2], text=prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=300)
print(processor.decode(output[0], skip_special_tokens=True))

Chart Analysis

# Using Florence-2 for detailed chart understanding
from transformers import AutoProcessor, AutoModelForCausalLM
from PIL import Image

model = AutoModelForCausalLM.from_pretrained(
    "microsoft/Florence-2-large",
    torch_dtype="auto",
    trust_remote_code=True
).to("cuda")
processor = AutoProcessor.from_pretrained("microsoft/Florence-2-large", trust_remote_code=True)

image = Image.open("sales_chart.png")

# Florence-2 uses task-specific prompts
prompt = "<MORE_DETAILED_CAPTION>"

inputs = processor(text=prompt, images=image, return_tensors="pt").to("cuda")
generated_ids = model.generate(
    input_ids=inputs["input_ids"],
    pixel_values=inputs["pixel_values"],
    max_new_tokens=512,
    num_beams=3
)
result = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(result)

Batch Processing with vLLM

from vllm import LLM, SamplingParams
from PIL import Image
import os

llm = LLM(
    model="Qwen/Qwen2-VL-7B-Instruct",
    limit_mm_per_prompt={"image": 1}
)

sampling_params = SamplingParams(temperature=0.0, max_tokens=256)

# Qwen2-VL expects its ChatML prompt format with vision placeholder tokens
prompt = (
    "<|im_start|>user\n<|vision_start|><|image_pad|><|vision_end|>"
    "Describe this image briefly.<|im_end|>\n<|im_start|>assistant\n"
)

# Prepare batch of requests (vLLM expects PIL images, not file paths)
image_dir = "./images"
requests = []
for img_file in os.listdir(image_dir):
    if img_file.endswith(('.jpg', '.png')):
        requests.append({
            "prompt": prompt,
            "multi_modal_data": {"image": Image.open(os.path.join(image_dir, img_file))}
        })

# Process batch
outputs = llm.generate(requests, sampling_params)

for i, output in enumerate(outputs):
    print(f"Image {i}: {output.outputs[0].text}")

References

Papers

  • Dosovitskiy et al. “An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale” (ViT, 2020)
  • Radford et al. “Learning Transferable Visual Models From Natural Language Supervision” (CLIP, 2021)
  • Zhai et al. “Sigmoid Loss for Language Image Pre-Training” (SigLIP, ICCV 2023) - https://arxiv.org/abs/2303.15343
  • Liu et al. “Visual Instruction Tuning” (LLaVA, NeurIPS 2023)
  • Wang et al. “Qwen2-VL: Enhancing Vision-Language Model’s Perception of the World at Any Resolution” (2024) - https://arxiv.org/abs/2409.12191
  • Qwen Team. “Qwen2.5-VL Technical Report” (2025) - https://arxiv.org/abs/2502.13923
  • Chen et al. “InternVL: Scaling up Vision Foundation Models” (CVPR 2024)
  • OpenGVLab. “InternVL3” (2025) - https://internvl.github.io/blog/2025-04-11-InternVL-3.0/
  • Wang et al. “CogVLM: Visual Expert for Pretrained Language Models” (NeurIPS 2024) - https://arxiv.org/abs/2311.03079
  • Xiao et al. “Florence-2: Advancing a Unified Representation for a Variety of Vision Tasks” (2024)
  • Shao et al. “Q-VLM: Post-training Quantization for Large Vision-Language Models” (NeurIPS 2024) - https://arxiv.org/abs/2410.08119
  • Li et al. “MBQ: Modality-Balanced Quantization for Large Vision-Language Models” (CVPR 2025)
  • Vasu et al. “FastVLM: Efficient Vision Encoding for Vision Language Models” (CVPR 2025)
