
Multimodal AI

AI that understands images, audio, and video — not just text.


Definition

Multimodal AI refers to models that can process and reason across multiple types of input simultaneously — typically text and images, but increasingly audio, video, documents, and structured data. Models like GPT-4o, Gemini 1.5 Pro, and Claude 3.5 are multimodal — they can analyze images, describe visual content, solve visual problems, and reason about information across different modalities.

How multimodal models work

Multimodal models use modality-specific encoders that map each input type into a shared vector embedding space, then merge everything for joint Transformer processing:

  1. Image encoder: A Vision Transformer (ViT) divides the image into 16×16 or 32×32 pixel patches. Each patch → linear embedding → positional encoding → Transformer. Output: a sequence of patch embeddings with the same dimension as text tokens.
  2. Audio encoder: A log-mel spectrogram is computed from the raw waveform, then processed by a Transformer encoder (e.g., Whisper's encoder). Output: a sequence of audio embeddings.
  3. Fusion: Image/audio embeddings are concatenated or interleaved with text token embeddings. The combined sequence is passed through the main language model.
  4. Training: Models are trained on paired multimodal data (image-caption pairs, video-transcript pairs) using a contrastive or generative objective to align representations.
| Model | Modalities | Vision encoder | Context |
| --- | --- | --- | --- |
| GPT-4o | Text, image, audio | Custom ViT | 128K tokens |
| Claude 3.5 Sonnet | Text, image | Custom ViT | 200K tokens |
| Gemini 1.5 Pro | Text, image, audio, video | Native multimodal | 1M tokens |
| LLaMA 3.2 Vision | Text, image | ViT-L/14 | 128K tokens |
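
The encode-and-fuse pipeline above can be sketched in a few lines of NumPy. This is a toy illustration, not any production model: the dimensions, random projection matrix, and additive positional noise are all placeholder assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 64  # shared embedding dimension (toy size)

# 1. Image encoder: split a 224x224 RGB image into 16x16 patches,
#    flatten each patch, and project it into the shared space.
image = rng.random((224, 224, 3))
patches = image.reshape(14, 16, 14, 16, 3).transpose(0, 2, 1, 3, 4)
patches = patches.reshape(196, 16 * 16 * 3)            # 196 patches of 768 values
W_img = rng.standard_normal((16 * 16 * 3, d_model))
img_emb = patches @ W_img                              # (196, d_model)
img_emb += rng.standard_normal((196, d_model)) * 0.02  # stand-in positional encoding

# 2. Text tokens already live in the same d_model space.
text_emb = rng.standard_normal((10, d_model))          # 10 text tokens

# 3. Fusion: concatenate into one sequence for the language model.
sequence = np.concatenate([img_emb, text_emb], axis=0)
print(sequence.shape)  # (206, 64)
```

The key point the sketch shows is that after the encoder, image patches are just more tokens in the same sequence the Transformer processes.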

Practical uses for students and developers

Multimodal capabilities unlock entirely new workflows that pure text models cannot support:

| Use case | Input | What the model does |
| --- | --- | --- |
| Solve handwritten math | Photo of notebook | OCR + parse + solve equation step-by-step |
| Explain textbook diagram | Photo of figure in book | Identify, describe, and explain the visual concept |
| Debug screenshot errors | Screenshot of error/terminal | Read error text + suggest fixes |
| Summarize handwritten notes | Photo of notes page | Transcribe + organize into structured summary |
| Analyze research charts | Image of chart/graph | Read axes, values, trends, and interpret findings |
| Extract table data | Photo of printed table | Convert to CSV/JSON format for further processing |
| UI feedback for developers | Screenshot of UI | Identify layout issues, accessibility, UX suggestions |

Sending an image to Claude with a text question

import anthropic
import base64
from pathlib import Path

client = anthropic.Anthropic()

def ask_about_image(image_path: str, question: str) -> str:
    image_data = base64.standard_b64encode(
        Path(image_path).read_bytes()
    ).decode("utf-8")

    # Detect the MIME type from the file extension; fail clearly on
    # formats the API does not accept rather than raising a KeyError
    ext = Path(image_path).suffix.lower()
    media_type = {".jpg": "image/jpeg", ".jpeg": "image/jpeg",
                  ".png": "image/png", ".gif": "image/gif",
                  ".webp": "image/webp"}.get(ext)
    if media_type is None:
        raise ValueError(f"Unsupported image format: {ext}")

    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=1024,
        messages=[{
            "role": "user",
            "content": [
                {
                    "type": "image",
                    "source": {
                        "type": "base64",
                        "media_type": media_type,
                        "data": image_data,
                    },
                },
                {"type": "text", "text": question}
            ],
        }]
    )
    return response.content[0].text

# Example use
answer = ask_about_image("textbook_page.jpg",
    "Explain the diagram on this page and summarize the key concept.")
print(answer)

Image resolution and token costs

Vision models process images by dividing them into tiles or patches. Higher resolution means more tiles, more tokens, and higher cost and latency. Understanding this tradeoff helps you optimize for your use case:

| Model | Low-detail mode | High-detail mode | Max resolution | Notes |
| --- | --- | --- | --- | --- |
| GPT-4o | ~85 tokens (512px tile) | Up to ~2,000 tokens (multiple tiles) | 2048×2048 | Each 512×512 tile = ~170 tokens |
| Claude 3.5 Sonnet | ~1,600 tokens | Up to ~4,000 tokens | 8000×8000 (downscaled) | Scales based on image dimensions |
| Gemini 1.5 Pro | ~258 tokens | Up to ~1,024 tokens per frame | No fixed limit | Video: ~263 tokens per frame |
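
For Claude specifically, Anthropic's documentation gives a rule of thumb of roughly (width × height) / 750 tokens per image, with oversized images downscaled first. The helper below is a sketch based on that published guidance; the exact 1568px long-edge threshold is taken from the same docs, but treat the numbers as estimates, not billing guarantees.

```python
def estimate_claude_image_tokens(width: int, height: int) -> int:
    """Rough token estimate for one image sent to Claude, using the
    (width * height) / 750 rule of thumb. Images whose long edge
    exceeds ~1568px are assumed to be downscaled before processing."""
    long_edge = max(width, height)
    if long_edge > 1568:
        scale = 1568 / long_edge
        width, height = int(width * scale), int(height * scale)
    return int(width * height / 750)

print(estimate_claude_image_tokens(1092, 1092))  # roughly 1,590 tokens
```

A quick sanity check like this makes it easy to see why resizing a 4000px scan before upload can cut image cost substantially.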

Optimize image quality for your task

For reading text from documents, use high-detail mode and scan at ≥300 DPI. For general diagram description, low-detail mode is often sufficient and 4–8× cheaper. Never send unnecessarily large images — resize to the minimum required for your task before sending to the API.
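
A minimal pre-resize step might look like this (a sketch assuming Pillow is installed; the 1568px default long edge is an assumption borrowed from Claude's documented limits, so adjust it for your target model):

```python
from PIL import Image

def downscale_for_api(src: str, dst: str, max_edge: int = 1568) -> None:
    """Shrink an image so its longest edge is at most max_edge pixels,
    preserving aspect ratio, before sending it to a vision API."""
    with Image.open(src) as img:
        img.thumbnail((max_edge, max_edge))  # in-place; never upscales
        img.save(dst)

# Demo on a synthetic 4000x2000 image
Image.new("RGB", (4000, 2000), "white").save("scan_demo.png")
downscale_for_api("scan_demo.png", "scan_small.png")
print(Image.open("scan_small.png").size)  # longest edge now <= 1568
```

`Image.thumbnail` only shrinks, so small images pass through untouched, which makes the helper safe to call unconditionally.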

Text in images (OCR)

Reading text from images (Optical Character Recognition) is one of the highest-value multimodal use cases. Vision LLMs dramatically outperform traditional OCR systems on complex real-world inputs:

| Input type | Traditional OCR (Tesseract) | Vision LLM (Claude/GPT-4o) |
| --- | --- | --- |
| Printed text, clean scan | Excellent (>99% accuracy) | Excellent (>99%) |
| Handwritten text | Poor (20–60% accuracy) | Good–Excellent (70–95%) |
| Mixed layout (columns, tables) | Struggles with layout | Understands structure natively |
| Mathematical notation | Cannot parse | Good for most notation; verify complex LaTeX |
| Non-Latin scripts | Requires language packs | Excellent (trained on multilingual data) |
| Low quality / degraded scan | Fails badly | Degrades gracefully; better than OCR |
| Text overlaid on complex background | Poor | Good (understands foreground/background) |

Always verify critical OCR output

Vision LLMs can misread very small text (<8pt equivalent in the image), certain handwriting styles, and very degraded scans. For medical records, legal documents, or financial data, always have a human verify extracted text before acting on it.

Emerging modalities: audio, video, 3D

The frontier is moving rapidly toward truly universal models that process any combination of inputs. Here is where each modality stands in 2025:

| Modality | Capability level | Leading model | Key limitation |
| --- | --- | --- | --- |
| Image understanding | Mature | Claude 3.5, GPT-4o, Gemini 1.5 | Very small text; fine-grained counting |
| Speech recognition | Mature | Whisper (OpenAI), Gemini Audio | Noisy environments; rare accents |
| Native audio understanding | Emerging | GPT-4o Audio, Gemini 1.5 Pro | Latency for real-time; cost |
| Short video (<1 min) | Developing | Gemini 1.5 Pro, GPT-4o | Temporal reasoning; object tracking |
| Long video (hours) | Early stage | Gemini 1.5 Pro (1M ctx) | Very expensive; limited availability |
| 3D / point clouds | Research stage | GPT-4o with 3D rendering tricks | No native 3D understanding yet |

Video: frames as images

Most video-capable models sample frames at a fixed rate (e.g., 1 frame/second for Gemini 1.5) and process them as a sequence of image embeddings. True native video understanding — tracking objects and events across frames — is still an active research problem. Sora's video generation uses a different architecture: video diffusion transformers (DiT).
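
The index arithmetic behind fixed-rate sampling is simple; the sketch below computes which frames a "sample at N fps" strategy keeps (the actual frame extraction, e.g. with OpenCV or ffmpeg, is left out as a tooling assumption):

```python
def sample_frame_indices(total_frames: int, video_fps: float,
                         sample_fps: float = 1.0) -> list[int]:
    """Indices of the frames a fixed-rate sampler keeps, e.g.
    1 frame per second as Gemini 1.5 is documented to do."""
    step = video_fps / sample_fps
    indices = []
    t = 0.0
    while round(t) < total_frames:
        indices.append(round(t))
        t += step
    return indices

# A 10-second clip at 30 fps, sampled at 1 frame/second:
print(sample_frame_indices(300, 30.0))  # [0, 30, 60, ..., 270]
```

At 1 fps, a one-minute clip becomes only 60 images, which is what makes the per-frame token costs in the table above workable.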

Practice questions

  1. What is the difference between late fusion and early fusion in multimodal AI systems? (Answer: Late fusion: each modality is processed independently through its own encoder; outputs are combined (concatenated, averaged, or learned weighted sum) at the final decision layer. Simple, modular, each encoder can be optimised separately. Misses cross-modal interactions during processing. Early fusion: raw inputs from multiple modalities are combined before or early in the neural network. Enables cross-modal feature learning. Harder to train (requires paired multimodal data), but learns richer joint representations.)
  2. What is cross-modal attention and why is it central to transformers like BLIP-2? (Answer: Cross-modal attention: queries from one modality attend to keys/values from another. In BLIP-2: the Querying Transformer (Q-Former) has learnable query tokens that attend (via cross-attention) to frozen image encoder outputs — extracting visual information relevant to the text. The visual tokens produced by Q-Former bridge the frozen image encoder and frozen LLM. Cross-modal attention is what allows the LLM to condition text generation on image content without retraining either component.)
  3. What is the grounding problem in multimodal AI? (Answer: Grounding: connecting language symbols to perceptual content. A model that understands 'red apple' should link the word 'red' to specific wavelengths of reflected light and the visual appearance of that colour. Without grounding, language models manipulate symbols without genuine perceptual reference. Multimodal models achieve partial grounding by training on image-text pairs — the model learns that 'red apple' co-occurs with images containing certain visual patterns. CLIP and similar contrastive models achieve strong perceptual grounding.)
  4. What are the evaluation challenges specific to multimodal generation (image + text)? (Answer: Text evaluation: BLEU, ROUGE, BERTScore — standard NLP metrics. Image-text alignment: CLIPScore measures cosine similarity between generated image and text prompt embeddings. Visual quality: FID (Fréchet Inception Distance) measures distributional similarity to real images. Compositional accuracy: does the image correctly show 'a red ball to the LEFT of a blue cube'? Hard to measure automatically. Human evaluation: costly gold standard. Current models score well on CLIPScore but still struggle with precise spatial relationships and accurate text rendering.)
  5. What is the difference between vision-language models (VLMs) for understanding vs generation? (Answer: VLMs for understanding: image → model → text. Tasks: visual QA, image captioning, optical character recognition, chart understanding. Examples: LLaVA, Idefics, PaliGemma, Claude Vision. Architecture: vision encoder + LLM. VLMs for generation: text → model → image. Tasks: text-to-image synthesis. Examples: DALL-E 3, Stable Diffusion, Midjourney. Architecture: text encoder + diffusion model or autoregressive image generator. Unified models (GPT-4o, Gemini 2.5): both input and output images, bridging understanding and generation in one system.)
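
The cross-modal attention in question 2 can be sketched as plain scaled dot-product attention where the queries come from one modality and the keys/values from another. Toy dimensions and random inputs below; the learned Q/K/V projections are omitted, so this illustrates the mechanism, not BLIP-2's actual implementation.

```python
import numpy as np

def cross_attention(queries: np.ndarray, keys_values: np.ndarray) -> np.ndarray:
    """Scaled dot-product attention: one modality's queries attend to
    another modality's tokens (Q/K/V projections omitted for brevity)."""
    d = queries.shape[-1]
    scores = queries @ keys_values.T / np.sqrt(d)           # (n_q, n_kv)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)          # row-wise softmax
    return weights @ keys_values                            # (n_q, d)

rng = np.random.default_rng(0)
text_queries = rng.standard_normal((8, 32))    # learnable query tokens
image_tokens = rng.standard_normal((196, 32))  # frozen ViT patch outputs
out = cross_attention(text_queries, image_tokens)
print(out.shape)  # (8, 32)
```

Each output row is a weighted mixture of image tokens, i.e. the visual information most relevant to that query, which is exactly the bridging role the Q-Former plays.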

On LumiChats

LumiChats supports multi-image attachments in chat. Images are sent to vision-capable models (GPT-4o, Gemini, Claude) via OpenRouter's multimodal API. You can attach multiple images to a single message for comparison analysis.

