Multimodal

Learn what multimodal AI means in machine learning, with examples and related concepts.

Definition

Multimodal AI refers to models that can understand and generate multiple types of data — text, images, audio, video, and code — rather than being limited to a single modality like text-only.

Early LLMs were text-in, text-out. You couldn’t show them an image or play them audio. Modern multimodal models like Claude, GPT-4o, and Gemini can process images alongside text, understand screenshots, read handwritten notes, analyze charts, and even work with audio and video.

This is a fundamental shift in how AI is used. Instead of describing a problem in words, you can show the model a screenshot of an error, a photo of a whiteboard sketch, or a chart from a dashboard. Multimodal models understand the world more like humans do — through multiple senses, not just language.

How It Works

UNIMODAL (text only):
  Text → [LLM] → Text

MULTIMODAL (multiple modalities):
  Text   ──┐
  Image  ──┤→ [Multimodal Model] → Text / Image / Audio
  Audio  ──┘

How models process different modalities:

  Image input:
    Image → Vision Encoder (ViT) → Image tokens → Transformer
    "A photo of a cat" becomes ~1,000 tokens the model can attend to

  Audio input:
    Audio → Audio Encoder (Whisper) → Audio tokens → Transformer
    Speech becomes tokens just like text

  All modalities become tokens in the same space,
  so the model can reason across them using attention.
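The patch-based tokenization described above can be sketched with simple arithmetic. This is an order-of-magnitude illustration, not any specific model's implementation: the 16×16 patch size is a common ViT choice and is an assumption here, and real models also resize, tile, or compress images before encoding.

```python
def image_token_count(height: int, width: int, patch_size: int = 16) -> int:
    """Rough token count for a patch-based vision encoder.

    A ViT-style encoder slices the image into patch_size x patch_size
    squares and emits one token per patch. Treat this as a sketch;
    production models preprocess images in model-specific ways.
    """
    return (height // patch_size) * (width // patch_size)

# A 512x512 image at 16x16 patches yields 32 * 32 = 1024 patch tokens,
# roughly the "~1,000 tokens" figure above.
print(image_token_count(512, 512))  # → 1024
```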

Input vs Output Modalities

  Model     Text In   Image In   Audio In   Text Out   Image Out   Audio Out
  Claude    Yes       Yes        No         Yes        No          No
  GPT-4o    Yes       Yes        Yes        Yes        Yes         Yes
  Gemini    Yes       Yes        Yes        Yes        Yes         No
  Llama 3   Yes       Yes        No         Yes        No          No

Why It Matters

Multimodal input removes a translation step: instead of describing a screenshot, a scanned document, or a chart in words, you hand the model the artifact itself. That unlocks workflows like debugging from error screenshots, extracting structured data from charts, generating code from UI mockups, and writing alt-text for accessibility.

Example

# Claude vision — analyze an image
from anthropic import Anthropic
import base64

client = Anthropic()

# Read an image file
with open("architecture_diagram.png", "rb") as f:
    image_data = base64.standard_b64encode(f.read()).decode("utf-8")

response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    messages=[{
        "role": "user",
        "content": [
            {
                "type": "image",
                "source": {
                    "type": "base64",
                    "media_type": "image/png",
                    "data": image_data,
                },
            },
            {
                "type": "text",
                "text": "Explain this architecture diagram. What are the main components and how do they interact?"
            }
        ],
    }]
)

print(response.content[0].text)
# → Detailed explanation of the architecture shown in the image
# Practical: analyze a chart/graph
with open("sales_chart.png", "rb") as f:
    chart_image_data = base64.standard_b64encode(f.read()).decode("utf-8")

response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    messages=[{
        "role": "user",
        "content": [
            {
                "type": "image",
                "source": {
                    "type": "base64",
                    "media_type": "image/png",
                    "data": chart_image_data,
                },
            },
            {
                "type": "text",
                "text": "Extract the data from this bar chart as a markdown table. Include all values."
            }
        ],
    }]
)
# → Markdown table with extracted data from the chart
# Multiple images in one request
# (before_image and after_image are base64-encoded PNGs, loaded as above)
response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    messages=[{
        "role": "user",
        "content": [
            {"type": "image", "source": {"type": "base64", "media_type": "image/png", "data": before_image}},
            {"type": "image", "source": {"type": "base64", "media_type": "image/png", "data": after_image}},
            {"type": "text", "text": "Compare these two UI designs. What changed between v1 and v2?"}
        ],
    }]
)
# → Detailed comparison of the two designs

Common Multimodal Use Cases

  Use Case              Input                      What the Model Does
  Bug reports           Screenshot + description   Identifies the error and suggests fixes
  Document processing   Scanned PDF image          Extracts text, tables, and structure
  Code from design      UI mockup image            Generates HTML/CSS matching the design
  Data extraction       Chart/graph image          Converts visual data to structured format
  Accessibility         Image                      Generates detailed alt-text descriptions
  Visual QA             Photo + question           Answers questions about image content
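The request shape is identical across these use cases, so it is natural to wrap it in a small helper. This is an illustrative sketch, not part of the anthropic SDK: `build_image_content`, `ask_about_image`, and `MEDIA_TYPES` are names invented here, and the extension-to-media-type mapping is an assumption covering common image formats.

```python
import base64
from pathlib import Path

# Hypothetical helper, not part of the anthropic SDK.
MEDIA_TYPES = {".png": "image/png", ".jpg": "image/jpeg",
               ".jpeg": "image/jpeg", ".gif": "image/gif", ".webp": "image/webp"}

def build_image_content(image_bytes: bytes, media_type: str, question: str) -> list:
    """Build the user-message content blocks for one image plus a text question."""
    return [
        {"type": "image",
         "source": {"type": "base64",
                    "media_type": media_type,
                    "data": base64.standard_b64encode(image_bytes).decode("utf-8")}},
        {"type": "text", "text": question},
    ]

def ask_about_image(client, path: str, question: str,
                    model: str = "claude-sonnet-4-6") -> str:
    """Send one image file and a question; return the model's text answer."""
    p = Path(path)
    content = build_image_content(p.read_bytes(),
                                  MEDIA_TYPES[p.suffix.lower()], question)
    response = client.messages.create(
        model=model, max_tokens=1024,
        messages=[{"role": "user", "content": content}],
    )
    return response.content[0].text
```

Usage mirrors the examples above, e.g. `ask_about_image(client, "chart.png", "Extract the data from this bar chart as a markdown table.")`.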

Key Takeaways

- Multimodal models accept and/or produce multiple data types (text, images, audio, video) rather than text only.
- Each modality is encoded into tokens in a shared space, so attention can reason across them.
- Input and output support differ by model: check which modalities a model accepts and which it can generate.
- Vision input enables practical workflows such as screenshot debugging, document processing, chart data extraction, and alt-text generation.


Part of the DeepRaft Glossary — AI and ML terms explained for developers.