Inference
Learn what Inference means in AI and machine learning, with examples and related concepts.
Definition
Inference is the process of running a trained AI model to generate predictions or outputs. When you send a prompt to Claude and get a response, that’s inference. When Midjourney generates an image from your text, that’s inference. Every time an AI model produces output from input, it’s performing inference.
Inference is distinct from training (where the model learns from data). Training happens once (or periodically) and costs millions of dollars for large models. Inference happens billions of times per day across all users and is the ongoing operational cost of running AI services.
For most developers and companies, inference is what matters — training is done by the model providers (Anthropic, OpenAI, Google). Your job is to run inference efficiently: low latency, reasonable cost, and high throughput.
How It Works
TRAINING (done by model provider):
Trillions of tokens → GPU cluster (months) → Trained model weights
Cost: $10M-$100M+
Happens: Once
INFERENCE (done every time you use AI):
User prompt → Tokenize → Forward passes through loaded model → Generated output
Cost: Fractions of a cent per request
Happens: Billions of times per day
Inference pipeline:
Input text: "Explain quantum computing"
↓
Tokenize: [Explain, quantum, computing]
↓
Forward pass through model (billions of matrix multiplications)
↓
Generate tokens one by one: ["Quantum", " computing", " uses", ...]
↓
Detokenize and return response
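The pipeline above can be sketched as a toy loop. Everything here is a stand-in — the "tokenizer" is a whitespace split and the "model" is a hard-coded lookup, not a real neural network — but the four stages map one-to-one onto real inference:

```python
# Toy inference loop illustrating the pipeline stages above.
# The "model" is a canned lookup table, not a real LLM.

def tokenize(text):
    # Real tokenizers split into subwords; whitespace split is a stand-in
    return text.split()

def forward_pass(tokens):
    # A real model runs billions of matrix multiplications and returns a
    # probability distribution over the vocabulary; we fake the result
    canned = {"Explain": ["Quantum", "computing", "uses", "qubits", "."]}
    return canned.get(tokens[0], ["[unknown]"])

def generate(prompt, max_tokens=5):
    tokens = tokenize(prompt)                  # 1. tokenize input
    continuation = forward_pass(tokens)        # 2. forward pass
    output = []
    for tok in continuation[:max_tokens]:      # 3. one token per step
        output.append(tok)
    return " ".join(output)                    # 4. detokenize and return

print(generate("Explain quantum computing"))
# → Quantum computing uses qubits .
```

Real inference repeats step 2 for every generated token, feeding each new token back into the model — which is why output length dominates latency.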
Key Metrics
| Metric | What It Measures | Typical Values |
|---|---|---|
| Latency (TTFT) | Time until the first output token arrives | 200 ms - 2 s |
| Tokens per second (TPS) | Generation speed of a single response stream | 30-150 tok/s |
| Throughput | Total tokens served per second across all concurrent requests | Scales with batching and hardware |
| Cost per token | Price of inference | $0.25-$75 per 1M tokens |
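Given timestamps and token counts from a single streamed request, the metrics relate as follows. The numbers below are made-up illustrations, not benchmarks of any particular model:

```python
# Hypothetical measurements for one streamed request (illustrative only)
ttft = 0.4             # seconds until the first token arrives (latency)
total_time = 3.0       # seconds until the last token arrives
output_tokens = 200
price_per_mtok = 15.0  # dollars per 1M output tokens (assumed price)

# Generation speed: tokens produced after the first one arrived
tps = output_tokens / (total_time - ttft)
# Cost scales linearly with token count
cost = output_tokens / 1_000_000 * price_per_mtok

print(f"TTFT: {ttft:.1f}s, TPS: {tps:.0f} tok/s, cost: ${cost:.4f}")
# → TTFT: 0.4s, TPS: 77 tok/s, cost: $0.0030
```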
Batching
Inference providers process multiple requests simultaneously (batching) to maximize GPU utilization:
Without batching:
Request 1 → [GPU idle 80%] → Response 1
Request 2 → [GPU idle 80%] → Response 2
With batching:
Request 1 ─┐
Request 2 ─┤→ [GPU utilized 90%] → Response 1, Response 2
Request 3 ─┘ Response 3
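The intuition behind batching can be sketched with a toy cost model: token-by-token decoding is memory-bandwidth-bound, so one forward step takes roughly the same wall-clock time whether the GPU processes one request or a small batch. All numbers here are assumptions for illustration:

```python
# Toy cost model: a decoding step costs ~the same whether it advances
# 1 request or a small batch, because decoding is bandwidth-bound.
step_time = 0.02          # seconds per forward step (assumed)
tokens_per_request = 100
requests = 8

# Serial: each request runs its 100 steps alone
serial_time = requests * tokens_per_request * step_time
# Batched: all 8 requests advance one token per step,
# with a small assumed per-batch overhead (10%)
batched_time = tokens_per_request * (step_time * 1.1)

print(f"Serial:  {serial_time:.1f}s total")
print(f"Batched: {batched_time:.1f}s total "
      f"({serial_time / batched_time:.1f}x speedup)")
# → Serial:  16.0s total
# → Batched: 2.2s total (7.3x speedup)
```

This is why providers aggressively batch: the same GPU serves many users for nearly the price of one.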
Why It Matters
- Cost — Inference is the dominant cost for AI applications at scale. A popular chatbot can spend $1M+/month on inference.
- Latency — Users expect fast responses. 2+ seconds of latency feels slow. Optimizing inference speed is critical for UX.
- Scalability — Your app might handle 10 requests/minute in development but 10,000/minute in production. Inference infrastructure must scale.
- Edge deployment — Running inference on devices (phones, laptops) instead of the cloud enables offline AI and reduces latency.
Inference Options
| Option | Cost | Latency | Best For |
|---|---|---|---|
| API (Claude, OpenAI) | Pay per token | Low | Most applications |
| Managed hosting (Bedrock, Vertex) | Pay per token + overhead | Low | Enterprise with cloud commitments |
| Self-hosted (RunPod, Lambda) | GPU rental | Variable | Custom models, high volume |
| Local (Ollama, llama.cpp) | Hardware cost only | Variable | Privacy, development, offline |
| Edge (mobile, browser) | Device compute | Higher | Offline, privacy-critical |
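The API-versus-self-hosted trade-off comes down to utilization. A rough break-even sketch, with every price and throughput figure an illustrative assumption rather than a current quote:

```python
# Rough break-even sketch: API pay-per-token vs renting a GPU.
# All figures below are illustrative assumptions, not real quotes.
api_price_per_mtok = 3.0    # $ per 1M tokens via an API (assumed)
gpu_rent_per_hour = 2.0     # $ per hour for one rented GPU (assumed)
gpu_tokens_per_sec = 1000   # aggregate throughput with batching (assumed)

# Tokens the rented GPU can serve in one hour, in millions
mtok_per_hour = gpu_tokens_per_sec * 3600 / 1_000_000
gpu_cost_per_mtok = gpu_rent_per_hour / mtok_per_hour

print(f"API:         ${api_price_per_mtok:.2f} per 1M tokens")
print(f"Self-hosted: ${gpu_cost_per_mtok:.2f} per 1M tokens at full utilization")
# Self-hosting only wins if the GPU stays busy; idle time raises the
# effective per-token cost proportionally.
```

Under these assumptions self-hosting looks cheaper per token, but only at sustained high utilization — at 10% utilization the effective cost is 10x higher, which is why most applications start with an API.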
Example
# API inference — the most common approach
from anthropic import Anthropic
import time
client = Anthropic()
start = time.time()
response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=256,
    messages=[{"role": "user", "content": "What is inference in AI?"}]
)
elapsed = time.time() - start
print(f"Response: {response.content[0].text[:100]}...")
print(f"Input tokens: {response.usage.input_tokens}")
print(f"Output tokens: {response.usage.output_tokens}")
print(f"Total latency: {elapsed:.2f}s")
print(f"Tokens/second: {response.usage.output_tokens / elapsed:.0f}")
# Local inference with Ollama — run models on your machine
import requests
# Ollama runs a local server on port 11434
response = requests.post("http://localhost:11434/api/generate", json={
    "model": "llama3:8b",
    "prompt": "What is inference in AI?",
    "stream": False
})
result = response.json()
print(result["response"])
print(f"Eval duration: {result['eval_duration'] / 1e9:.2f}s")
print(f"Tokens/second: {result['eval_count'] / (result['eval_duration'] / 1e9):.0f}")
# Run a model locally with Ollama
ollama pull llama3:8b # Download model (~4.7GB quantized)
ollama run llama3:8b # Interactive chat
# → Running on CPU or GPU, no API costs, no internet required
Key Takeaways
- Inference = running a trained model to generate output. It’s what happens every time you use an AI tool.
- Training costs millions; inference costs fractions of a cent per request, but those fractions add up at scale.
- Key metrics: latency (time to first token), throughput (tokens served per second), and cost per token.
- Most developers use API inference (Claude, OpenAI); self-hosting makes sense for high volume or custom models.
- Quantization makes inference cheaper and faster by reducing model precision.
Part of the DeepRaft Glossary — AI and ML terms explained for developers.