Inference
Learn what Inference means in AI and machine learning, with examples and related concepts.
Definition
Inference is the process of running a trained AI model to generate predictions or outputs. When you send a prompt to Claude and get a response, that’s inference. When Midjourney generates an image from your text, that’s inference. Every time an AI model produces output from input, it’s performing inference.
Inference is distinct from training (where the model learns from data). Training happens once (or periodically) and costs millions of dollars for large models. Inference happens billions of times per day across all users and is the ongoing operational cost of running AI services.
For most developers and companies, inference is what matters — training is done by the model providers (Anthropic, OpenAI, Google). Your job is to run inference efficiently: low latency, reasonable cost, and high throughput.
How It Works
TRAINING (done by model provider):
Trillions of tokens → GPU cluster (months) → Trained model weights
Cost: $10M-$100M+
Happens: Once
INFERENCE (done every time you use AI):
User prompt → Tokenize → Forward passes through loaded model → Generated output
Cost: Fractions of a cent per request
Happens: Billions of times per day
Inference pipeline:
Input text: "Explain quantum computing"
↓
Tokenize: [Explain, quantum, computing]
↓
Forward pass through model (billions of matrix multiplications)
↓
Generate tokens one by one: ["Quantum", " computing", " uses", ...]
↓
Detokenize and return response
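The pipeline above can be sketched as a toy loop. Everything here is a stand-in — the "tokenizer" is a whitespace split and the "model" is a hard-coded lookup, not a real neural network — but the four stages map one-to-one onto real inference:

```python
# Toy inference loop illustrating the pipeline stages above.
# The "model" is a canned lookup table, not a real LLM.

def tokenize(text):
    # Real tokenizers split into subwords; whitespace split is a stand-in
    return text.split()

def forward_pass(tokens):
    # A real model runs billions of matrix multiplications and returns a
    # probability distribution over the vocabulary; we fake the result
    canned = {"Explain": ["Quantum", "computing", "uses", "qubits", "."]}
    return canned.get(tokens[0], ["[unknown]"])

def generate(prompt, max_tokens=5):
    tokens = tokenize(prompt)                  # 1. tokenize input
    continuation = forward_pass(tokens)        # 2. forward pass
    output = []
    for tok in continuation[:max_tokens]:      # 3. one token per step
        output.append(tok)
    return " ".join(output)                    # 4. detokenize and return

print(generate("Explain quantum computing"))
# → Quantum computing uses qubits .
```

Real inference repeats step 2 for every generated token, feeding each new token back into the model — which is why output length dominates latency.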
Key Metrics
| Metric | What It Measures | Typical Values |
|---|---|---|
| Latency (TTFT) | Time until the first output token arrives | 200 ms - 2 s |
| Tokens per second (TPS) | Generation speed of a single response stream | 30-150 tok/s |
| Throughput | Total tokens served per second across all concurrent requests | Scales with batching and hardware |
| Cost per token | Price of inference | $0.25-$75 per 1M tokens |
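Given timestamps and token counts from a single streamed request, the metrics relate as follows. The numbers below are made-up illustrations, not benchmarks of any particular model:

```python
# Hypothetical measurements for one streamed request (illustrative only)
ttft = 0.4             # seconds until the first token arrives (latency)
total_time = 3.0       # seconds until the last token arrives
output_tokens = 200
price_per_mtok = 15.0  # dollars per 1M output tokens (assumed price)

# Generation speed: tokens produced after the first one arrived
tps = output_tokens / (total_time - ttft)
# Cost scales linearly with token count
cost = output_tokens / 1_000_000 * price_per_mtok

print(f"TTFT: {ttft:.1f}s, TPS: {tps:.0f} tok/s, cost: ${cost:.4f}")
# → TTFT: 0.4s, TPS: 77 tok/s, cost: $0.0030
```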
Batching
Inference providers process multiple requests simultaneously (batching) to maximize GPU utilization:
Without batching:
Request 1 → [GPU idle 80%] → Response 1
Request 2 → [GPU idle 80%] → Response 2
With batching:
Request 1 ─┐
Request 2 ─┤→ [GPU utilized 90%] → Response 1, Response 2
Request 3 ─┘ Response 3
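The intuition behind batching can be sketched with a toy cost model: token-by-token decoding is memory-bandwidth-bound, so one forward step takes roughly the same wall-clock time whether the GPU processes one request or a small batch. All numbers here are assumptions for illustration:

```python
# Toy cost model: a decoding step costs ~the same whether it advances
# 1 request or a small batch, because decoding is bandwidth-bound.
step_time = 0.02          # seconds per forward step (assumed)
tokens_per_request = 100
requests = 8

# Serial: each request runs its 100 steps alone
serial_time = requests * tokens_per_request * step_time
# Batched: all 8 requests advance one token per step,
# with a small assumed per-batch overhead (10%)
batched_time = tokens_per_request * (step_time * 1.1)

print(f"Serial:  {serial_time:.1f}s total")
print(f"Batched: {batched_time:.1f}s total "
      f"({serial_time / batched_time:.1f}x speedup)")
# → Serial:  16.0s total
# → Batched: 2.2s total (7.3x speedup)
```

This is why providers aggressively batch: the same GPU serves many users for nearly the price of one.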
Why It Matters
- Cost — Inference is the dominant cost for AI applications at scale. A popular chatbot can spend $1M+/month on inference.
- Latency — Users expect fast responses. 2+ seconds of latency feels slow. Optimizing inference speed is critical for UX.
- Scalability — Your app might handle 10 requests/minute in development but 10,000/minute in production. Inference infrastructure must scale.
- Edge deployment — Running inference on devices (phones, laptops) instead of the cloud enables offline AI and reduces latency.
Inference Options
| Option | Cost | Latency | Best For |
|---|---|---|---|
| API (Claude, OpenAI) | Pay per token | Low | Most applications |
| Managed hosting (Bedrock, Vertex) | Pay per token + overhead | Low | Enterprise with cloud commitments |
| Self-hosted (RunPod, Lambda) | GPU rental | Variable | Custom models, high volume |
| Local (Ollama, llama.cpp) | Hardware cost only | Variable | Privacy, development, offline |
| Edge (mobile, browser) | Device compute | Higher | Offline, privacy-critical |
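The API-versus-self-hosted trade-off comes down to utilization. A rough break-even sketch, with every price and throughput figure an illustrative assumption rather than a current quote:

```python
# Rough break-even sketch: API pay-per-token vs renting a GPU.
# All figures below are illustrative assumptions, not real quotes.
api_price_per_mtok = 3.0    # $ per 1M tokens via an API (assumed)
gpu_rent_per_hour = 2.0     # $ per hour for one rented GPU (assumed)
gpu_tokens_per_sec = 1000   # aggregate throughput with batching (assumed)

# Tokens the rented GPU can serve in one hour, in millions
mtok_per_hour = gpu_tokens_per_sec * 3600 / 1_000_000
gpu_cost_per_mtok = gpu_rent_per_hour / mtok_per_hour

print(f"API:         ${api_price_per_mtok:.2f} per 1M tokens")
print(f"Self-hosted: ${gpu_cost_per_mtok:.2f} per 1M tokens at full utilization")
# Self-hosting only wins if the GPU stays busy; idle time raises the
# effective per-token cost proportionally.
```

Under these assumptions self-hosting looks cheaper per token, but only at sustained high utilization — at 10% utilization the effective cost is 10x higher, which is why most applications start with an API.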
Example
# API inference — the most common approach
from anthropic import Anthropic
import time
client = Anthropic()
start = time.time()
response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=256,
    messages=[{"role": "user", "content": "What is inference in AI?"}]
)
elapsed = time.time() - start
print(f"Response: {response.content[0].text[:100]}...")
print(f"Input tokens: {response.usage.input_tokens}")
print(f"Output tokens: {response.usage.output_tokens}")
print(f"Total latency: {elapsed:.2f}s")
print(f"Tokens/second: {response.usage.output_tokens / elapsed:.0f}")
# Local inference with Ollama — run models on your machine
import requests
# Ollama runs a local server on port 11434
response = requests.post("http://localhost:11434/api/generate", json={
    "model": "llama3:8b",
    "prompt": "What is inference in AI?",
    "stream": False
})
result = response.json()
print(result["response"])
print(f"Eval duration: {result['eval_duration'] / 1e9:.2f}s")
print(f"Tokens/second: {result['eval_count'] / (result['eval_duration'] / 1e9):.0f}")
# Run a model locally with Ollama
ollama pull llama3:8b # Download model (~4.7GB quantized)
ollama run llama3:8b # Interactive chat
# → Running on CPU or GPU, no API costs, no internet required
Key Takeaways
- Inference = running a trained model to generate output. It’s what happens every time you use an AI tool.
- Training costs millions; inference costs fractions of a cent per request, but those fractions add up at scale.
- Key metrics: latency (time to first token), throughput (tokens served per second), and cost per token.
- Most developers use API inference (Claude, OpenAI); self-hosting makes sense for high volume or custom models.
- Quantization makes inference cheaper and faster by reducing model precision.
Part of the DeepRaft Glossary — AI and ML terms explained for developers.