
Inference

Learn what Inference means in AI and machine learning, with examples and related concepts.

Definition

Inference is the process of running a trained AI model to generate predictions or outputs. When you send a prompt to Claude and get a response, that’s inference. When Midjourney generates an image from your text, that’s inference. Every time an AI model produces output from input, it’s performing inference.

Inference is distinct from training (where the model learns from data). Training happens once (or periodically) and costs millions of dollars for large models. Inference happens billions of times per day across all users and is the ongoing operational cost of running AI services.

For most developers and companies, inference is what matters — training is done by the model providers (Anthropic, OpenAI, Google). Your job is to run inference efficiently: low latency, reasonable cost, and high throughput.

How It Works

TRAINING (done by model provider):
  Trillions of tokens → GPU cluster (months) → Trained model weights
  Cost: $10M-$100M+
  Happens: Once

INFERENCE (done every time you use AI):
  User prompt → Load model → Forward pass → Generated output
  Cost: Fractions of a cent per request
  Happens: Billions of times per day

Inference pipeline:
  Input text: "Explain quantum computing"

  Tokenize: [Explain, quantum, computing]

  Forward pass through model (billions of matrix multiplications)

  Generate tokens one by one: ["Quantum", " computing", " uses", ...]

  Detokenize and return response
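The pipeline above can be sketched in miniature. The toy vocabulary and lookup-table "model" below are illustrative stand-ins for a real tokenizer and neural network, but the loop structure is the same: predict one token at a time until an end marker appears.

```python
# Toy sketch of the inference loop: tokenize, predict the next token
# repeatedly, stop at an end marker, then detokenize. The "model" here
# is a lookup table; a real model runs a neural forward pass instead.

def tokenize(text):
    return text.split()

def detokenize(tokens):
    return " ".join(tokens)

# Hypothetical "trained weights": maps the last token seen to the next one.
NEXT_TOKEN = {
    "computing": "Quantum",
    "Quantum": "bits",
    "bits": "store",
    "store": "superpositions",
    "superpositions": "<eos>",
}

def generate(prompt, max_tokens=10):
    tokens = tokenize(prompt)
    output = []
    for _ in range(max_tokens):
        nxt = NEXT_TOKEN.get(tokens[-1], "<eos>")  # the "forward pass"
        if nxt == "<eos>":
            break
        tokens.append(nxt)   # generated tokens feed back in as context
        output.append(nxt)
    return detokenize(output)

print(generate("Explain quantum computing"))
# → Quantum bits store superpositions
```

Note that each generated token is appended to the context before predicting the next one; this feedback loop is why output is produced token by token rather than all at once.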

Key Metrics

Metric | What It Measures | Typical Values
Latency | Time to first token (TTFT) | 200ms - 2s
Throughput | Tokens generated per second | 30-100 tok/s
Tokens per second (TPS) | Total output speed | 50-150 for fast models
Cost per token | Price of inference | $0.25-$75 per 1M tokens
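Given measured timings and token counts, these metrics are simple arithmetic. The numbers below are made-up illustrations, not real provider prices:

```python
# Computing the key inference metrics from hypothetical measurements.
ttft = 0.4                  # seconds until the first token arrived
total_time = 3.0            # seconds for the full response
input_tokens = 50
output_tokens = 200
price_per_m_input = 3.00    # illustrative $ per 1M input tokens
price_per_m_output = 15.00  # illustrative $ per 1M output tokens

# Generation speed after the first token appears
tps = output_tokens / (total_time - ttft)

# Request cost: input and output tokens are priced separately
cost = (input_tokens * price_per_m_input +
        output_tokens * price_per_m_output) / 1_000_000

print(f"TTFT: {ttft*1000:.0f}ms, throughput: {tps:.0f} tok/s, cost: ${cost:.6f}")
# → TTFT: 400ms, throughput: 77 tok/s, cost: $0.003150
```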

Batching

Inference providers process multiple requests simultaneously (batching) to maximize GPU utilization:

Without batching:
  Request 1 → [GPU idle 80%] → Response 1
  Request 2 → [GPU idle 80%] → Response 2

With batching:
  Request 1 ─┐
  Request 2 ─┤→ [GPU utilized 90%] → Response 1, Response 2
  Request 3 ─┘                       Response 3
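The throughput gain from batching is easy to see with rough numbers. This sketch assumes a fixed per-step forward-pass time regardless of batch size, which holds approximately while generation is limited by loading model weights from GPU memory rather than by arithmetic; the timings are hypothetical:

```python
# Back-of-envelope batching model: one generation step takes roughly
# the same wall-clock time for 1 request as for 8, because each step is
# dominated by reading model weights from GPU memory, not by compute.
step_time = 0.025  # hypothetical seconds per forward pass

def throughput(batch_size, tokens_per_request=100):
    total_tokens = batch_size * tokens_per_request
    total_time = tokens_per_request * step_time  # requests advance together
    return total_tokens / total_time             # tokens/s across all requests

print(f"batch=1: {throughput(1):.0f} tok/s")  # → batch=1: 40 tok/s
print(f"batch=8: {throughput(8):.0f} tok/s")  # → batch=8: 320 tok/s
```

Under this assumption, aggregate throughput scales nearly linearly with batch size, which is why providers batch aggressively; the trade-off is that very large batches can increase per-request latency.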

Why It Matters

Inference Options

Option | Cost | Latency | Best For
API (Claude, OpenAI) | Pay per token | Low | Most applications
Managed hosting (Bedrock, Vertex) | Pay per token + overhead | Low | Enterprise with cloud commitments
Self-hosted (RunPod, Lambda) | GPU rental | Variable | Custom models, high volume
Local (Ollama, llama.cpp) | Hardware cost only | Variable | Privacy, development, offline
Edge (mobile, browser) | Device compute | Higher | Offline, privacy-critical

Example

# API inference — the most common approach
from anthropic import Anthropic
import time

client = Anthropic()

start = time.time()
response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=256,
    messages=[{"role": "user", "content": "What is inference in AI?"}]
)
elapsed = time.time() - start

print(f"Response: {response.content[0].text[:100]}...")
print(f"Input tokens: {response.usage.input_tokens}")
print(f"Output tokens: {response.usage.output_tokens}")
print(f"Total latency: {elapsed:.2f}s")
print(f"Tokens/second: {response.usage.output_tokens / elapsed:.0f}")

# Local inference with Ollama — run models on your machine
import requests

# Ollama runs a local server on port 11434
response = requests.post("http://localhost:11434/api/generate", json={
    "model": "llama3:8b",
    "prompt": "What is inference in AI?",
    "stream": False
})

result = response.json()
print(result["response"])
print(f"Eval duration: {result['eval_duration'] / 1e9:.2f}s")
print(f"Tokens/second: {result['eval_count'] / (result['eval_duration'] / 1e9):.0f}")

# Run a model locally with Ollama
ollama pull llama3:8b        # Download model (~4.7GB quantized)
ollama run llama3:8b         # Interactive chat
# → Running on CPU or GPU, no API costs, no internet required

Part of the DeepRaft Glossary — AI and ML terms explained for developers.