Quantization

Learn what Quantization means in AI and machine learning, with examples and related concepts.

Definition

Quantization is the process of reducing the numerical precision of a model’s weights — typically from 16-bit floating point (FP16) to 8-bit (INT8) or 4-bit (INT4) integers. This makes models smaller, faster, and cheaper to run, with minimal loss in quality.

Think of it like reducing the resolution of an image. A 4K photo (full precision) has incredible detail, but a 1080p version (quantized) looks nearly identical to most viewers and takes up 4x less space. Similarly, a 4-bit quantized Llama 3 70B model runs on a single consumer GPU instead of requiring a $10K+ server setup — while producing nearly the same quality output.

Quantization is what makes local AI possible. Without it, running a 70B parameter model would require 140GB of GPU memory (at FP16). With 4-bit quantization, it fits in 35GB — within reach of a single high-end GPU.
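The memory math above is easy to sketch in a few lines. This is a rough estimate of weight storage only; it ignores activations, KV cache, and runtime overhead, so real requirements run somewhat higher:

```python
# Rough size of a model's weights at a given precision.
# Ignores activations, KV cache, and runtime overhead.
def weight_memory_gb(n_params, bits_per_weight):
    return n_params * bits_per_weight / 8 / 1e9  # bits -> bytes -> GB

print(weight_memory_gb(70e9, 16))  # FP16: 140.0 GB
print(weight_memory_gb(70e9, 8))   # INT8:  70.0 GB
print(weight_memory_gb(70e9, 4))   # INT4:  35.0 GB
```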

How It Works

Full Precision (FP16) — 16 bits per weight:
  Weight value: 0.123456789...
  Storage: 2 bytes per parameter
  70B model = 140 GB

INT8 Quantization — 8 bits per weight:
  Weight value: 0.12 (rounded)
  Storage: 1 byte per parameter
  70B model = 70 GB (2x smaller)

INT4 Quantization — 4 bits per weight:
  Weight value: 0.1 (more rounding)
  Storage: 0.5 bytes per parameter
  70B model = 35 GB (4x smaller)
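The rounding shown above can be made concrete with a toy symmetric quantizer: one scale factor maps floats onto a small integer grid, and dequantizing recovers only an approximation. This is a didactic sketch, not what production methods like GPTQ or AWQ actually do (those use per-group scales and calibration data):

```python
# Toy symmetric per-tensor quantization: floats -> small integers -> floats.
def quantize(xs, bits):
    qmax = 2 ** (bits - 1) - 1                 # 127 for INT8, 7 for INT4
    scale = max(abs(x) for x in xs) / qmax     # one scale for the whole tensor
    q = [max(-qmax - 1, min(qmax, round(x / scale))) for x in xs]
    return q, scale

def dequantize(q, scale):
    return [v * scale for v in q]

weights = [0.123456789, -0.5, 0.25, 0.9]
q8, s8 = quantize(weights, 8)   # fine grid: ~0.007 between levels
q4, s4 = quantize(weights, 4)   # coarse grid: ~0.13 between levels
print(dequantize(q8, s8))       # close to the originals
print(dequantize(q4, s4))       # visibly rounded
```

Note how the 4-bit version snaps every weight to one of only 16 levels; that is the entire source of quantization's quality loss.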

Quantization Methods

Method | Bits | Quality Loss | Speed | Popular Format
-------|------|--------------|-------|---------------
FP16 | 16 | None (baseline) | Baseline | Native PyTorch
INT8 | 8 | Minimal | ~1.5x faster | bitsandbytes
GPTQ | 4 | Small | ~2-3x faster | .safetensors
AWQ | 4 | Small | ~2-3x faster | .safetensors
GGUF | 2-8 | Varies | CPU-friendly | .gguf (llama.cpp)

GGUF — The Local AI Standard

GGUF (GPT-Generated Unified Format) is the file format used by llama.cpp and Ollama. It’s optimized for CPU inference and supports mixed quantization:

Quantization levels in GGUF:
  Q2_K  — 2-bit  (smallest, noticeable quality loss)
  Q4_K_M — 4-bit  (best balance of size/quality)  ← Most popular
  Q5_K_M — 5-bit  (slightly better quality)
  Q6_K  — 6-bit  (near FP16 quality)
  Q8_0  — 8-bit  (minimal quality loss)
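As a rule of thumb, a GGUF file needs roughly (parameters × bits / 8) bytes, so you can estimate which level fits your hardware. The helper below is a hypothetical sketch using that approximation; real files run slightly larger because some tensors stay at higher precision:

```python
# Hypothetical helper: largest GGUF quant level that fits a memory budget,
# using the nominal bits-per-weight approximation (real files are a bit larger).
LEVELS = {"Q2_K": 2, "Q4_K_M": 4, "Q5_K_M": 5, "Q6_K": 6, "Q8_0": 8}

def best_quant(n_params, budget_gb):
    fitting = [(bits, name) for name, bits in LEVELS.items()
               if n_params * bits / 8 / 1e9 <= budget_gb]
    return max(fitting)[1] if fitting else None

print(best_quant(8e9, 5.2))    # Q5_K_M: an 8B model at 5 bits is ~5.0 GB
print(best_quant(70e9, 40.0))  # Q4_K_M: 70B at 4 bits is ~35 GB
```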

Why It Matters

Quantization is the gateway to affordable, private AI. By cutting memory requirements 2-4x and speeding up inference, it turns "needs a data-center GPU" into "runs on a laptop," giving up only a few percent of benchmark quality in exchange.

Example

# Run a quantized model with Ollama (simplest approach)
# Ollama automatically uses quantized GGUF models

ollama pull llama3:8b           # 4-bit quantized, ~4.7GB
ollama pull llama3:70b          # 4-bit quantized, ~40GB
ollama pull mistral:7b-q5_K_M   # Specific quantization level

# Compare sizes:
# llama3:8b  FP16 = 16GB  →  Q4 = 4.7GB  (3.4x smaller)
# llama3:70b FP16 = 140GB →  Q4 = 40GB   (3.5x smaller)

# Quantize a model using bitsandbytes (INT8/INT4)
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Load a model in 4-bit quantization
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype="float16",
    bnb_4bit_quant_type="nf4",  # NormalFloat4 — best for LLMs
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3-8B",
    quantization_config=quantization_config,
    device_map="auto",
)

# Original: ~16GB in FP16
# Quantized: ~5GB in 4-bit
# Quality: ~98% of original on most benchmarks

# Compare quantization levels programmatically
import os

models = {
    "Q4_K_M": "llama-3-8b-Q4_K_M.gguf",   # 4.7 GB
    "Q5_K_M": "llama-3-8b-Q5_K_M.gguf",   # 5.5 GB
    "Q8_0":   "llama-3-8b-Q8_0.gguf",      # 8.5 GB
}

for quant, filename in models.items():
    size_gb = os.path.getsize(filename) / (1024**3)
    print(f"{quant}: {size_gb:.1f} GB")

# In practice:
# Q4_K_M — Best for most users (good quality, fits consumer GPUs)
# Q5_K_M — Slightly better quality, worth it if you have the RAM
# Q8_0   — Near-perfect quality, use if you have 16GB+ VRAM

Quantization Quality Impact

Real-world benchmark comparison (Llama 3 8B):

Quantization | Size | MMLU Score | Perplexity | Quality vs FP16
-------------|------|------------|------------|----------------
FP16 | 16.0 GB | 66.6 | 6.14 | 100% (baseline)
Q8_0 | 8.5 GB | 66.4 | 6.16 | ~99.7%
Q5_K_M | 5.5 GB | 65.8 | 6.25 | ~98.8%
Q4_K_M | 4.7 GB | 64.9 | 6.38 | ~97.4%
Q2_K | 3.2 GB | 58.1 | 7.89 | ~87.2%

The sweet spot is Q4_K_M — 4x compression with less than 3% quality loss.
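The "Quality vs FP16" column is simply each MMLU score divided by the FP16 baseline; recomputing it makes the trade-off explicit:

```python
# Recompute the quality column from the MMLU scores in the table above.
baseline = 66.6  # FP16 MMLU score
scores = {"Q8_0": 66.4, "Q5_K_M": 65.8, "Q4_K_M": 64.9, "Q2_K": 58.1}
for name, score in scores.items():
    print(f"{name}: {100 * score / baseline:.1f}% of FP16 quality")
# Q4_K_M retains ~97.4%, so the quality loss is under 3%.
```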

Key Takeaways

- Quantization trades a small amount of quality for large savings: 4-bit cuts memory roughly 4x versus FP16.
- Q4_K_M is the practical sweet spot for GGUF models, keeping about 97% of FP16 quality at a quarter of the size.
- Below 4 bits, quality degrades sharply (see Q2_K), so go lower only when memory truly forces it.


Part of the DeepRaft Glossary — AI and ML terms explained for developers.