Quantization

Learn what Quantization means in AI and machine learning, with examples and related concepts.

Definition

Quantization is the process of reducing the numerical precision of a model’s weights — typically from 16-bit floating point (FP16) to 8-bit (INT8) or 4-bit (INT4) integers. This makes models smaller, faster, and cheaper to run, with minimal loss in quality.

Think of it like reducing the resolution of an image. A 4K photo (full precision) has incredible detail, but a 1080p version (quantized) looks nearly identical to most viewers and takes up 4x less space. Similarly, a 4-bit quantized Llama 3 70B model runs on a single consumer GPU instead of requiring a $10K+ server setup — while producing nearly the same quality output.

Quantization is what makes local AI possible. Without it, running a 70B parameter model would require 140GB of GPU memory (at FP16). With 4-bit quantization, it fits in 35GB — within reach of a single high-end GPU.
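The memory math above is easy to sketch in a few lines. This is a rough estimate of weight storage only; it ignores activations, KV cache, and runtime overhead, so real requirements run somewhat higher:

```python
# Rough size of a model's weights at a given precision.
# Ignores activations, KV cache, and runtime overhead.
def weight_memory_gb(n_params, bits_per_weight):
    return n_params * bits_per_weight / 8 / 1e9  # bits -> bytes -> GB

print(weight_memory_gb(70e9, 16))  # FP16: 140.0 GB
print(weight_memory_gb(70e9, 8))   # INT8:  70.0 GB
print(weight_memory_gb(70e9, 4))   # INT4:  35.0 GB
```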

How It Works

Full Precision (FP16) — 16 bits per weight:
  Weight value: 0.123456789...
  Storage: 2 bytes per parameter
  70B model = 140 GB

INT8 Quantization — 8 bits per weight:
  Weight value: 0.12 (rounded)
  Storage: 1 byte per parameter
  70B model = 70 GB (2x smaller)

INT4 Quantization — 4 bits per weight:
  Weight value: 0.1 (more rounding)
  Storage: 0.5 bytes per parameter
  70B model = 35 GB (4x smaller)
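The rounding shown above can be made concrete with a toy symmetric quantizer: one scale factor maps floats onto a small integer grid, and dequantizing recovers only an approximation. This is a didactic sketch, not what production methods like GPTQ or AWQ actually do (those use per-group scales and calibration data):

```python
# Toy symmetric per-tensor quantization: floats -> small integers -> floats.
def quantize(xs, bits):
    qmax = 2 ** (bits - 1) - 1                 # 127 for INT8, 7 for INT4
    scale = max(abs(x) for x in xs) / qmax     # one scale for the whole tensor
    q = [max(-qmax - 1, min(qmax, round(x / scale))) for x in xs]
    return q, scale

def dequantize(q, scale):
    return [v * scale for v in q]

weights = [0.123456789, -0.5, 0.25, 0.9]
q8, s8 = quantize(weights, 8)   # fine grid: ~0.007 between levels
q4, s4 = quantize(weights, 4)   # coarse grid: ~0.13 between levels
print(dequantize(q8, s8))       # close to the originals
print(dequantize(q4, s4))       # visibly rounded
```

Note how the 4-bit version snaps every weight to one of only 16 levels; that is the entire source of quantization's quality loss.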

Quantization Methods

Method | Bits | Quality Loss | Speed | Popular Format
-------|------|--------------|-------|---------------
FP16 | 16 | None (baseline) | Baseline | Native PyTorch
INT8 | 8 | Minimal | ~1.5x faster | bitsandbytes
GPTQ | 4 | Small | ~2-3x faster | .safetensors
AWQ | 4 | Small | ~2-3x faster | .safetensors
GGUF | 2-8 | Varies | CPU-friendly | .gguf (llama.cpp)

GGUF — The Local AI Standard

GGUF (GPT-Generated Unified Format) is the file format used by llama.cpp and Ollama. It’s optimized for CPU inference and supports mixed quantization:

Quantization levels in GGUF:
  Q2_K  — 2-bit  (smallest, noticeable quality loss)
  Q4_K_M — 4-bit  (best balance of size/quality)  ← Most popular
  Q5_K_M — 5-bit  (slightly better quality)
  Q6_K  — 6-bit  (near FP16 quality)
  Q8_0  — 8-bit  (minimal quality loss)
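As a rule of thumb, a GGUF file needs roughly (parameters × bits / 8) bytes, so you can estimate which level fits your hardware. The helper below is a hypothetical sketch using that approximation; real files run slightly larger because some tensors stay at higher precision:

```python
# Hypothetical helper: largest GGUF quant level that fits a memory budget,
# using the nominal bits-per-weight approximation (real files are a bit larger).
LEVELS = {"Q2_K": 2, "Q4_K_M": 4, "Q5_K_M": 5, "Q6_K": 6, "Q8_0": 8}

def best_quant(n_params, budget_gb):
    fitting = [(bits, name) for name, bits in LEVELS.items()
               if n_params * bits / 8 / 1e9 <= budget_gb]
    return max(fitting)[1] if fitting else None

print(best_quant(8e9, 5.2))    # Q5_K_M: an 8B model at 5 bits is ~5.0 GB
print(best_quant(70e9, 40.0))  # Q4_K_M: 70B at 4 bits is ~35 GB
```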

Why It Matters

Quantization is the gateway to affordable, private AI. By cutting memory requirements 2-4x and speeding up inference, it turns "needs a data-center GPU" into "runs on a laptop," giving up only a few percent of benchmark quality in exchange.

Example

# Run a quantized model with Ollama (simplest approach)
# Ollama automatically uses quantized GGUF models

ollama pull llama3:8b           # 4-bit quantized, ~4.7GB
ollama pull llama3:70b          # 4-bit quantized, ~40GB
ollama pull mistral:7b-q5_K_M   # Specific quantization level

# Compare sizes:
# llama3:8b  FP16 = 16GB  →  Q4 = 4.7GB  (3.4x smaller)
# llama3:70b FP16 = 140GB →  Q4 = 40GB   (3.5x smaller)

# Quantize a model using bitsandbytes (INT8/INT4)
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Load a model in 4-bit quantization
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype="float16",
    bnb_4bit_quant_type="nf4",  # NormalFloat4 — best for LLMs
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3-8B",
    quantization_config=quantization_config,
    device_map="auto",
)

# Original: ~16GB in FP16
# Quantized: ~5GB in 4-bit
# Quality: ~98% of original on most benchmarks

# Compare quantization levels programmatically
import os

models = {
    "Q4_K_M": "llama-3-8b-Q4_K_M.gguf",   # 4.7 GB
    "Q5_K_M": "llama-3-8b-Q5_K_M.gguf",   # 5.5 GB
    "Q8_0":   "llama-3-8b-Q8_0.gguf",      # 8.5 GB
}

for quant, filename in models.items():
    size_gb = os.path.getsize(filename) / (1024**3)
    print(f"{quant}: {size_gb:.1f} GB")

# In practice:
# Q4_K_M — Best for most users (good quality, fits consumer GPUs)
# Q5_K_M — Slightly better quality, worth it if you have the RAM
# Q8_0   — Near-perfect quality, use if you have 16GB+ VRAM

Quantization Quality Impact

Real-world benchmark comparison (Llama 3 8B):

Quantization | Size | MMLU Score | Perplexity | Quality vs FP16
-------------|------|------------|------------|----------------
FP16 | 16.0 GB | 66.6 | 6.14 | 100% (baseline)
Q8_0 | 8.5 GB | 66.4 | 6.16 | ~99.7%
Q5_K_M | 5.5 GB | 65.8 | 6.25 | ~98.8%
Q4_K_M | 4.7 GB | 64.9 | 6.38 | ~97.4%
Q2_K | 3.2 GB | 58.1 | 7.89 | ~87.2%

The sweet spot is Q4_K_M — 4x compression with less than 3% quality loss.
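The "Quality vs FP16" column is simply each MMLU score divided by the FP16 baseline; recomputing it makes the trade-off explicit:

```python
# Recompute the quality column from the MMLU scores in the table above.
baseline = 66.6  # FP16 MMLU score
scores = {"Q8_0": 66.4, "Q5_K_M": 65.8, "Q4_K_M": 64.9, "Q2_K": 58.1}
for name, score in scores.items():
    print(f"{name}: {100 * score / baseline:.1f}% of FP16 quality")
# Q4_K_M retains ~97.4%, so the quality loss is under 3%.
```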

Key Takeaways

- Quantization trades a small amount of quality for large savings: 4-bit cuts memory roughly 4x versus FP16.
- Q4_K_M is the practical sweet spot for GGUF models, keeping about 97% of FP16 quality at a quarter of the size.
- Below 4 bits, quality degrades sharply (see Q2_K), so go lower only when memory truly forces it.


Part of the DeepRaft Glossary — AI and ML terms explained for developers.