
RLHF

Learn what RLHF (Reinforcement Learning from Human Feedback) means in AI and machine learning, with examples and related concepts.

Definition

RLHF stands for Reinforcement Learning from Human Feedback — a training technique that aligns LLMs with human preferences by using human judgments as a training signal.

Here’s the core problem RLHF solves: a base LLM trained on internet text is great at predicting the next word, but it’s not great at being helpful, honest, or safe. It might generate toxic content, confidently make things up, or refuse to answer simple questions. RLHF teaches the model to behave the way humans actually want it to.

RLHF is what turned GPT-3 (impressive but erratic) into ChatGPT (useful and conversational). It’s also a key part of how Claude, Gemini, and every other major chatbot are trained. Without RLHF (or similar alignment techniques), LLMs would be powerful but unreliable.

How It Works

RLHF is a three-stage process that happens after the base model is pre-trained:

Stage 1: SUPERVISED FINE-TUNING (SFT)
  ┌───────────────────────────────────────────────┐
  │ Human trainers write ideal responses          │
  │ to a set of prompts                           │
  │                                               │
  │ Prompt: "Explain gravity to a 5-year-old"     │
  │ Human: "Imagine you're holding a ball..."     │
  │                                               │
  │ → Fine-tune the base model on these examples  │
  └───────────────────────────────────────────────┘
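The SFT objective is ordinary next-token prediction, just computed on the curated human-written responses. A minimal numeric sketch (toy per-token log-probabilities, not a real model):

```python
import numpy as np

def sft_loss(target_token_logprobs):
    # Supervised fine-tuning minimizes the average negative log-likelihood
    # of the tokens in the human-written response (next-token prediction).
    return -np.mean(target_token_logprobs)

# Toy log-probs the model assigns to each token of the ideal response;
# a lower loss means the model reproduces the trainer's response more closely.
print(sft_loss(np.array([-0.5, -1.2, -0.3])))
```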

Stage 2: REWARD MODEL TRAINING
  ┌───────────────────────────────────────────────┐
  │ Generate multiple responses to each prompt    │
  │ Humans rank them: Response A > Response B     │
  │                                               │
  │ Response A: "Gravity is like a magnet..."     │
  │ Response B: "Gravitational force is 9.8m/s²"  │
  │ Human ranks: A > B (better for a 5-year-old)  │
  │                                               │
  │ → Train a reward model to predict these ranks │
  └───────────────────────────────────────────────┘
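The rankings are typically converted into a pairwise (Bradley-Terry style) training loss on the reward model's scalar scores. A minimal sketch with toy scores:

```python
import numpy as np

def pairwise_reward_loss(score_chosen, score_rejected):
    # Bradley-Terry style loss: -log sigmoid(chosen - rejected).
    # Minimizing it pushes the reward model to score the human-preferred
    # response above the rejected one.
    margin = score_chosen - score_rejected
    return -np.log(1.0 / (1.0 + np.exp(-margin)))

# Toy scores for one (chosen, rejected) pair
print(pairwise_reward_loss(2.0, 0.5))   # small loss: model agrees with the ranking
print(pairwise_reward_loss(0.5, 2.0))   # large loss: model disagrees
```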

Stage 3: REINFORCEMENT LEARNING (PPO)
  ┌───────────────────────────────────────────────┐
  │ The LLM generates responses                   │
  │ The reward model scores them                  │
  │ The LLM is updated to maximize reward scores  │
  │                                               │
  │ Loop: generate → score → update → repeat      │
  │ (thousands of iterations)                     │
  └───────────────────────────────────────────────┘
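In practice the PPO-stage reward is usually not the raw reward-model score: a KL penalty against the frozen SFT (reference) model is subtracted, so the policy can't drift into gibberish that games the reward model. A numeric sketch (toy per-token log-probs; the beta value is an assumption):

```python
import numpy as np

def rlhf_reward(rm_score, logprobs_policy, logprobs_ref, beta=0.1):
    # Common PPO-stage formulation: reward-model score minus a KL penalty
    # that keeps the tuned policy close to the reference (SFT) model.
    # beta controls the penalty strength (assumed toy value here).
    kl = np.sum(logprobs_policy - logprobs_ref)  # summed per-token log-ratio
    return rm_score - beta * kl

# Toy numbers: the policy has drifted slightly from the reference model
policy_lp = np.array([-1.0, -0.8, -1.2])
ref_lp    = np.array([-1.1, -1.0, -1.3])
print(rlhf_reward(rm_score=1.5, logprobs_policy=policy_lp, logprobs_ref=ref_lp))
```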

The reward model is the crucial piece — it acts as a scalable proxy for human judgment. Instead of needing a human to evaluate every response, the reward model learns to predict what humans would prefer.
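One concrete way to see the reward model acting as a scalable judge is best-of-n sampling: generate several candidates and keep the one it scores highest. The sketch below uses a toy stand-in reward function (it just prefers shorter answers), not a learned model:

```python
import numpy as np

def best_of_n(responses, reward_fn):
    # Score every candidate with the (stand-in) reward model
    # and return the highest-scoring one.
    scores = [reward_fn(r) for r in responses]
    return responses[int(np.argmax(scores))]

# Toy reward: prefer shorter answers (a stand-in for a learned reward model)
def toy_reward(response):
    return -len(response.split())

candidates = [
    "Gravity is like an invisible hand that pulls things toward the ground",
    "Gravity pulls things down",
]
print(best_of_n(candidates, toy_reward))
```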

Why It Matters

The Alignment Problem

RLHF is part of the broader AI alignment challenge: ensuring AI systems do what humans want. Anthropic (Claude’s creator) has been particularly focused on this, developing Constitutional AI (CAI) — a variation where the model is trained against a set of principles rather than just human rankings.

RLHF vs Other Alignment Methods

Method             How It Works                                       Used By
RLHF               Human rankings → reward model → RL                 ChatGPT, Gemini
RLAIF              AI rankings → reward model → RL                    Claude (partially)
Constitutional AI  Principles + self-critique → RL                    Claude
DPO                Direct preference optimization (no reward model)   Llama, Mixtral
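DPO is worth a closer look because it removes the reward model entirely: the policy is optimized directly on preference pairs via log-probability ratios against a frozen reference model. A numeric sketch (toy sequence log-probs; beta is an assumed hyperparameter):

```python
import numpy as np

def dpo_loss(lp_chosen, lp_rejected, ref_lp_chosen, ref_lp_rejected, beta=0.1):
    # DPO loss (sketch): -log sigmoid of the beta-scaled difference between
    # the policy/reference log-ratios of the chosen and rejected responses.
    # Minimizing it raises the chosen response's probability relative to the
    # reference model, with no separate reward model involved.
    margin = beta * ((lp_chosen - ref_lp_chosen) - (lp_rejected - ref_lp_rejected))
    return -np.log(1.0 / (1.0 + np.exp(-margin)))

# Toy sequence log-probs: the policy already slightly prefers the chosen response
print(dpo_loss(lp_chosen=-10.0, lp_rejected=-12.0,
               ref_lp_chosen=-11.0, ref_lp_rejected=-11.5))
```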

Example

# While you can't easily run full RLHF yourself, here's the conceptual flow
# using the trl (Transformer Reinforcement Learning) library.
# NOTE: this is simplified pseudocode; the real trl trainers also need
# tokenizers, configs, and properly formatted datasets.

# Stage 1: Supervised Fine-tuning
from trl import SFTTrainer

sft_trainer = SFTTrainer(
    model=base_model,
    train_dataset=human_written_examples,
    # ... trains the model on ideal responses
)
sft_trainer.train()

# Stage 2: Reward Model Training
from trl import RewardTrainer

reward_trainer = RewardTrainer(
    model=reward_model,
    train_dataset=human_ranked_pairs,
    # Each example: (prompt, chosen_response, rejected_response)
)
reward_trainer.train()

# Stage 3: PPO Training
from trl import PPOTrainer

ppo_trainer = PPOTrainer(
    model=sft_model,
    reward_model=reward_model,
    # ... optimizes the model to maximize reward scores
)

for batch in prompts:
    responses = ppo_trainer.generate(batch)
    rewards = reward_model.score(batch, responses)
    ppo_trainer.step(batch, responses, rewards)

# You can see the effect of RLHF by comparing base vs chat models
from anthropic import Anthropic

client = Anthropic()

# Claude (RLHF-trained) responds helpfully and safely
response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=200,
    messages=[{"role": "user", "content": "How do I pick a lock?"}]
)
# → Discusses locksmithing as a profession, recommends contacting a locksmith
# → Does NOT provide instructions for illegal entry
# This helpful-but-safe behavior is the result of RLHF-style alignment
# (in Claude's case, the Constitutional AI variation described above)

Key Takeaways

- RLHF aligns a pre-trained LLM with human preferences in three stages: supervised fine-tuning, reward model training, and reinforcement learning (PPO).
- The reward model is the scalable proxy for human judgment; it lets the RL stage run without a human rating every response.
- RLHF (and similar techniques) is what turned raw base models into helpful assistants like ChatGPT and Claude.
- Alternatives and refinements exist: RLAIF swaps in AI feedback, Constitutional AI trains against a set of principles, and DPO drops the reward model entirely.

Part of the DeepRaft Glossary — AI and ML terms explained for developers.