ControlNet

Learn what ControlNet means in AI and machine learning, with examples and related concepts.

Definition

ControlNet is a neural network architecture that adds precise spatial control to diffusion models like Stable Diffusion and FLUX. It lets you guide image generation using reference images — edges, poses, depth maps, or sketches — while the text prompt handles the content and style.

Without ControlNet, text-to-image generation is a gamble. You can say “a woman standing in a park” but you can’t control which pose, which angle, or which composition. ControlNet solves this by accepting a conditioning image (like a skeleton pose or edge map) that defines the spatial structure, while the diffusion model fills in the details.

The original ControlNet paper (Zhang et al., 2023) introduced the concept for Stable Diffusion, and the approach has since been adapted for FLUX and other diffusion architectures. It’s become an essential tool for professional AI image work where “close enough” isn’t good enough.

How It Works

Standard diffusion (text only):
  "A woman dancing in a garden" → [unpredictable pose and composition]

With ControlNet:
  Text:    "A woman dancing in a garden"
  Control: [skeleton pose image] or [edge map] or [depth map]

  Result:  Image matches the exact pose/structure of the control image
           + follows the text prompt for content and style

Architecture

┌───────────────────────────────────────────────┐
│           Diffusion Model (frozen)            │
│   Text prompt ──→ Cross-attention layers      │
│                         ↑                     │
│         Zero-convolution connections          │
│                         ↑                     │
│            ┌─────────────────────┐            │
│            │  ControlNet (copy)  │            │
│            │  Trainable encoder  │            │
│            │  block              │            │
│            │          ↑          │            │
│            │  Control image      │            │
│            │  (pose/edge/depth)  │            │
│            └─────────────────────┘            │
└───────────────────────────────────────────────┘

Key insight: ControlNet is a trainable copy of the diffusion model's
encoder blocks. It injects spatial information via zero-convolution
layers, so it starts with zero influence and gradually learns
the right amount of control.
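The zero-convolution trick can be illustrated with a minimal sketch (this is an assumption-level illustration, not the actual diffusers internals): a 1×1 convolution whose weights and bias are initialized to zero, so adding its output to the frozen model's features changes nothing at the start of training.

```python
import torch
import torch.nn as nn

def zero_conv(channels: int) -> nn.Conv2d:
    """A 1x1 convolution initialized to all zeros (a 'zero convolution')."""
    conv = nn.Conv2d(channels, channels, kernel_size=1)
    nn.init.zeros_(conv.weight)
    nn.init.zeros_(conv.bias)
    return conv

conv = zero_conv(4)
control_features = torch.randn(1, 4, 8, 8)   # from the ControlNet branch
frozen_features = torch.randn(1, 4, 8, 8)    # from the frozen diffusion model

# At initialization the control branch contributes exactly zero,
# so the frozen model's behavior is untouched:
out = frozen_features + conv(control_features)
assert torch.equal(out, frozen_features)
```

As training updates the conv weights away from zero, the control signal gradually gains influence, which is why ControlNet training doesn't destabilize the pretrained model.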

Control Types

| Control Type | Input | What It Controls | Use Case |
|---|---|---|---|
| Canny Edge | Edge-detected image | Outlines and shapes | Preserving composition |
| OpenPose | Skeleton keypoints | Body pose and position | Character posing |
| Depth | Depth map (MiDaS) | Spatial depth/perspective | Scene composition |
| Scribble | Hand-drawn sketch | Rough layout | Quick concept art |
| Segmentation | Semantic map | Region layout | Scene design |
| Normal Map | Surface normals | 3D surface detail | Product shots |
| Lineart | Clean line drawing | Precise outlines | Illustration, coloring |
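Whatever the control type, the conditioning input ultimately becomes an ordinary 8-bit RGB image. As a hedged sketch (the normalization convention here is an assumption, and the raw depth array stands in for the output of a MiDaS-style estimator), this is roughly how a raw depth map gets turned into a conditioning image:

```python
import numpy as np
from PIL import Image

def depth_to_control_image(depth: np.ndarray) -> Image.Image:
    """Normalize a raw depth array to [0, 255] and return an RGB image
    suitable as a ControlNet conditioning input."""
    d = depth.astype(np.float64)
    d = (d - d.min()) / (d.max() - d.min() + 1e-8)  # scale to [0, 1]
    return Image.fromarray((d * 255).astype(np.uint8)).convert("RGB")

# Stand-in for a depth estimator's output (hypothetical values):
depth = np.random.rand(64, 64) * 10.0
control_image = depth_to_control_image(depth)
```

Edge maps, pose skeletons, and segmentation maps go through analogous preprocessing; libraries like controlnet_aux bundle the common detectors.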

Example

# ControlNet with Stable Diffusion using diffusers
from diffusers import StableDiffusionControlNetPipeline, ControlNetModel
from diffusers.utils import load_image
import torch

# Load a ControlNet model (e.g., for Canny edges)
controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-canny",
    torch_dtype=torch.float16
)

# Create the pipeline
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    controlnet=controlnet,
    torch_dtype=torch.float16,
).to("cuda")

# Prepare control image (Canny edge detection)
import cv2
import numpy as np
from PIL import Image

original = load_image("reference_photo.png")
image_array = np.array(original)

# Extract edges, then stack to 3 channels —
# ControlNet expects an RGB conditioning image
edges = cv2.Canny(image_array, 100, 200)
edges = np.stack([edges] * 3, axis=-1)
control_image = Image.fromarray(edges)

# Generate with ControlNet
result = pipe(
    prompt="a beautiful anime character, studio lighting, detailed",
    image=control_image,
    num_inference_steps=30,
    controlnet_conditioning_scale=0.8,  # 0.0 = ignore control, 1.0 = strict
).images[0]

result.save("controlnet_output.png")

# OpenPose ControlNet — control body poses
from controlnet_aux import OpenposeDetector

# Extract pose from a reference image
pose_detector = OpenposeDetector.from_pretrained("lllyasviel/ControlNet")
reference = load_image("dancer.png")
pose_image = pose_detector(reference)

# The pose_image is a skeleton overlay — stick figure showing the pose
# Now generate a new image with the same pose
controlnet_pose = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-openpose",
    torch_dtype=torch.float16
)

pipe_pose = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    controlnet=controlnet_pose,
    torch_dtype=torch.float16,
).to("cuda")

# Different prompt, same pose
result = pipe_pose(
    prompt="a robot dancing in a futuristic city, cyberpunk style",
    image=pose_image,
    num_inference_steps=30,
).images[0]

result.save("robot_same_pose.png")
# → A cyberpunk robot in the exact same dancing pose as the reference

ControlNet vs Other Control Methods

| Method | Precision | Flexibility | Complexity |
|---|---|---|---|
| Text prompt only | Low | High | Simple |
| ControlNet | High | High | Medium |
| img2img | Medium | Medium | Simple |
| Inpainting | High (local) | Medium | Medium |
| IP-Adapter | Medium (style) | High | Medium |


Part of the DeepRaft Glossary — AI and ML terms explained for developers.