ControlNet
Learn what ControlNet means in AI and machine learning, with examples and related concepts.
Definition
ControlNet is a neural network architecture that adds precise spatial control to diffusion models like Stable Diffusion and FLUX. It lets you guide image generation using reference images — edges, poses, depth maps, or sketches — while the text prompt handles the content and style.
Without ControlNet, text-to-image generation is a gamble. You can say “a woman standing in a park” but you can’t control which pose, which angle, or which composition. ControlNet solves this by accepting a conditioning image (like a skeleton pose or edge map) that defines the spatial structure, while the diffusion model fills in the details.
The original ControlNet paper (Zhang et al., 2023) introduced the concept for Stable Diffusion, and the approach has since been adapted for FLUX and other diffusion architectures. It’s become an essential tool for professional AI image work where “close enough” isn’t good enough.
How It Works
Standard diffusion (text only):

```
"A woman dancing in a garden" → [unpredictable pose and composition]
```

With ControlNet:

```
Text:    "A woman dancing in a garden"
Control: [skeleton pose image] or [edge map] or [depth map]
   ↓
Result:  image matches the exact pose/structure of the control image
         and follows the text prompt for content and style
```
Architecture
```
┌──────────────────────────────────────────────┐
│  Diffusion Model (frozen)                    │
│  Text prompt ──→ Cross-attention layers      │
│                            ↑                 │
│               Zero-convolution connections   │
│                            ↑                 │
│                 ┌──────────────────────┐     │
│                 │  ControlNet (copy)   │     │
│                 │  Trainable encoder   │     │
│                 │  blocks              │     │
│                 │          ↑           │     │
│                 │  Control image       │     │
│                 │  (pose/edge/depth)   │     │
│                 └──────────────────────┘     │
└──────────────────────────────────────────────┘
```
Key insight: ControlNet is a trainable copy of the diffusion model's
encoder blocks. It injects spatial information via zero-convolution
layers, so it starts with zero influence and gradually learns
the right amount of control.
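The zero-convolution idea can be sketched in a few lines of NumPy. This is a simplified stand-in for the real architecture — random arrays in place of actual feature maps, and a plain 1×1 convolution in place of the trained layers — but it shows why a zero-initialized connection leaves the frozen model untouched at the start of training:

```python
import numpy as np

rng = np.random.default_rng(0)

def conv1x1(x, w, b):
    """A 1x1 convolution over an (H, W, C_in) feature map."""
    return x @ w + b

# Stand-in for one frozen encoder block's output
features = rng.normal(size=(8, 8, 4))

# Stand-in for the trainable ControlNet branch's output,
# fed back through a zero-initialized convolution
control_features = rng.normal(size=(8, 8, 4))
zero_w = np.zeros((4, 4))  # weights start at zero
zero_b = np.zeros(4)       # bias starts at zero

injected = conv1x1(control_features, zero_w, zero_b)
output = features + injected

# At initialization the zero convolution contributes nothing, so the
# combined model behaves exactly like the frozen base model
print(np.allclose(output, features))  # → True
```

During training, gradients flow into `zero_w` and `zero_b`, so the control signal grows from zero influence toward whatever strength the data demands.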
Control Types
| Control Type | Input | What It Controls | Use Case |
|---|---|---|---|
| Canny Edge | Edge-detected image | Outlines and shapes | Preserving composition |
| OpenPose | Skeleton keypoints | Body pose and position | Character posing |
| Depth | Depth map (MiDaS) | Spatial depth/perspective | Scene composition |
| Scribble | Hand-drawn sketch | Rough layout | Quick concept art |
| Segmentation | Semantic map | Region layout | Scene design |
| Normal Map | Surface normals | 3D surface detail | Product shots |
| Lineart | Clean line drawing | Precise outlines | Illustration, coloring |
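To make the Canny Edge row concrete, here is a simplified edge-map extractor using plain NumPy gradients — a stand-in for the `cv2.Canny` detector used in real pipelines, just to show what a conditioning image encodes (white edge pixels on a black background):

```python
import numpy as np

def edge_map(gray, threshold=0.25):
    """Approximate an edge map via gradient magnitude
    (a simplified stand-in for the Canny detector)."""
    gray = gray.astype(np.float64)
    gx = np.zeros_like(gray)
    gy = np.zeros_like(gray)
    gx[:, 1:-1] = gray[:, 2:] - gray[:, :-2]  # horizontal gradient
    gy[1:-1, :] = gray[2:, :] - gray[:-2, :]  # vertical gradient
    mag = np.hypot(gx, gy)
    mag /= mag.max() + 1e-8
    return (mag > threshold).astype(np.uint8) * 255

# A bright square on a dark background: edges appear only at its border
img = np.zeros((32, 32))
img[8:24, 8:24] = 1.0
edges = edge_map(img)
print(int(edges[16, 8]), int(edges[16, 16]))  # border vs interior → 255 0
```

The resulting image carries only outlines — no color, texture, or lighting — which is exactly why the diffusion model stays free to invent those details while the composition is locked in.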
Why It Matters
- Precision — Go from “roughly what I wanted” to “exactly the pose/layout I needed”
- Consistency — Generate multiple images with the same pose or composition for animation, product shots, or branding
- Workflow integration — Designers can sketch layouts and have AI fill in the details
- Character consistency — Maintain the same character pose across multiple scenes
- Professional use — ControlNet makes AI image generation viable for commercial work where precision is required
Example
```python
# ControlNet with Stable Diffusion using diffusers
import cv2
import numpy as np
import torch
from diffusers import StableDiffusionControlNetPipeline, ControlNetModel
from diffusers.utils import load_image
from PIL import Image

# Load a ControlNet model (here, one trained on Canny edges)
controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-canny",
    torch_dtype=torch.float16,
)

# Create the pipeline around the frozen base model
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    controlnet=controlnet,
    torch_dtype=torch.float16,
).to("cuda")

# Prepare the control image with Canny edge detection
original = load_image("reference_photo.png")
image_array = np.array(original)

# Extract edges, then stack them to three channels —
# the pipeline expects an RGB control image
edges = cv2.Canny(image_array, 100, 200)
edges = np.stack([edges] * 3, axis=-1)
control_image = Image.fromarray(edges)

# Generate with ControlNet
result = pipe(
    prompt="a beautiful anime character, studio lighting, detailed",
    image=control_image,
    num_inference_steps=30,
    controlnet_conditioning_scale=0.8,  # 0.0 = ignore control, 1.0 = strict
).images[0]
result.save("controlnet_output.png")
```
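Conceptually, `controlnet_conditioning_scale` multiplies the ControlNet residuals before they are added back into the frozen model's features. The sketch below illustrates that blending with toy NumPy vectors standing in for real multi-channel feature maps:

```python
import numpy as np

# Stand-ins for one residual from the frozen model and the
# matching residual produced by the ControlNet branch
base_residual = np.array([0.2, 0.5, 0.9])
control_residual = np.array([0.6, -0.1, 0.3])

def combine(base, control, scale):
    """Scale the ControlNet residual before adding it back in."""
    return base + scale * control

# scale = 0.0 ignores the control image entirely
print(np.allclose(combine(base_residual, control_residual, 0.0),
                  base_residual))  # → True

# scale = 1.0 applies the full learned control signal
print(combine(base_residual, control_residual, 1.0))
```

Intermediate values (like the 0.8 used above) trade strict structural adherence for more creative freedom in the output.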
```python
# OpenPose ControlNet — control body poses
from controlnet_aux import OpenposeDetector

# Extract the pose from a reference image
pose_detector = OpenposeDetector.from_pretrained("lllyasviel/ControlNet")
reference = load_image("dancer.png")
pose_image = pose_detector(reference)
# pose_image is a skeleton overlay — a stick figure showing the pose

# Now generate a new image with the same pose
controlnet_pose = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-openpose",
    torch_dtype=torch.float16,
)
pipe_pose = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    controlnet=controlnet_pose,
    torch_dtype=torch.float16,
).to("cuda")

# Different prompt, same pose
result = pipe_pose(
    prompt="a robot dancing in a futuristic city, cyberpunk style",
    image=pose_image,
    num_inference_steps=30,
).images[0]
result.save("robot_same_pose.png")
# → a cyberpunk robot in the same dancing pose as the reference
```
ControlNet vs Other Control Methods
| Method | Precision | Flexibility | Complexity |
|---|---|---|---|
| Text prompt only | Low | High | Simple |
| ControlNet | High | High | Medium |
| img2img | Medium | Medium | Simple |
| Inpainting | High (local) | Medium | Medium |
| IP-Adapter | Medium (style) | High | Medium |
Key Takeaways
- ControlNet adds precise spatial control to diffusion models using reference images (poses, edges, depth maps)
- It works by injecting spatial information into the diffusion process while keeping the base model frozen
- Common control types: Canny edges, OpenPose skeletons, depth maps, scribbles
- The `controlnet_conditioning_scale` parameter (0.0–1.0) controls how strictly the output follows the reference
- Essential for professional AI image work where you need consistent, precise compositions
Part of the DeepRaft Glossary — AI and ML terms explained for developers.