ControlNet

Learn what ControlNet means in AI and machine learning, with examples and related concepts.

Definition

ControlNet is a neural network architecture that adds precise spatial control to diffusion models like Stable Diffusion and FLUX. It lets you guide image generation using reference images — edges, poses, depth maps, or sketches — while the text prompt handles the content and style.

Without ControlNet, text-to-image generation is a gamble. You can say “a woman standing in a park” but you can’t control which pose, which angle, or which composition. ControlNet solves this by accepting a conditioning image (like a skeleton pose or edge map) that defines the spatial structure, while the diffusion model fills in the details.

The original ControlNet paper (Zhang et al., 2023) introduced the concept for Stable Diffusion, and the approach has since been adapted for FLUX and other diffusion architectures. It’s become an essential tool for professional AI image work where “close enough” isn’t good enough.

How It Works

Standard diffusion (text only):
  "A woman dancing in a garden" → [unpredictable pose and composition]

With ControlNet:
  Text:    "A woman dancing in a garden"
  Control: [skeleton pose image] or [edge map] or [depth map]

  Result:  Image matches the exact pose/structure of the control image
           + follows the text prompt for content and style

Architecture

┌───────────────────────────────────────────────┐
│           Diffusion Model (frozen)            │
│   Text prompt ──→ Cross-attention layers      │
│                         ↑                     │
│         Zero-convolution connections          │
│                         ↑                     │
│            ┌─────────────────────┐            │
│            │  ControlNet (copy)  │            │
│            │  Trainable encoder  │            │
│            │  block              │            │
│            │          ↑          │            │
│            │  Control image      │            │
│            │  (pose/edge/depth)  │            │
│            └─────────────────────┘            │
└───────────────────────────────────────────────┘

Key insight: ControlNet is a trainable copy of the diffusion model's
encoder blocks. It injects spatial information via zero-convolution
layers, so it starts with zero influence and gradually learns
the right amount of control.
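The zero-convolution trick can be illustrated with a minimal sketch (this is an assumption-level illustration, not the actual diffusers internals): a 1×1 convolution whose weights and bias are initialized to zero, so adding its output to the frozen model's features changes nothing at the start of training.

```python
import torch
import torch.nn as nn

def zero_conv(channels: int) -> nn.Conv2d:
    """A 1x1 convolution initialized to all zeros (a 'zero convolution')."""
    conv = nn.Conv2d(channels, channels, kernel_size=1)
    nn.init.zeros_(conv.weight)
    nn.init.zeros_(conv.bias)
    return conv

conv = zero_conv(4)
control_features = torch.randn(1, 4, 8, 8)   # from the ControlNet branch
frozen_features = torch.randn(1, 4, 8, 8)    # from the frozen diffusion model

# At initialization the control branch contributes exactly zero,
# so the frozen model's behavior is untouched:
out = frozen_features + conv(control_features)
assert torch.equal(out, frozen_features)
```

As training updates the conv weights away from zero, the control signal gradually gains influence, which is why ControlNet training doesn't destabilize the pretrained model.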

Control Types

| Control Type | Input | What It Controls | Use Case |
|---|---|---|---|
| Canny Edge | Edge-detected image | Outlines and shapes | Preserving composition |
| OpenPose | Skeleton keypoints | Body pose and position | Character posing |
| Depth | Depth map (MiDaS) | Spatial depth/perspective | Scene composition |
| Scribble | Hand-drawn sketch | Rough layout | Quick concept art |
| Segmentation | Semantic map | Region layout | Scene design |
| Normal Map | Surface normals | 3D surface detail | Product shots |
| Lineart | Clean line drawing | Precise outlines | Illustration, coloring |
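Whatever the control type, the conditioning input ultimately becomes an ordinary 8-bit RGB image. As a hedged sketch (the normalization convention here is an assumption, and the raw depth array stands in for the output of a MiDaS-style estimator), this is roughly how a raw depth map gets turned into a conditioning image:

```python
import numpy as np
from PIL import Image

def depth_to_control_image(depth: np.ndarray) -> Image.Image:
    """Normalize a raw depth array to [0, 255] and return an RGB image
    suitable as a ControlNet conditioning input."""
    d = depth.astype(np.float64)
    d = (d - d.min()) / (d.max() - d.min() + 1e-8)  # scale to [0, 1]
    return Image.fromarray((d * 255).astype(np.uint8)).convert("RGB")

# Stand-in for a depth estimator's output (hypothetical values):
depth = np.random.rand(64, 64) * 10.0
control_image = depth_to_control_image(depth)
```

Edge maps, pose skeletons, and segmentation maps go through analogous preprocessing; libraries like controlnet_aux bundle the common detectors.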

Example

# ControlNet with Stable Diffusion using diffusers
from diffusers import StableDiffusionControlNetPipeline, ControlNetModel
from diffusers.utils import load_image
import torch

# Load a ControlNet model (e.g., for Canny edges)
controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-canny",
    torch_dtype=torch.float16
)

# Create the pipeline
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    controlnet=controlnet,
    torch_dtype=torch.float16,
).to("cuda")

# Prepare control image (Canny edge detection)
import cv2
import numpy as np
from PIL import Image

original = load_image("reference_photo.png")
image_array = np.array(original)

# Extract edges, then stack to 3 channels —
# ControlNet expects an RGB conditioning image
edges = cv2.Canny(image_array, 100, 200)
edges = np.stack([edges] * 3, axis=-1)
control_image = Image.fromarray(edges)

# Generate with ControlNet
result = pipe(
    prompt="a beautiful anime character, studio lighting, detailed",
    image=control_image,
    num_inference_steps=30,
    controlnet_conditioning_scale=0.8,  # 0.0 = ignore control, 1.0 = strict
).images[0]

result.save("controlnet_output.png")

# OpenPose ControlNet — control body poses
from controlnet_aux import OpenposeDetector

# Extract pose from a reference image
pose_detector = OpenposeDetector.from_pretrained("lllyasviel/ControlNet")
reference = load_image("dancer.png")
pose_image = pose_detector(reference)

# The pose_image is a skeleton overlay — stick figure showing the pose
# Now generate a new image with the same pose
controlnet_pose = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-openpose",
    torch_dtype=torch.float16
)

pipe_pose = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    controlnet=controlnet_pose,
    torch_dtype=torch.float16,
).to("cuda")

# Different prompt, same pose
result = pipe_pose(
    prompt="a robot dancing in a futuristic city, cyberpunk style",
    image=pose_image,
    num_inference_steps=30,
).images[0]

result.save("robot_same_pose.png")
# → A cyberpunk robot in the exact same dancing pose as the reference

ControlNet vs Other Control Methods

| Method | Precision | Flexibility | Complexity |
|---|---|---|---|
| Text prompt only | Low | High | Simple |
| ControlNet | High | High | Medium |
| img2img | Medium | Medium | Simple |
| Inpainting | High (local) | Medium | Medium |
| IP-Adapter | Medium (style) | High | Medium |


Part of the DeepRaft Glossary — AI and ML terms explained for developers.