
Diffusion Model


Definition

Diffusion models are a class of generative AI models that create images (or other data) by learning to reverse a gradual noising process. They’re the technology behind Stable Diffusion, FLUX, Midjourney, and DALL-E.

The core idea is elegant: take a clean image, gradually add random noise until it’s pure static, then train a neural network to reverse each step. Once trained, you can start from random noise and iteratively “denoise” it into a coherent image — guided by a text prompt.

Diffusion models have dominated image generation since 2022, producing higher-quality and more controllable results than the GANs (Generative Adversarial Networks) that preceded them. They’re also being applied to video (Runway, Sora), audio (Stable Audio), and 3D generation.

How It Works

TRAINING (Forward Process — add noise):

Step 0        Step 250       Step 500       Step 750       Step 1000
[Clean img] → [Slightly    → [Noisy      → [Very noisy → [Pure
               noisy]         image]         image]        noise]

The model learns to predict the noise added at each step.
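The forward process has a convenient closed form: you can jump straight to any noise level without simulating every step. Here is a toy NumPy sketch of adding noise and computing the noise-prediction training loss; the `alpha_bar` schedule and the "predicted" noise are illustrative stand-ins, not a real trained network:

```python
import numpy as np

rng = np.random.default_rng(0)

# Cumulative signal-retention values alpha_bar per timestep
# (illustrative; real models use a carefully tuned schedule)
T = 1000
alpha_bar = np.linspace(0.9999, 0.0001, T)

def add_noise(x0, t):
    """Forward process q(x_t | x_0) in closed form."""
    noise = rng.standard_normal(x0.shape)
    xt = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1 - alpha_bar[t]) * noise
    return xt, noise

x0 = rng.standard_normal((64, 64))          # stand-in for a clean image
xt, true_noise = add_noise(x0, t=500)

# Training target: the network predicts `true_noise` from (xt, t);
# the loss is mean squared error between predicted and true noise.
predicted_noise = true_noise + 0.1 * rng.standard_normal(xt.shape)  # fake prediction
loss = np.mean((predicted_noise - true_noise) ** 2)
```

Because any noise level is reachable in one jump, training can sample random timesteps rather than walking through all 1000 steps per image.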

GENERATION (Reverse Process — remove noise):

Step 1000     Step 750       Step 500       Step 250       Step 0
[Random    → [Vague       → [Rough      → [Detailed  → [Final
 noise]       shapes]        image]         image]        image]

Starting from noise, the model removes noise step by step,
guided by the text prompt at each step.
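The reverse loop can be sketched as a deterministic (DDIM-style) sampler. Everything here is a toy assumption: `fake_denoiser` stands in for the trained network, and the schedule is illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
T = 50                                   # samplers use far fewer steps than training
alpha_bar = np.linspace(0.999, 0.01, T)  # illustrative schedule

def fake_denoiser(xt, t):
    # Stand-in for the trained network; a real model predicts the
    # noise from (x_t, t, text embedding).
    return xt * 0.1

x = rng.standard_normal((64, 64))        # start from pure noise
for t in reversed(range(T)):
    eps = fake_denoiser(x, t)
    # Estimate the clean image x_0 from the noise prediction,
    # then step to the slightly-less-noisy x_{t-1}.
    x0_hat = (x - np.sqrt(1 - alpha_bar[t]) * eps) / np.sqrt(alpha_bar[t])
    if t > 0:
        x = np.sqrt(alpha_bar[t - 1]) * x0_hat + np.sqrt(1 - alpha_bar[t - 1]) * eps
    else:
        x = x0_hat
```

The same loop structure underlies real samplers; they differ mainly in the schedule and in how the step to `x_{t-1}` is computed.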

Latent Diffusion (used by Stable Diffusion and FLUX)

Working directly on full-resolution images is expensive. Latent diffusion models first compress the image into a smaller latent space using a VAE (Variational Autoencoder), run diffusion there, then decode back:

Text: "A cat wearing sunglasses"

Text Encoder (CLIP/T5) → text embeddings

Diffusion in latent space (64x64 instead of 512x512)
  ├─ Noise → denoise → denoise → ... → clean latent
  └─ Text embeddings guide each denoising step (cross-attention)

VAE Decoder → full-resolution image (512x512 or higher)

This makes training and inference roughly 10-100x more efficient than pixel-space diffusion.
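The efficiency gain is easy to see from the tensor sizes alone. Using the standard Stable Diffusion configuration (8x spatial compression, 4 latent channels):

```python
# Rough cost comparison: pixel-space vs latent-space diffusion.
pixel_elems  = 512 * 512 * 3          # RGB image tensor
latent_elems = 64 * 64 * 4            # latent tensor after VAE encoding
ratio = pixel_elems // latent_elems   # elements processed per denoising step
print(ratio)  # → 48
```

Each denoising step touches ~48x fewer values, and since attention cost grows faster than linearly with resolution, the real-world savings are larger still.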

Why It Matters

Major Diffusion Models (as of 2026)

| Model | Creator | Open Source | Strengths |
|---|---|---|---|
| FLUX | Black Forest Labs | Yes | Text rendering, photorealism |
| Stable Diffusion 3/XL | Stability AI | Yes | Ecosystem, ControlNet support |
| Midjourney v6 | Midjourney | No | Artistic quality, aesthetics |
| DALL-E 3 | OpenAI | No | Prompt adherence, ChatGPT integration |
| Imagen 3 | Google | No | Photorealism, text rendering |

Example

# Generate an image with Stable Diffusion XL using the diffusers library
from diffusers import StableDiffusionXLPipeline
import torch

# Load the model (SDXL checkpoints need the XL pipeline class)
pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16,
)
pipe = pipe.to("cuda")

# Generate an image
image = pipe(
    prompt="A photorealistic cat wearing tiny sunglasses, golden hour lighting",
    negative_prompt="blurry, low quality, distorted",
    num_inference_steps=30,   # more steps = better quality, slower
    guidance_scale=7.5,       # how closely to follow the prompt
).images[0]

image.save("cat_sunglasses.png")
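The `guidance_scale` parameter controls classifier-free guidance: at each denoising step the model predicts noise twice, once with the text conditioning and once without, then extrapolates toward the conditioned prediction. A minimal sketch of the combination rule (the array values are illustrative):

```python
import numpy as np

def classifier_free_guidance(eps_uncond, eps_cond, guidance_scale):
    # Push the prediction away from the unconditional one and toward
    # the text-conditioned one; scale > 1 strengthens prompt adherence.
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)

eps_uncond = np.array([0.0, 0.0])   # noise prediction with empty prompt
eps_cond   = np.array([1.0, -1.0])  # noise prediction with the text prompt
guided = classifier_free_guidance(eps_uncond, eps_cond, 7.5)
print(guided)  # → [ 7.5 -7.5]
```

Higher scales follow the prompt more literally but can oversaturate or distort the image, which is why values in the 5-10 range are common defaults.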

# Using FLUX via the Replicate API
import replicate

output = replicate.run(
    "black-forest-labs/flux-1.1-pro",
    input={
        "prompt": "A photorealistic cat wearing tiny sunglasses, golden hour lighting",
        "aspect_ratio": "1:1",
        "output_quality": 90,
    }
)

# output is a URL to the generated image
print(output)

Diffusion vs GANs vs Autoregressive

| | Diffusion | GAN | Autoregressive |
|---|---|---|---|
| Quality | Excellent | Good | Excellent |
| Diversity | High | Mode collapse risk | High |
| Training | Stable | Notoriously unstable | Stable |
| Speed | Slow (many steps) | Fast (single pass) | Very slow (token by token) |
| Control | Excellent (ControlNet, guidance) | Limited | Good (via prompting) |

Key Takeaways

- Diffusion models generate images by learning to reverse a gradual noising process: start from random noise and denoise step by step, guided by a text prompt.
- Latent diffusion (Stable Diffusion, FLUX) runs this process in a compressed VAE latent space, making training and inference far cheaper than pixel-space diffusion.
- Compared to GANs, diffusion models train more stably, offer better control (ControlNet, guidance), and avoid mode collapse, at the cost of slower multi-step generation.

Part of the DeepRaft Glossary — AI and ML terms explained for developers.