Diffusion Model
Learn what Diffusion Model means in AI and machine learning, with examples and related concepts.
Definition
Diffusion models are a class of generative AI models that create images (or other data) by learning to reverse a gradual noising process. They’re the technology behind Stable Diffusion, FLUX, Midjourney, and DALL-E.
The core idea is elegant: take a clean image, gradually add random noise until it’s pure static, then train a neural network to reverse each step. Once trained, you can start from random noise and iteratively “denoise” it into a coherent image — guided by a text prompt.
Diffusion models have dominated image generation since 2022, producing higher-quality and more controllable results than the GANs (Generative Adversarial Networks) that preceded them. They’re also being applied to video (Runway, Sora), audio (Stable Audio), and 3D generation.
How It Works
TRAINING (Forward Process — add noise):
Step 0        Step 250           Step 500  Step 750       Step 1000
[Clean img] → [Slightly noisy] → [Noisy] → [Very noisy] → [Pure noise]
The model learns to predict the noise added at each step.
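The forward process can be sketched in a few lines of numpy. This is a minimal illustration assuming a DDPM-style linear beta schedule (real models use tuned schedules); a convenient closed form lets you jump straight to any noise level without looping:

```python
import numpy as np

# Minimal sketch of the forward (noising) process with a DDPM-style
# linear beta schedule. Shapes and values are illustrative.
T = 1000
betas = np.linspace(1e-4, 0.02, T)      # noise variance added at each step
alphas_bar = np.cumprod(1.0 - betas)    # fraction of original signal remaining

def add_noise(x0, t, rng):
    """Noise a clean image straight to step t (closed form, no loop needed)."""
    eps = rng.standard_normal(x0.shape)
    x_t = np.sqrt(alphas_bar[t]) * x0 + np.sqrt(1.0 - alphas_bar[t]) * eps
    return x_t, eps                     # eps is what the network learns to predict

rng = np.random.default_rng(0)
x0 = np.ones((8, 8))                    # stand-in for a clean image
x_mid, _ = add_noise(x0, 500, rng)      # partly noisy
x_end, _ = add_noise(x0, 999, rng)      # essentially pure noise
print(alphas_bar[500], alphas_bar[999]) # signal fraction shrinking toward 0
```

The returned `eps` is the training target: the network sees `x_t` and the step `t` and must predict the noise that was mixed in.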
GENERATION (Reverse Process — remove noise):
Step 1000        Step 750         Step 500        Step 250           Step 0
[Random noise] → [Vague shapes] → [Rough image] → [Detailed image] → [Final image]
Starting from noise, the model removes noise step by step,
guided by the text prompt at each step.
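The sampling loop above can be sketched with a DDPM-style update rule. The noise predictor here is a dummy placeholder (a real model is a trained, prompt-conditioned U-Net or diffusion transformer), so the output is structurally correct but not a meaningful image:

```python
import numpy as np

# Sketch of the reverse (sampling) loop with a DDPM-style update.
# predict_noise is a stand-in for the trained network.
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alphas_bar = np.cumprod(alphas)

def predict_noise(x_t, t):
    return np.zeros_like(x_t)  # placeholder; a real model predicts eps(x_t, t, prompt)

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 8))      # step 1000: start from pure noise
for t in reversed(range(T)):         # denoise step by step toward step 0
    eps = predict_noise(x, t)
    # Remove the predicted noise component, rescale the remaining signal
    x = (x - betas[t] / np.sqrt(1.0 - alphas_bar[t]) * eps) / np.sqrt(alphas[t])
    if t > 0:                        # inject fresh noise except at the final step
        x += np.sqrt(betas[t]) * rng.standard_normal(x.shape)
print(x.shape)
```

In a real pipeline the prompt enters through `predict_noise`, so every one of the 1000 (or, with modern samplers, 20-50) steps is steered by the text.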
Latent Diffusion (used by Stable Diffusion and FLUX)
Working directly on full-resolution images is expensive. Latent diffusion models first compress the image into a smaller latent space using a VAE (Variational Autoencoder), run diffusion there, then decode back:
Text: "A cat wearing sunglasses"
↓
Text Encoder (CLIP/T5) → text embeddings
↓
Diffusion in latent space (64x64 instead of 512x512)
├─ Noise → denoise → denoise → ... → clean latent
└─ Text embeddings guide each denoising step (cross-attention)
↓
VAE Decoder → full-resolution image (512x512 or higher)
This makes training and inference 10-100x more efficient.
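The gain is easy to ballpark with the sizes from the diagram: the network diffuses over a 64x64 latent with 4 channels instead of 512x512 RGB pixels (channel counts here reflect Stable Diffusion's defaults and are illustrative):

```python
# Back-of-envelope comparison of pixel-space vs latent-space diffusion.
pixel_elems = 512 * 512 * 3        # values per denoising step in pixel space
latent_elems = 64 * 64 * 4         # values per step after 8x VAE downsampling
print(pixel_elems / latent_elems)  # 48.0
```

A ~48x reduction in values per step, applied over every denoising step of training and inference, is where the 10-100x overall savings comes from.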
Why It Matters
- Image quality — Diffusion models produce the highest-quality AI-generated images available today
- Controllability — Techniques like ControlNet let you guide generation with poses, edges, or depth maps
- Customization — LoRA and fine-tuning let you teach models new styles or subjects with just 20-50 images
- Open source — Stable Diffusion and FLUX are open-source, enabling a massive ecosystem of tools and modifications
- Beyond images — The same architecture now generates video, audio, and 3D models
Major Diffusion Models (as of 2026)
| Model | Creator | Open Source | Strengths |
|---|---|---|---|
| FLUX | Black Forest Labs | Yes | Text rendering, photorealism |
| Stable Diffusion 3/XL | Stability AI | Yes | Ecosystem, ControlNet support |
| Midjourney v6 | Midjourney | No | Artistic quality, aesthetics |
| DALL-E 3 | OpenAI | No | Prompt adherence, ChatGPT integration |
| Imagen 3 | Google DeepMind | No | Photorealism, text rendering |
Example
```python
# Generate an image with Stable Diffusion XL using the diffusers library
from diffusers import StableDiffusionXLPipeline
import torch

# Load the model
pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16,
)
pipe = pipe.to("cuda")

# Generate an image
image = pipe(
    prompt="A photorealistic cat wearing tiny sunglasses, golden hour lighting",
    negative_prompt="blurry, low quality, distorted",
    num_inference_steps=30,  # more steps = better quality, slower
    guidance_scale=7.5,      # how closely to follow the prompt
).images[0]
image.save("cat_sunglasses.png")
```
```python
# Using FLUX via the Replicate API
import replicate

output = replicate.run(
    "black-forest-labs/flux-1.1-pro",
    input={
        "prompt": "A photorealistic cat wearing tiny sunglasses, golden hour lighting",
        "aspect_ratio": "1:1",
        "output_quality": 90,
    },
)

# output is a URL to the generated image
print(output)
```
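The `guidance_scale` parameter in the examples above controls classifier-free guidance: at each step the model predicts noise twice, with and without the prompt, and extrapolates from the unconditional prediction toward the conditional one. A toy illustration (the arrays are stand-ins for real model outputs):

```python
import numpy as np

# Classifier-free guidance: blend two noise predictions per step.
eps_uncond = np.array([0.1, 0.2])  # noise predicted without the prompt
eps_cond = np.array([0.3, 0.1])    # noise predicted with the prompt
guidance_scale = 7.5
eps = eps_uncond + guidance_scale * (eps_cond - eps_uncond)
print(eps)  # values become [1.6, -0.55]: pushed further in the prompt's direction
```

At `guidance_scale=1.0` this reduces to the conditional prediction alone; higher values follow the prompt more strictly at the cost of diversity, which is why ~7.5 is a common default.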
Diffusion vs GANs vs Autoregressive
| | Diffusion | GAN | Autoregressive |
|---|---|---|---|
| Quality | Excellent | Good | Excellent |
| Diversity | High | Mode collapse risk | High |
| Training | Stable | Notoriously unstable | Stable |
| Speed | Slow (many steps) | Fast (single pass) | Very slow (token by token) |
| Control | Excellent (ControlNet, guidance) | Limited | Good (via prompting) |
Key Takeaways
- Diffusion models generate images by learning to reverse a noise-adding process — start from noise, denoise step by step
- Latent diffusion (Stable Diffusion, FLUX) works in compressed space for 10-100x efficiency
- They replaced GANs as the dominant image generation approach due to superior quality and stability
- Key parameters: inference steps (quality vs speed), guidance scale (prompt adherence), and negative prompts
- LoRA and ControlNet extend their capabilities for custom styles and precise control
Part of the DeepRaft Glossary — AI and ML terms explained for developers.