
Diffusion Model


Definition

Diffusion models are a class of generative AI models that create images (or other data) by learning to reverse a gradual noising process. They’re the technology behind Stable Diffusion, FLUX, Midjourney, and DALL-E.

The core idea is elegant: take a clean image, gradually add random noise until it’s pure static, then train a neural network to reverse each step. Once trained, you can start from random noise and iteratively “denoise” it into a coherent image — guided by a text prompt.

Diffusion models have dominated image generation since 2022, producing higher-quality and more controllable results than the GANs (Generative Adversarial Networks) that preceded them. They’re also being applied to video (Runway, Sora), audio (Stable Audio), and 3D generation.

How It Works

TRAINING (Forward Process — add noise):

Step 0        Step 250       Step 500       Step 750       Step 1000
[Clean img] → [Slightly    → [Noisy      → [Very noisy → [Pure
               noisy]         image]         image]        noise]

The model learns to predict the noise added at each step.
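The forward process has a convenient closed form: you can jump straight to any noise level without simulating every step. Here is a toy NumPy sketch of adding noise and computing the noise-prediction training loss; the `alpha_bar` schedule and the "predicted" noise are illustrative stand-ins, not a real trained network:

```python
import numpy as np

rng = np.random.default_rng(0)

# Cumulative signal-retention values alpha_bar per timestep
# (illustrative; real models use a carefully tuned schedule)
T = 1000
alpha_bar = np.linspace(0.9999, 0.0001, T)

def add_noise(x0, t):
    """Forward process q(x_t | x_0) in closed form."""
    noise = rng.standard_normal(x0.shape)
    xt = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1 - alpha_bar[t]) * noise
    return xt, noise

x0 = rng.standard_normal((64, 64))          # stand-in for a clean image
xt, true_noise = add_noise(x0, t=500)

# Training target: the network predicts `true_noise` from (xt, t);
# the loss is mean squared error between predicted and true noise.
predicted_noise = true_noise + 0.1 * rng.standard_normal(xt.shape)  # fake prediction
loss = np.mean((predicted_noise - true_noise) ** 2)
```

Because any noise level is reachable in one jump, training can sample random timesteps rather than walking through all 1000 steps per image.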

GENERATION (Reverse Process — remove noise):

Step 1000     Step 750       Step 500       Step 250       Step 0
[Random    → [Vague       → [Rough      → [Detailed  → [Final
 noise]       shapes]        image]         image]        image]

Starting from noise, the model removes noise step by step,
guided by the text prompt at each step.
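The reverse loop can be sketched as a deterministic (DDIM-style) sampler. Everything here is a toy assumption: `fake_denoiser` stands in for the trained network, and the schedule is illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
T = 50                                   # samplers use far fewer steps than training
alpha_bar = np.linspace(0.999, 0.01, T)  # illustrative schedule

def fake_denoiser(xt, t):
    # Stand-in for the trained network; a real model predicts the
    # noise from (x_t, t, text embedding).
    return xt * 0.1

x = rng.standard_normal((64, 64))        # start from pure noise
for t in reversed(range(T)):
    eps = fake_denoiser(x, t)
    # Estimate the clean image x_0 from the noise prediction,
    # then step to the slightly-less-noisy x_{t-1}.
    x0_hat = (x - np.sqrt(1 - alpha_bar[t]) * eps) / np.sqrt(alpha_bar[t])
    if t > 0:
        x = np.sqrt(alpha_bar[t - 1]) * x0_hat + np.sqrt(1 - alpha_bar[t - 1]) * eps
    else:
        x = x0_hat
```

The same loop structure underlies real samplers; they differ mainly in the schedule and in how the step to `x_{t-1}` is computed.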

Latent Diffusion (used by Stable Diffusion and FLUX)

Working directly on full-resolution images is expensive. Latent diffusion models first compress the image into a smaller latent space using a VAE (Variational Autoencoder), run diffusion there, then decode back:

Text: "A cat wearing sunglasses"

Text Encoder (CLIP/T5) → text embeddings

Diffusion in latent space (64x64 instead of 512x512)
  ├─ Noise → denoise → denoise → ... → clean latent
  └─ Text embeddings guide each denoising step (cross-attention)

VAE Decoder → full-resolution image (512x512 or higher)

This makes training and inference roughly 10-100x more efficient than pixel-space diffusion.
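The efficiency gain is easy to see from the tensor sizes alone. Using the standard Stable Diffusion configuration (8x spatial compression, 4 latent channels):

```python
# Rough cost comparison: pixel-space vs latent-space diffusion.
pixel_elems  = 512 * 512 * 3          # RGB image tensor
latent_elems = 64 * 64 * 4            # latent tensor after VAE encoding
ratio = pixel_elems // latent_elems   # elements processed per denoising step
print(ratio)  # → 48
```

Each denoising step touches ~48x fewer values, and since attention cost grows faster than linearly with resolution, the real-world savings are larger still.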

Why It Matters

Major Diffusion Models (as of 2026)

| Model | Creator | Open Source | Strengths |
|---|---|---|---|
| FLUX | Black Forest Labs | Yes | Text rendering, photorealism |
| Stable Diffusion 3/XL | Stability AI | Yes | Ecosystem, ControlNet support |
| Midjourney v6 | Midjourney | No | Artistic quality, aesthetics |
| DALL-E 3 | OpenAI | No | Prompt adherence, ChatGPT integration |
| Imagen 3 | Google | No | Photorealism, text rendering |

Example

# Generate an image with Stable Diffusion XL using the diffusers library
from diffusers import StableDiffusionXLPipeline
import torch

# Load the model (SDXL checkpoints need the XL pipeline class)
pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16,
)
pipe = pipe.to("cuda")

# Generate an image
image = pipe(
    prompt="A photorealistic cat wearing tiny sunglasses, golden hour lighting",
    negative_prompt="blurry, low quality, distorted",
    num_inference_steps=30,   # more steps = better quality, slower
    guidance_scale=7.5,       # how closely to follow the prompt
).images[0]

image.save("cat_sunglasses.png")
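The `guidance_scale` parameter controls classifier-free guidance: at each denoising step the model predicts noise twice, once with the text conditioning and once without, then extrapolates toward the conditioned prediction. A minimal sketch of the combination rule (the array values are illustrative):

```python
import numpy as np

def classifier_free_guidance(eps_uncond, eps_cond, guidance_scale):
    # Push the prediction away from the unconditional one and toward
    # the text-conditioned one; scale > 1 strengthens prompt adherence.
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)

eps_uncond = np.array([0.0, 0.0])   # noise prediction with empty prompt
eps_cond   = np.array([1.0, -1.0])  # noise prediction with the text prompt
guided = classifier_free_guidance(eps_uncond, eps_cond, 7.5)
print(guided)  # → [ 7.5 -7.5]
```

Higher scales follow the prompt more literally but can oversaturate or distort the image, which is why values in the 5-10 range are common defaults.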

# Using FLUX via the Replicate API
import replicate

output = replicate.run(
    "black-forest-labs/flux-1.1-pro",
    input={
        "prompt": "A photorealistic cat wearing tiny sunglasses, golden hour lighting",
        "aspect_ratio": "1:1",
        "output_quality": 90,
    }
)

# output is a URL to the generated image
print(output)

Diffusion vs GANs vs Autoregressive

| | Diffusion | GAN | Autoregressive |
|---|---|---|---|
| Quality | Excellent | Good | Excellent |
| Diversity | High | Mode collapse risk | High |
| Training | Stable | Notoriously unstable | Stable |
| Speed | Slow (many steps) | Fast (single pass) | Very slow (token by token) |
| Control | Excellent (ControlNet, guidance) | Limited | Good (via prompting) |

Key Takeaways

- Diffusion models generate images by learning to reverse a gradual noising process: start from random noise and denoise step by step, guided by a text prompt.
- Latent diffusion (Stable Diffusion, FLUX) runs this process in a compressed VAE latent space, making training and inference far cheaper than pixel-space diffusion.
- Compared to GANs, diffusion models train more stably, offer better control (ControlNet, guidance), and avoid mode collapse, at the cost of slower multi-step generation.

Part of the DeepRaft Glossary — AI and ML terms explained for developers.