[Paper Summary] Latent Diffusion Models (Stable Diffusion) | The Core Technology of AI Image Generation

Paper Info

  • Title: High-Resolution Image Synthesis with Latent Diffusion Models
  • Authors: Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, Björn Ommer
  • Published: CVPR 2022 (Oral)
  • arXiv: 2112.10752
  • Also known as: The foundational paper for Stable Diffusion

All Stable Diffusion-family models, including z-image-turbo, are built on the architecture described in this paper. It is the single most important paper for understanding AI image generation.

What is it?

A method that performs the diffusion process in a compressed latent space rather than in pixel space.

Previous diffusion models (DDPM etc.) operated in pixel space, requiring enormous compute to generate high-resolution images. LDM uses a pre-trained AutoEncoder to compress images into low-dimensional latent representations and runs the diffusion process in that latent space, drastically reducing compute while maintaining high image quality.

What makes it better than prior work?

Pixel-space diffusion models (DDPM etc.)

  • Diffusing 512×512 images directly → hundreds of GPU-days of training compute
  • High resolution is practically infeasible
  • High quality but impractical

GAN (Generative Adversarial Networks)

  • Fast generation
  • Risk of mode collapse (stuck generating only certain patterns)
  • Unstable training

LDM’s breakthrough

| Comparison | DDPM (pixel space) | GAN | LDM |
| --- | --- | --- | --- |
| Compute cost | Very high | Low | Drastically reduced |
| Image quality | High | High | High |
| Training stability | Stable | Unstable | Stable |
| Diversity | High | Mode collapse | High |
| Conditioning | Difficult | Difficult | Flexible via Cross-Attention |

LDM achieves the best of both worlds: the quality and stability of diffusion models with the computational efficiency of GANs.

What’s the core idea?

LDM’s architecture has three components.

1. Perceptual Compression: AutoEncoder

Converts images from pixel space to latent space.

Input image (512×512×3)
    ↓ Encoder E
Latent representation z (64×64×4)  ← diffusion happens here
    ↓ Decoder D
Output image (512×512×3)

  • Compression ratio: 512×512×3 = 786,432 dimensions → 64×64×4 = 16,384 dimensions (~48× compression)
  • Preserves perceptually important information while removing high-frequency noise
  • KL or VQ regularization structures the latent space

This compression drastically reduces the compute cost of the diffusion process.
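
As a rough illustration, here is a minimal sketch that round-trips an image-sized tensor through a pretrained Stable Diffusion VAE using the Hugging Face diffusers library; the checkpoint name and the random stand-in image are assumptions for illustration, not details from the paper.

```python
import torch
from diffusers import AutoencoderKL

# Assumed checkpoint: the commonly used SD VAE released by Stability AI.
vae = AutoencoderKL.from_pretrained("stabilityai/sd-vae-ft-mse")
vae.eval()

x = torch.randn(1, 3, 512, 512)  # stand-in for a normalized RGB image in [-1, 1]

with torch.no_grad():
    # Encoder E: 512×512×3 pixels -> 64×64×4 latent (f = 8)
    z = vae.encode(x).latent_dist.sample()
    print(z.shape)      # torch.Size([1, 4, 64, 64])

    # Decoder D: 64×64×4 latent -> 512×512×3 reconstruction
    x_rec = vae.decode(z).sample
    print(x_rec.shape)  # torch.Size([1, 3, 512, 512])
```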

2. Diffusion in Latent Space: U-Net

Adds noise (diffusion) and removes noise (reverse diffusion) in the compressed latent space.

Forward process (training):

z₀ (original latent)
  → z₁ (add a little noise)
  → z₂ (add more noise)
  → ...
  → z_T (pure noise)

Reverse process (generation):

z_T (random noise)
  → z_{T-1} (remove a little noise)
  → ...
  → z₀ (clean latent)
  → Decoder D → output image

At each step, the U-Net predicts the noise contained in the current noisy latent, and the sampler uses that prediction to remove a portion of it.
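
Below is a framework-free sketch of the closed-form forward step, assuming a standard DDPM-style linear beta schedule (illustrative values, not necessarily the paper's exact settings); the U-Net itself is only referenced in a comment.

```python
import torch

# Illustrative DDPM-style noise schedule; T and the beta range are assumptions.
T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)

def add_noise(z0, t, noise):
    """Forward process q(z_t | z_0): jump directly to step t in closed form."""
    a = alphas_cumprod[t].sqrt().view(-1, 1, 1, 1)
    s = (1.0 - alphas_cumprod[t]).sqrt().view(-1, 1, 1, 1)
    return a * z0 + s * noise

z0 = torch.randn(2, 4, 64, 64)      # stand-in for clean latents from the encoder
t = torch.randint(0, T, (2,))       # random timestep per sample
noise = torch.randn_like(z0)
zt = add_noise(z0, t, noise)        # noisy latents z_t

# Training objective (conceptually): the U-Net eps_theta(z_t, t, cond) predicts `noise`,
# and the loss is the mean squared error between the prediction and the true noise.
```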

3. Text Conditioning via Cross-Attention

LDM’s revolutionary contribution is enabling flexible conditioning via Cross-Attention layers.

Prompt "a Japanese woman in a cafe"
    ↓
CLIP text encoder → text embedding (77×768)
    ↓
Cross-Attention at each U-Net layer
    ↓
Text-conditioned denoising

How Cross-Attention works:

  • Query (Q): U-Net intermediate features (image-side information)
  • Key (K): Text embedding (text-side information)
  • Value (V): Text embedding

Attention(Q, K, V) = softmax(QK^T / √d) × V

This lets each spatial position in the U-Net learn which text tokens to attend to. For example, regions that attend strongly to the token "cafe" are denoised toward cafe-related content.
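
The mechanism can be sketched as a single toy attention head with randomly initialized projections; in the real model these are learned weights inside each U-Net block, and the feature width (320 below) is only an assumed example. Shapes follow the 77×768 CLIP embedding mentioned above.

```python
import torch
import torch.nn.functional as F

def cross_attention(image_feats, text_emb, d_head=64):
    """Toy single-head cross-attention: Q from U-Net features, K/V from the text embedding.

    image_feats: (B, N, C)    flattened spatial features from a U-Net block
    text_emb:    (B, 77, 768) CLIP text embedding
    """
    B, N, C = image_feats.shape
    # Random projections stand in for the learned W_q, W_k, W_v of each layer.
    W_q = torch.randn(C, d_head)
    W_k = torch.randn(text_emb.shape[-1], d_head)
    W_v = torch.randn(text_emb.shape[-1], d_head)

    Q = image_feats @ W_q                  # (B, N, d_head)
    K = text_emb @ W_k                     # (B, 77, d_head)
    V = text_emb @ W_v                     # (B, 77, d_head)

    # Each spatial position distributes attention over the 77 text tokens.
    attn = F.softmax(Q @ K.transpose(1, 2) / d_head ** 0.5, dim=-1)  # (B, N, 77)
    return attn @ V                        # (B, N, d_head) text-informed features

out = cross_attention(torch.randn(1, 64 * 64, 320), torch.randn(1, 77, 768))
print(out.shape)  # torch.Size([1, 4096, 64])
```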

For details on CLIP’s text encoder, see the CLIP Paper Summary.

Two-Phase Training

LDM is trained in two phases:

Phase 1: AutoEncoder Training

  • Learns image compression and reconstruction
  • Acquires latent space structure

Phase 2: Diffusion Model Training

  • Freezes the Phase 1 AutoEncoder
  • Learns diffusion and reverse diffusion in latent space
  • Learns text conditioning via Cross-Attention layers

This separation allows each phase to be optimized independently, stabilizing training.
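
Here is a hypothetical Phase-2 training step sketched with diffusers components; the checkpoint names, the DDPMScheduler, and the random stand-in batch are assumptions for illustration. The point is that the VAE stays frozen while only the U-Net's noise prediction is optimized.

```python
import torch
import torch.nn.functional as F
from diffusers import AutoencoderKL, UNet2DConditionModel, DDPMScheduler

# Assumed checkpoints; any SD-1.x-compatible VAE/U-Net pair works the same way.
vae = AutoencoderKL.from_pretrained("stabilityai/sd-vae-ft-mse")
unet = UNet2DConditionModel.from_pretrained("runwayml/stable-diffusion-v1-5", subfolder="unet")
scheduler = DDPMScheduler(num_train_timesteps=1000)

vae.requires_grad_(False)  # Phase 1 weights stay frozen during Phase 2

images = torch.randn(2, 3, 512, 512)   # stand-in for a normalized training batch
text_emb = torch.randn(2, 77, 768)     # stand-in for CLIP text embeddings

with torch.no_grad():
    z0 = vae.encode(images).latent_dist.sample() * vae.config.scaling_factor

noise = torch.randn_like(z0)
t = torch.randint(0, scheduler.config.num_train_timesteps, (z0.shape[0],))
zt = scheduler.add_noise(z0, noise, t)                      # forward diffusion in latent space

pred = unet(zt, t, encoder_hidden_states=text_emb).sample   # U-Net predicts the added noise
loss = F.mse_loss(pred, noise)                              # only the U-Net is optimized on this
```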

Downsampling Factor Selection

The paper compares different downsampling factors (f = 1, 2, 4, 8, 16, 32):

| Factor f | Latent size (512 input) | Result |
| --- | --- | --- |
| f = 1 | 512×512 | Same as pixel space; slow |
| f = 4 | 128×128 | Best quality-speed balance |
| f = 8 | 64×64 | Fast, but some fine detail is lost |
| f = 16 | 32×32 | Fast, but quality drops |
| f = 32 | 16×16 | Significant detail loss |

Stable Diffusion uses f = 8.
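
A quick way to see the trade-off: the latent side length is just the input side length divided by f (the channel count of 4 is the paper's default latent depth).

```python
# Latent grid size for a 512x512 input at each downsampling factor f.
for f in (1, 2, 4, 8, 16, 32):
    side = 512 // f
    print(f"f={f:2d}: latent {side}x{side}x4 = {side * side * 4:,} values")
```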

How was it validated?

Quantitative Evaluation

Achieved state-of-the-art or competitive FID on multiple datasets:

| Dataset | FID | Notes |
| --- | --- | --- |
| CelebA-HQ (256×256) | 5.15 | Face image generation |
| FFHQ (256×256) | 4.98 | High-quality face images |
| LSUN-Churches (256×256) | 4.48 | Church images |
| LSUN-Bedrooms (256×256) | 2.95 | Bedroom images |

Compute Cost Comparison

Dramatically faster inference compared to pixel-space diffusion models:

  • LDM-4: approximately 4–8× faster than pixel-space models
  • LDM-8 (Stable Diffusion): even faster
  • Practical generation speed on a single V100 GPU

Application to Diverse Tasks

LDM was validated on tasks beyond image generation:

  • Text-to-image generation: Text conditioning via Cross-Attention
  • Inpainting: Filling in masked regions
  • Super-resolution: Low-to-high resolution conversion
  • Layout-to-image: Generating images from bounding boxes
  • Semantic image synthesis: Generating images from segmentation maps

Are there limitations?

Limitations

  • Detail reproduction: Compression into latent space can lose fine details (text, fingers, etc.)
  • Latent space bottleneck: High compression = lower quality; low compression = reduced compute savings
  • Two-phase training complexity: AutoEncoder quality affects overall quality
  • Text conditioning limits: CLIP’s 75-token limit constrains long text descriptions

The Finger Problem

The common “wrong number of fingers” issue in AI image generation is partly due to LDM’s latent space compression. Fine structures like fingers are easily lost in the latent space, making accurate reproduction difficult. This is why adding “missing fingers, extra fingers” to negative prompts helps.

Related Papers

| Paper | Relevance |
| --- | --- |
| DDPM: Denoising Diffusion Probabilistic Models (Ho et al., 2020) | Diffusion model foundations; the method LDM builds upon |
| CLIP (Radford et al., 2021) | Used for text condition encoding → CLIP Paper Summary |
| Classifier-Free Diffusion Guidance (Ho & Salimans, 2022) | Guidance method used with LDM → CFG Paper Summary |
| SDXL (Podell et al., 2023) | Improved LDM; higher-quality image generation |
| U-Net (Ronneberger et al., 2015) | Architecture used in LDM’s diffusion model |

Impact on AI image generation

LDM was scaled up and released as Stable Diffusion, and no other work has done more to democratize AI image generation. Its open-source release spawned countless derivative models, including z-image-turbo.

The “KSampler” steps and sampler settings you configure in ComfyUI workflows correspond directly to the reverse diffusion process parameters described in this paper.
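
For reference, the same knobs appear in a diffusers text-to-image call; the checkpoint ID and the DPM-Solver scheduler below are assumptions for illustration. Here num_inference_steps maps to KSampler's "steps", the scheduler choice to its "sampler_name", and guidance_scale to its "cfg".

```python
import torch
from diffusers import StableDiffusionPipeline, DPMSolverMultistepScheduler

# Assumed SD 1.5-style checkpoint; any LDM-based pipeline exposes the same parameters.
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# Swapping the scheduler is the diffusers analogue of picking a sampler in KSampler.
pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config)

image = pipe(
    "a Japanese woman in a cafe",   # prompt (conditions the U-Net via cross-attention)
    num_inference_steps=25,         # number of reverse-diffusion steps
    guidance_scale=7.5,             # classifier-free guidance strength
).images[0]
image.save("out.png")
```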