[Paper Summary] Latent Diffusion Models (Stable Diffusion) | The Core Technology of AI Image Generation

Paper Info

  • Title: High-Resolution Image Synthesis with Latent Diffusion Models
  • Authors: Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, Björn Ommer
  • Published: CVPR 2022 (Oral)
  • arXiv: 2112.10752
  • Also known as: The foundational paper for Stable Diffusion

All Stable Diffusion-family models, including z-image-turbo, are built on the architecture described in this paper. It is the single most important paper for understanding AI image generation.

What is it?

A method that performs the diffusion process in a compressed latent space rather than in pixel space.

Previous diffusion models (DDPM etc.) operated in pixel space, requiring enormous compute to generate high-resolution images. LDM uses a pre-trained AutoEncoder to compress images into low-dimensional latent representations and runs the diffusion process in that latent space, drastically reducing compute while maintaining high image quality.

What makes it better than prior work?

Pixel-space diffusion models (DDPM etc.)

  • Diffusing 512×512 images directly → hundreds of GPU-days of training compute
  • High resolution is practically infeasible
  • High quality but impractical

GAN (Generative Adversarial Networks)

  • Fast generation
  • Risk of mode collapse (stuck generating only certain patterns)
  • Unstable training

LDM’s breakthrough

| Comparison | DDPM (pixel space) | GAN | LDM |
| --- | --- | --- | --- |
| Compute cost | Very high | Low | Drastically reduced |
| Image quality | High | High | High |
| Training stability | Stable | Unstable | Stable |
| Diversity | High | Mode collapse | High |
| Conditioning | Difficult | Difficult | Flexible via Cross-Attention |

LDM achieves the best of both worlds: the quality and stability of diffusion models with the computational efficiency of GANs.

What’s the core idea?

LDM’s architecture has three components.

1. Perceptual Compression: AutoEncoder

Converts images from pixel space to latent space.

Input image (512×512×3)
    ↓ Encoder E
Latent representation z (64×64×4)  ← diffusion happens here
    ↓ Decoder D
Output image (512×512×3)

  • Compression ratio: 512×512×3 = 786,432 dimensions → 64×64×4 = 16,384 dimensions (~48× compression)
  • Preserves perceptually important information while removing high-frequency noise
  • KL or VQ regularization structures the latent space

This compression drastically reduces the compute cost of the diffusion process.
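
As a rough illustration, here is a minimal sketch that round-trips an image-sized tensor through a pretrained Stable Diffusion VAE using the Hugging Face diffusers library; the checkpoint name and the random stand-in image are assumptions for illustration, not details from the paper.

```python
import torch
from diffusers import AutoencoderKL

# Assumed checkpoint: the commonly used SD VAE released by Stability AI.
vae = AutoencoderKL.from_pretrained("stabilityai/sd-vae-ft-mse")
vae.eval()

x = torch.randn(1, 3, 512, 512)  # stand-in for a normalized RGB image in [-1, 1]

with torch.no_grad():
    # Encoder E: 512×512×3 pixels -> 64×64×4 latent (f = 8)
    z = vae.encode(x).latent_dist.sample()
    print(z.shape)      # torch.Size([1, 4, 64, 64])

    # Decoder D: 64×64×4 latent -> 512×512×3 reconstruction
    x_rec = vae.decode(z).sample
    print(x_rec.shape)  # torch.Size([1, 3, 512, 512])
```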

2. Diffusion in Latent Space: U-Net

Adds noise (diffusion) and removes noise (reverse diffusion) in the compressed latent space.

Forward process (training):

z₀ (original latent)
  → z₁ (add a little noise)
  → z₂ (add more noise)
  → ...
  → z_T (pure noise)

Reverse process (generation):

z_T (random noise)
  → z_{T-1} (remove a little noise)
  → ...
  → z₀ (clean latent)
  → Decoder D → output image

At each step, the U-Net predicts the noise contained in the current noisy latent, and the sampler uses that prediction to remove a portion of it.
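
Below is a framework-free sketch of the closed-form forward step, assuming a standard DDPM-style linear beta schedule (illustrative values, not necessarily the paper's exact settings); the U-Net itself is only referenced in a comment.

```python
import torch

# Illustrative DDPM-style noise schedule; T and the beta range are assumptions.
T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)

def add_noise(z0, t, noise):
    """Forward process q(z_t | z_0): jump directly to step t in closed form."""
    a = alphas_cumprod[t].sqrt().view(-1, 1, 1, 1)
    s = (1.0 - alphas_cumprod[t]).sqrt().view(-1, 1, 1, 1)
    return a * z0 + s * noise

z0 = torch.randn(2, 4, 64, 64)      # stand-in for clean latents from the encoder
t = torch.randint(0, T, (2,))       # random timestep per sample
noise = torch.randn_like(z0)
zt = add_noise(z0, t, noise)        # noisy latents z_t

# Training objective (conceptually): the U-Net eps_theta(z_t, t, cond) predicts `noise`,
# and the loss is the mean squared error between the prediction and the true noise.
```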

3. Text Conditioning via Cross-Attention

LDM’s revolutionary contribution is enabling flexible conditioning via Cross-Attention layers.

Prompt "a Japanese woman in a cafe"
    ↓
CLIP text encoder → text embedding (77×768)
    ↓
Cross-Attention at each U-Net layer
    ↓
Text-conditioned denoising

How Cross-Attention works:

  • Query (Q): U-Net intermediate features (image-side information)
  • Key (K): Text embedding (text-side information)
  • Value (V): Text embedding

Attention(Q, K, V) = softmax(QK^T / √d) × V

This lets each spatial position in the U-Net learn which text tokens to attend to. For example, regions that attend strongly to the token "cafe" are denoised toward cafe-related content.
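
The mechanism can be sketched as a single toy attention head with randomly initialized projections; in the real model these are learned weights inside each U-Net block, and the feature width (320 below) is only an assumed example. Shapes follow the 77×768 CLIP embedding mentioned above.

```python
import torch
import torch.nn.functional as F

def cross_attention(image_feats, text_emb, d_head=64):
    """Toy single-head cross-attention: Q from U-Net features, K/V from the text embedding.

    image_feats: (B, N, C)    flattened spatial features from a U-Net block
    text_emb:    (B, 77, 768) CLIP text embedding
    """
    B, N, C = image_feats.shape
    # Random projections stand in for the learned W_q, W_k, W_v of each layer.
    W_q = torch.randn(C, d_head)
    W_k = torch.randn(text_emb.shape[-1], d_head)
    W_v = torch.randn(text_emb.shape[-1], d_head)

    Q = image_feats @ W_q                  # (B, N, d_head)
    K = text_emb @ W_k                     # (B, 77, d_head)
    V = text_emb @ W_v                     # (B, 77, d_head)

    # Each spatial position distributes attention over the 77 text tokens.
    attn = F.softmax(Q @ K.transpose(1, 2) / d_head ** 0.5, dim=-1)  # (B, N, 77)
    return attn @ V                        # (B, N, d_head) text-informed features

out = cross_attention(torch.randn(1, 64 * 64, 320), torch.randn(1, 77, 768))
print(out.shape)  # torch.Size([1, 4096, 64])
```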

For details on CLIP’s text encoder, see the CLIP Paper Summary.

Two-Phase Training

LDM is trained in two phases:

Phase 1: AutoEncoder Training

  • Learns image compression and reconstruction
  • Acquires latent space structure

Phase 2: Diffusion Model Training

  • Freezes the Phase 1 AutoEncoder
  • Learns diffusion and reverse diffusion in latent space
  • Learns text conditioning via Cross-Attention layers

This separation allows each phase to be optimized independently, stabilizing training.
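
Here is a hypothetical Phase-2 training step sketched with diffusers components; the checkpoint names, the DDPMScheduler, and the random stand-in batch are assumptions for illustration. The point is that the VAE stays frozen while only the U-Net's noise prediction is optimized.

```python
import torch
import torch.nn.functional as F
from diffusers import AutoencoderKL, UNet2DConditionModel, DDPMScheduler

# Assumed checkpoints; any SD-1.x-compatible VAE/U-Net pair works the same way.
vae = AutoencoderKL.from_pretrained("stabilityai/sd-vae-ft-mse")
unet = UNet2DConditionModel.from_pretrained("runwayml/stable-diffusion-v1-5", subfolder="unet")
scheduler = DDPMScheduler(num_train_timesteps=1000)

vae.requires_grad_(False)  # Phase 1 weights stay frozen during Phase 2

images = torch.randn(2, 3, 512, 512)   # stand-in for a normalized training batch
text_emb = torch.randn(2, 77, 768)     # stand-in for CLIP text embeddings

with torch.no_grad():
    z0 = vae.encode(images).latent_dist.sample() * vae.config.scaling_factor

noise = torch.randn_like(z0)
t = torch.randint(0, scheduler.config.num_train_timesteps, (z0.shape[0],))
zt = scheduler.add_noise(z0, noise, t)                      # forward diffusion in latent space

pred = unet(zt, t, encoder_hidden_states=text_emb).sample   # U-Net predicts the added noise
loss = F.mse_loss(pred, noise)                              # only the U-Net is optimized on this
```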

Downsampling Factor Selection

The paper compares different downsampling factors (f = 1, 2, 4, 8, 16, 32):

| Factor f | Latent size (512 input) | Result |
| --- | --- | --- |
| f = 1 | 512×512 | Same as pixel space; slow |
| f = 4 | 128×128 | Best quality-speed balance |
| f = 8 | 64×64 | Fast, but some fine detail is lost |
| f = 16 | 32×32 | Fast, but quality drops |
| f = 32 | 16×16 | Significant detail loss |

Stable Diffusion uses f = 8.
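
A quick way to see the trade-off: the latent side length is just the input side length divided by f (the channel count of 4 is the paper's default latent depth).

```python
# Latent grid size for a 512x512 input at each downsampling factor f.
for f in (1, 2, 4, 8, 16, 32):
    side = 512 // f
    print(f"f={f:2d}: latent {side}x{side}x4 = {side * side * 4:,} values")
```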

How was it validated?

Quantitative Evaluation

Achieved state-of-the-art or competitive FID on multiple datasets:

| Dataset | FID | Notes |
| --- | --- | --- |
| CelebA-HQ (256×256) | 5.15 | Face image generation |
| FFHQ (256×256) | 4.98 | High-quality face images |
| LSUN-Churches (256×256) | 4.48 | Church images |
| LSUN-Bedrooms (256×256) | 2.95 | Bedroom images |

Compute Cost Comparison

Dramatically faster inference compared to pixel-space diffusion models:

  • LDM-4: approximately 4–8× faster than pixel-space models
  • LDM-8 (Stable Diffusion): even faster
  • Practical generation speed on a single V100 GPU

Application to Diverse Tasks

LDM was validated on tasks beyond image generation:

  • Text-to-image generation: Text conditioning via Cross-Attention
  • Inpainting: Filling in masked regions
  • Super-resolution: Low-to-high resolution conversion
  • Layout-to-image: Generating images from bounding boxes
  • Semantic image synthesis: Generating images from segmentation maps

Are there limitations?

Limitations

  • Detail reproduction: Compression into latent space can lose fine details (text, fingers, etc.)
  • Latent space bottleneck: High compression = lower quality; low compression = reduced compute savings
  • Two-phase training complexity: AutoEncoder quality affects overall quality
  • Text conditioning limits: CLIP’s 75-token limit constrains long text descriptions

The Finger Problem

The common “wrong number of fingers” issue in AI image generation is partly due to LDM’s latent space compression. Fine structures like fingers are easily lost in the latent space, making accurate reproduction difficult. This is why adding “missing fingers, extra fingers” to negative prompts helps.

Related Papers

| Paper | Relevance |
| --- | --- |
| DDPM: Denoising Diffusion Probabilistic Models (Ho et al., 2020) | Diffusion model foundations; the method LDM builds upon |
| CLIP (Radford et al., 2021) | Used for text condition encoding → CLIP Paper Summary |
| Classifier-Free Diffusion Guidance (Ho & Salimans, 2022) | Guidance method used with LDM → CFG Paper Summary |
| SDXL (Podell et al., 2023) | Improved LDM; higher-quality image generation |
| U-Net (Ronneberger et al., 2015) | Architecture used in LDM’s diffusion model |

Impact on AI image generation

LDM was scaled up and released as Stable Diffusion, and no other work has done more to democratize AI image generation. Its open-source release spawned countless derivative models, including z-image-turbo.

The “KSampler” steps and sampler settings you configure in ComfyUI workflows correspond directly to the reverse diffusion process parameters described in this paper.
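
For reference, the same knobs appear in a diffusers text-to-image call; the checkpoint ID and the DPM-Solver scheduler below are assumptions for illustration. Here num_inference_steps maps to KSampler's "steps", the scheduler choice to its "sampler_name", and guidance_scale to its "cfg".

```python
import torch
from diffusers import StableDiffusionPipeline, DPMSolverMultistepScheduler

# Assumed SD 1.5-style checkpoint; any LDM-based pipeline exposes the same parameters.
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# Swapping the scheduler is the diffusers analogue of picking a sampler in KSampler.
pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config)

image = pipe(
    "a Japanese woman in a cafe",   # prompt (conditions the U-Net via cross-attention)
    num_inference_steps=25,         # number of reverse-diffusion steps
    guidance_scale=7.5,             # classifier-free guidance strength
).images[0]
image.save("out.png")
```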