Paper Info
- Title: High-Resolution Image Synthesis with Latent Diffusion Models
- Authors: Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, Björn Ommer
- Published: CVPR 2022 (Oral)
- arXiv: 2112.10752
- Also known as: The foundational paper for Stable Diffusion
All Stable Diffusion-family models, including z-image-turbo, are built on the architecture described in this paper. It is the single most important paper for understanding AI image generation.
What is it?
A method that performs the diffusion process in a compressed latent space rather than in pixel space.
Previous diffusion models (DDPM etc.) operated in pixel space, requiring enormous compute to generate high-resolution images. LDM uses a pre-trained AutoEncoder to compress images into low-dimensional latent representations and runs the diffusion process in that latent space, drastically reducing compute while maintaining high image quality.
What makes it better than prior work?
Pixel-space diffusion models (DDPM etc.)
- Diffusing 512×512 images directly → hundreds of GPU-days of training compute
- High resolution is practically infeasible
- High quality but impractical
GAN (Generative Adversarial Networks)
- Fast generation
- Risk of mode collapse (stuck generating only certain patterns)
- Unstable training
LDM’s breakthrough
| Comparison | DDPM (pixel space) | GAN | LDM |
|---|---|---|---|
| Compute cost | Very high | Low | Drastically reduced |
| Image quality | High | High | High |
| Training stability | Stable | Unstable | Stable |
| Diversity | High | Mode collapse | High |
| Conditioning | Difficult | Difficult | Flexible via Cross-Attention |
LDM achieves the best of both worlds: the quality and stability of diffusion models with the computational efficiency of GANs.
What’s the core idea?
LDM’s architecture has three components.
1. Perceptual Compression: AutoEncoder
Converts images from pixel space to latent space.
Input image (512×512×3)
↓ Encoder E
Latent representation z (64×64×4) ← diffusion happens here
↓ Decoder D
Output image (512×512×3)
- Compression ratio: 512×512×3 = 786,432 dimensions → 64×64×4 = 16,384 dimensions (~48× compression)
- Preserves perceptually important information while removing high-frequency noise
- KL or VQ regularization structures the latent space
This compression drastically reduces the compute cost of the diffusion process.
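For intuition, here is a minimal sketch of the round trip using the Hugging Face diffusers library and the f = 8 KL autoencoder released for Stable Diffusion (checkpoint choice and call pattern are illustrative, not from the paper):

```python
import torch
from diffusers import AutoencoderKL

# f=8 KL-regularized autoencoder (4 latent channels); checkpoint is illustrative.
vae = AutoencoderKL.from_pretrained("stabilityai/sd-vae-ft-mse")

x = torch.randn(1, 3, 512, 512)             # stand-in for an image scaled to [-1, 1]
with torch.no_grad():
    z = vae.encode(x).latent_dist.sample()  # latent z: (1, 4, 64, 64)
    x_rec = vae.decode(z).sample            # reconstruction: (1, 3, 512, 512)

print(x.numel() / z.numel())                # 48.0, the ~48x compression above
# Stable Diffusion additionally scales z by 0.18215 before running diffusion on it.
```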
2. Diffusion in Latent Space: U-Net
Adds noise (diffusion) and removes noise (reverse diffusion) in the compressed latent space.
Forward process (training):
z₀ (original latent)
→ z₁ (add a little noise)
→ z₂ (add more noise)
→ ...
→ z_T (pure noise)
Reverse process (generation):
z_T (random noise)
→ z_{T-1} (remove a little noise)
→ ...
→ z₀ (clean latent)
→ Decoder D → output image
The U-Net is trained to predict the noise that was added at each step, which in practice means estimating "how much noise to remove from this noisy latent."
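Here is a minimal PyTorch sketch of both processes, assuming a hypothetical `unet(z_t, t, cond)` noise-prediction network and a simple linear noise schedule (real implementations use refined schedules and faster samplers such as DDIM):

```python
import torch
import torch.nn.functional as F

T = 1000
betas = torch.linspace(1e-4, 0.02, T)       # linear noise schedule (illustrative)
alphas = 1.0 - betas
abar = torch.cumprod(alphas, dim=0)         # cumulative signal level per step

def training_loss(unet, z0, cond):
    """Forward process: noise a clean latent, train the U-Net to predict the noise."""
    t = torch.randint(0, T, (z0.shape[0],))         # random timestep per sample
    eps = torch.randn_like(z0)                      # the noise to be predicted
    a = abar[t].view(-1, 1, 1, 1)
    z_t = a.sqrt() * z0 + (1.0 - a).sqrt() * eps    # q(z_t | z_0) in closed form
    return F.mse_loss(unet(z_t, t, cond), eps)      # ||eps - eps_theta||^2

@torch.no_grad()
def sample(unet, cond, shape=(1, 4, 64, 64)):
    """Reverse process: start from pure noise, denoise step by step."""
    z = torch.randn(shape)
    for t in reversed(range(T)):
        eps = unet(z, torch.full((shape[0],), t), cond)
        mu = (z - betas[t] / (1.0 - abar[t]).sqrt() * eps) / alphas[t].sqrt()
        if t > 0:                           # add sampling noise except at the last step
            mu = mu + betas[t].sqrt() * torch.randn_like(z)
        z = mu
    return z                                # hand z_0 to the decoder D for pixels
```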
3. Text Conditioning via Cross-Attention
LDM’s revolutionary contribution is enabling flexible conditioning via Cross-Attention layers.
Prompt "a Japanese woman in a cafe"
↓
CLIP text encoder → text embedding (77×768)
↓
Cross-Attention at each U-Net layer
↓
Text-conditioned denoising
How Cross-Attention works:
- Query (Q): U-Net intermediate features (image-side information)
- Key (K): Text embedding (text-side information)
- Value (V): Text embedding
Attention(Q, K, V) = softmax(QK^T / √d) × V
This lets each spatial position in the U-Net learn "which part of the text to attend to." For example, the region corresponding to "cafe" will be generated with cafe-related elements.
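To make the Q/K/V roles concrete, here is a minimal single-head cross-attention module in PyTorch. Dimensions follow SD v1 conventions (768-dim CLIP text embeddings); the real blocks are multi-head, so this is a simplification:

```python
import torch
import torch.nn as nn

class CrossAttention(nn.Module):
    """Single-head cross-attention: image features query text tokens."""
    def __init__(self, d_img=320, d_txt=768, d_head=64):
        super().__init__()
        self.to_q = nn.Linear(d_img, d_head, bias=False)  # Q from U-Net features
        self.to_k = nn.Linear(d_txt, d_head, bias=False)  # K from text embedding
        self.to_v = nn.Linear(d_txt, d_head, bias=False)  # V from text embedding
        self.to_out = nn.Linear(d_head, d_img)
        self.scale = d_head ** -0.5                       # the 1/sqrt(d) factor

    def forward(self, x, text):
        # x: (B, H*W, d_img) flattened U-Net features; text: (B, 77, d_txt)
        q, k, v = self.to_q(x), self.to_k(text), self.to_v(text)
        attn = torch.softmax(q @ k.transpose(-2, -1) * self.scale, dim=-1)
        return self.to_out(attn @ v)                      # (B, H*W, d_img)
```

Each row of `attn` is a softmax distribution over the 77 text tokens: exactly the per-position "which part of the text to attend to" weighting described above.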
For details on CLIP’s text encoder, see the CLIP Paper Summary.
Two-Phase Training
LDM is trained in two phases:
Phase 1: AutoEncoder Training
- Learns image compression and reconstruction
- Acquires latent space structure
Phase 2: Diffusion Model Training
- Freezes the Phase 1 AutoEncoder
- Learns diffusion and reverse diffusion in latent space
- Learns text conditioning via Cross-Attention layers
This separation allows each phase to be optimized independently, stabilizing training.
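A rough sketch of what this separation looks like in Phase 2 training code, reusing the hypothetical `vae`, `unet`, and `training_loss` names from the sketches above (`images` and `text_embeddings` stand in for a real data pipeline):

```python
import torch

# Phase 2: the Phase-1 autoencoder is frozen; only the U-Net
# (including its cross-attention layers) receives gradients.
for p in vae.parameters():
    p.requires_grad_(False)
vae.eval()

optimizer = torch.optim.AdamW(unet.parameters(), lr=1e-4)

with torch.no_grad():                                # frozen encoder needs no grads
    z0 = vae.encode(images).latent_dist.sample()     # images -> latents, once per batch
loss = training_loss(unet, z0, text_embeddings)      # diffusion loss in latent space
loss.backward()
optimizer.step()
```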
Downsampling Factor Selection
The paper compares different compression ratios (f = 1, 2, 4, 8, 16, 32):
| Factor f | Latent size (512 input) | Result |
|---|---|---|
| f = 1 | 512×512 | Same as pixel space. Slow |
| f = 4 | 128×128 | Best quality-speed balance |
| f = 8 | 64×64 | Fast but some fine detail lost |
| f = 16 | 32×32 | Fast but quality drops |
| f = 32 | 16×16 | Significant detail loss |
Stable Diffusion uses f = 8.
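Why the factor matters so much: U-Net compute grows with the number of latent positions, and its self-attention layers grow with the square of that number, so each doubling of f cuts the spatial positions by 4×. A quick illustration:

```python
# Latent positions the U-Net must process for a 512x512 input.
# Self-attention cost scales roughly with the square of this count.
for f in (1, 2, 4, 8, 16, 32):
    side = 512 // f
    print(f"f={f:>2}: {side:>3}x{side:<3} = {side * side:>6} positions")
```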
How was it validated?
Quantitative Evaluation
Achieved state-of-the-art or competitive FID on multiple datasets:
| Dataset | FID | Notes |
|---|---|---|
| CelebA-HQ (256×256) | 5.11 | Face image generation |
| FFHQ (256×256) | 4.98 | High-quality face images |
| LSUN-Churches (256×256) | 4.02 | Church images |
| LSUN-Bedrooms (256×256) | 2.95 | Bedroom images |
Compute Cost Comparison
Dramatically faster inference compared to pixel-space diffusion models:
- LDM-4: approximately 4–8× faster than pixel-space models
- LDM-8 (Stable Diffusion): even faster
- Practical generation speed on a single V100 GPU
Application to Diverse Tasks
LDM was validated on tasks beyond image generation:
- Text-to-image generation: Text conditioning via Cross-Attention
- Inpainting: Filling in masked regions
- Super-resolution: Low-to-high resolution conversion
- Layout-to-image: Generating images from bounding boxes
- Semantic image synthesis: Generating images from segmentation maps
Are there limitations?
Limitations
- Detail reproduction: Compression into latent space can lose fine details (text, fingers, etc.)
- Latent space bottleneck: High compression = lower quality; low compression = reduced compute savings
- Two-phase training complexity: AutoEncoder quality affects overall quality
- Text conditioning limits: CLIP’s 77-token context (roughly 75 usable tokens) truncates long text descriptions, as the sketch below verifies
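The token limit is easy to verify with the tokenizer SD v1 actually uses; a small check with Hugging Face transformers (illustrative, not from the paper):

```python
from transformers import CLIPTokenizer

# SD v1 conditions on CLIP ViT-L/14; its tokenizer truncates everything
# past 77 tokens (75 content tokens plus start/end markers).
tok = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
long_prompt = "a Japanese woman in a cafe, " * 20    # deliberately over-long
ids = tok(long_prompt, truncation=True,
          max_length=tok.model_max_length).input_ids
print(tok.model_max_length, len(ids))                # 77 77 -> the tail is dropped
```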
The Finger Problem
The common “wrong number of fingers” issue in AI image generation is partly due to LDM’s latent space compression. Fine structures like fingers are easily lost in the latent space, making accurate reproduction difficult. This is why specifying “missing fingers, extra fingers” in negative prompts helps.
What to read next
| Paper | Relevance |
|---|---|
| DDPM: Denoising Diffusion Probabilistic Models (Ho et al., 2020) | Diffusion model foundations. The method LDM builds upon |
| CLIP (Radford et al., 2021) | Used for text condition encoding → CLIP Paper Summary |
| Classifier-Free Diffusion Guidance (Ho & Salimans, 2022) | Guidance method used with LDM → CFG Paper Summary |
| SDXL (Podell et al., 2023) | Improved LDM. Higher-quality image generation |
| U-Net (Ronneberger et al., 2015) | Architecture used in LDM’s diffusion model |
Impact on AI image generation
LDM was scaled up and released as Stable Diffusion, making this arguably the most important paper in the democratization of AI image generation. Its open-source release spawned countless derivative models, including z-image-turbo.
The “KSampler” steps and sampler settings you configure in ComfyUI workflows correspond directly to the reverse diffusion process parameters described in this paper.
Related Articles
- What is z-image-turbo — An LDM-based high-speed image generation model
- ComfyUI Workflow — UI for controlling LDM parameters
- Prompt Basics — CLIP token limits and prompt word order
- CLIP Paper Summary — The foundation of text conditioning
- CFG Paper Summary — The theory behind negative prompts