Paper Info
- Title: Classifier-Free Diffusion Guidance
- Authors: Jonathan Ho, Tim Salimans (Google Brain)
- Published: 2022
- arXiv: 2207.12598
To understand how negative prompts work in AI image generation, this paper is essential. It provides the theoretical foundation behind the concepts discussed in Prompt Basics and the Negative Prompt Complete Guide.
What is it?
This paper introduces a method for controlling how closely a conditional diffusion model (e.g., a text-to-image model) follows its conditioning, such as a text prompt.
Previous methods (Classifier Guidance) required a separately trained image classifier. This method achieves equivalent or better guidance without any classifier.
In a nutshell: the model is trained to predict both a “text-conditioned image” and a “text-ignoring image,” and then the difference between the two is amplified to produce outputs that more faithfully follow the prompt.
What makes it better than prior work?
Classifier Guidance (prior method)
Proposed by Dhariwal & Nichol, 2021:
- Requires training a separate image classifier alongside the diffusion model
- Uses classifier gradients to correct noise predictions
- Generation quality is limited by classifier quality
Classifier-Free Guidance (this paper)
- No classifier needed — one diffusion model does it all
- Simple to implement (just randomly drop text conditioning during training)
- Works with any type of conditioning, not just text
- Adopted as the standard technique in Stable Diffusion, DALL-E 2, Midjourney, and other major services
What’s the core idea?
Training: Random Condition Dropping
During training, with probability p_uncond (e.g., 10–20%), the text condition is replaced with an empty string (∅). This gives one model two prediction capabilities:
- Conditional prediction ε(x_t, t, c): noise prediction conditioned on text c
- Unconditional prediction ε(x_t, t, ∅): noise prediction without text
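The condition-dropping step above can be sketched in a few lines. This is a minimal NumPy sketch, not the paper's actual code; `text_emb` and `null_emb` stand in for the text embeddings and the empty-string (∅) embedding:

```python
import numpy as np

def drop_condition(text_emb, null_emb, p_uncond=0.1, rng=None):
    """Per sample: with probability p_uncond, replace the text embedding
    with the null (empty-string) embedding used for unconditional training."""
    rng = np.random.default_rng() if rng is None else rng
    mask = rng.random(text_emb.shape[0]) < p_uncond  # True -> drop condition
    out = text_emb.copy()
    out[mask] = null_emb  # broadcast the single null embedding over dropped rows
    return out
```

Because the same network sees both conditioned and unconditioned samples, it learns both predictions with no extra parameters.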
Inference: Control via Guidance Scale
At inference, a guidance scale w is used to combine the two predictions:
ε̂ = ε(x_t, t, ∅) + w × [ε(x_t, t, c) − ε(x_t, t, ∅)]
Which simplifies to:
ε̂ = (1 − w) × ε(x_t, t, ∅) + w × ε(x_t, t, c)
| Guidance scale w | Effect |
|---|---|
| w = 0 | Unconditional only (text completely ignored) |
| w = 1 | Standard conditional prediction (no amplification) |
| w > 1 | Stronger text adherence (practical range) |
| w = 7.5 | Stable Diffusion default |
| w ≫ 1 | Over-amplified (image quality degrades) |
Intuition: The difference vector between “text-following” and “text-ignoring” directions is amplified by w. Higher w = more faithful to text, but too high = unnatural results.
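The guidance formula itself is one line of arithmetic. In this NumPy sketch, `eps_uncond` and `eps_cond` stand in for the model's two noise predictions:

```python
import numpy as np

def cfg_combine(eps_uncond, eps_cond, w):
    """Classifier-free guidance: amplify the (conditional - unconditional)
    difference by the guidance scale w.
    Algebraically equal to (1 - w) * eps_uncond + w * eps_cond."""
    return eps_uncond + w * (eps_cond - eps_uncond)
```

With w = 0 this returns the unconditional prediction, with w = 1 the plain conditional prediction, and with w > 1 it extrapolates past the conditional prediction, away from the unconditional one.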
Application to Negative Prompts
While not a direct contribution of this paper, the formula above is how negative prompts are implemented.
The “empty string” in the unconditional prediction ε(x_t, t, ∅) is replaced with another text condition (the negative prompt):
ε̂ = ε(x_t, t, c_negative) + w × [ε(x_t, t, c_positive) − ε(x_t, t, c_negative)]
In other words:
- c_positive = what you want generated (the main prompt)
- c_negative = what you don’t want (the negative prompt)
The model generates images by “moving away from the negative prompt direction, toward the positive prompt direction.”
This is why specifying (worst quality, low quality:1.4) in negative prompts improves output quality.
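To make the geometry concrete, here is a toy 1-D example (the numbers are made up purely for illustration): with w = 1 the negative branch cancels exactly, and with w > 1 the result overshoots past the positive prediction, away from the negative one.

```python
import numpy as np

# Hypothetical 1-D noise predictions, chosen only to make the geometry visible
eps_pos = np.array([1.0])   # conditioned on the main (positive) prompt
eps_neg = np.array([-1.0])  # conditioned on the negative prompt

def cfg_negative(eps_neg, eps_pos, w):
    """Negative-prompt CFG: the unconditional branch is replaced by
    the prediction conditioned on the negative prompt."""
    return eps_neg + w * (eps_pos - eps_neg)

print(cfg_negative(eps_neg, eps_pos, 1.0))  # [1.]  -> exactly eps_pos
print(cfg_negative(eps_neg, eps_pos, 7.5))  # [14.] -> pushed far from eps_neg
```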
Relationship to z-image-turbo
Turbo/distilled models like z-image-turbo bake guidance into the model itself through distillation, running at a default CFG of 1.0. Unlike standard Stable Diffusion models that require CFG around 7.5, they follow prompts without inference-time CFG amplification.
However, negative prompts do not work at CFG=1.0. Substituting w=1.0 into the formula above cancels out the negative prompt term, leaving only the positive prompt. If you need negative prompts for quality control, use a standard Stable Diffusion model that supports CFG > 1.0.
How was it validated?
Evaluation Metrics
- FID (Fréchet Inception Distance): Image quality (lower is better)
- IS (Inception Score): Quality and diversity balance (higher is better)
Key Results
Note that the paper's experiments are class-conditional ImageNet generation (64×64 and 128×128), not text-to-image. The central finding is a quality/fidelity trade-off controlled by the guidance weight:
| Guidance weight | FID | IS |
|---|---|---|
| None | Baseline | Baseline |
| Small | Best (lowest) | Improved |
| Large | Degrades (over-guided) | Best (highest) |
Text-to-image defaults such as w = 7.5 in Stable Diffusion come from later practice, not from this paper's experiments.
The method achieves generation quality comparable to or better than Classifier Guidance, without the cost and complexity of training a separate classifier.
Are there limitations?
Trade-offs
- Quality vs. diversity: Raising the guidance scale improves quality but reduces variety in generated images
- Optimal w is dataset-dependent: The best guidance scale varies by image type
Limitations
- Guidance scale selection is empirical (no theoretical derivation of the optimal value)
- The training-time dropout probability p_uncond requires tuning
- Artifacts from excessive guidance (color saturation, unnatural textures)
Computational Cost
Inference requires two forward passes (conditional + unconditional), roughly doubling compute cost. Subsequent research proposed distillation methods to address this.
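A common implementation trick (used by typical Stable Diffusion pipelines) is to stack the unconditional and conditional inputs into one batch, so both predictions come from a single model call; this halves the number of calls but not the FLOPs. A minimal sketch with a stand-in model (`toy_model` is hypothetical, replacing the real noise-prediction network):

```python
import numpy as np

def toy_model(x, cond):
    """Stand-in for a noise-prediction network (hypothetical toy formula)."""
    return x * 0.5 + cond

def guided_step(x, cond_emb, null_emb, w):
    # Stack [uncond, cond] along the batch axis: one call, 2x the work.
    batch_x = np.stack([x, x])
    batch_cond = np.stack([null_emb, cond_emb])
    eps_uncond, eps_cond = toy_model(batch_x, batch_cond)
    return eps_uncond + w * (eps_cond - eps_uncond)
```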
What to read next
| Paper | Relevance |
|---|---|
| Diffusion Models Beat GANs on Image Synthesis (Dhariwal & Nichol, 2021) | The original Classifier Guidance paper — direct predecessor |
| CLIP (Radford et al., 2021) | Used for text condition embedding → CLIP Paper Summary |
| Latent Diffusion Models (Rombach et al., 2022) | The foundation of Stable Diffusion, which uses this method → LDM Paper Summary |
| Progressive Distillation for Fast Sampling of Diffusion Models | Distillation approach to CFG’s compute cost problem |
| SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis | Improved LDM. Includes practical CFG usage |
Impact on AI image generation
Classifier-Free Diffusion Guidance has become a standard component in modern text-to-image models. Stable Diffusion, DALL-E 2, Imagen, Midjourney — all use this method. The “guidance scale” and “negative prompt” that users interact with daily are grounded in this paper’s theory.
Related Articles
- Prompt Basics — Emphasis syntax and negative prompt practice
- Negative Prompt Complete Guide — Using negative prompts based on CFG
- CLIP Paper Summary — Text condition embedding
- LDM Paper Summary — The full picture of latent diffusion models