[Paper Summary] Classifier-Free Diffusion Guidance | The Theory Behind Negative Prompts

Paper Info

  • Title: Classifier-Free Diffusion Guidance
  • Authors: Jonathan Ho, Tim Salimans (Google Brain)
  • Published: 2022
  • arXiv: 2207.12598

To understand how negative prompts work in AI image generation, this paper is essential. It provides the theoretical foundation behind the concepts discussed in Prompt Basics and the Negative Prompt Complete Guide.

What is it?

This paper introduces a method for controlling how closely a conditional diffusion model, such as a text-to-image model, follows its conditioning signal (e.g., a text prompt).

Previous methods (Classifier Guidance) required a separately trained image classifier. This method achieves equivalent or better guidance without any classifier.

In a nutshell: the model is trained to make both a “text-conditioned” prediction and a “text-ignoring” prediction, and the difference between the two is then amplified to produce outputs that follow the prompt more faithfully.

What makes it better than prior work?

Classifier Guidance (prior method)

Proposed by Dhariwal & Nichol, 2021:

  • Requires training a separate image classifier alongside the diffusion model
  • Uses classifier gradients to correct noise predictions
  • Generation quality is limited by classifier quality

Classifier-Free Guidance (this paper)

  • No classifier needed — one diffusion model does it all
  • Simple to implement (just randomly drop text conditioning during training)
  • Works with any type of conditioning, not just text
  • Adopted as the standard technique in Stable Diffusion, DALL-E 2, Midjourney, and other major services

What’s the core idea?

Training: Random Condition Dropping

During training, with probability p_uncond (e.g., 10–20%), the text condition is replaced with an empty string (∅). This gives one model two prediction capabilities:

  • Conditional prediction ε(x_t, t, c): noise prediction conditioned on text c
  • Unconditional prediction ε(x_t, t, ∅): noise prediction without text
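
The training change is small enough to sketch in a few lines. Below is a minimal PyTorch-style sketch, assuming placeholder components model, text_encoder, null_embedding, and a precomputed alphas_cumprod noise schedule (these names are illustrative, not the paper's code):

```python
import torch
import torch.nn.functional as F

def cfg_training_step(model, text_encoder, null_embedding, alphas_cumprod,
                      x0, captions, p_uncond=0.1):
    """One denoising training step with random condition dropping."""
    b = x0.shape[0]
    t = torch.randint(0, len(alphas_cumprod), (b,), device=x0.device)

    # Forward diffusion: x_t = sqrt(a_bar) * x0 + sqrt(1 - a_bar) * noise
    noise = torch.randn_like(x0)
    a_bar = alphas_cumprod[t].view(b, 1, 1, 1)
    x_t = a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * noise

    # Text condition c, replaced by the null condition ∅ with probability p_uncond
    cond = text_encoder(captions)
    drop = torch.rand(b, device=x0.device) < p_uncond
    cond[drop] = null_embedding

    # A single network learns both ε(x_t, t, c) and ε(x_t, t, ∅)
    eps_pred = model(x_t, t, cond)
    return F.mse_loss(eps_pred, noise)
```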

Inference: Control via Guidance Scale

At inference, a guidance scale w is used to combine the two predictions:

ε̂ = ε(x_t, t, ∅) + w × [ε(x_t, t, c) − ε(x_t, t, ∅)]

Expanding, this is equivalent to:

ε̂ = (1 − w) × ε(x_t, t, ∅) + w × ε(x_t, t, c)
| Guidance scale w | Effect |
| --- | --- |
| w = 0 | Unconditional only (text completely ignored) |
| w = 1 | Standard conditional prediction (no amplification) |
| w > 1 | Stronger text adherence (practical range) |
| w = 7.5 | Stable Diffusion default |
| w ≫ 1 | Over-amplified (image quality degrades) |

Intuition: The difference vector between “text-following” and “text-ignoring” directions is amplified by w. Higher w = more faithful to text, but too high = unnatural results.
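
At sampling time, the combination itself is one line. A minimal sketch, assuming a noise-prediction network model and precomputed text and empty-prompt embeddings (placeholder names, not the paper's code):

```python
import torch

@torch.no_grad()
def guided_eps(model, x_t, t, cond_emb, uncond_emb, w=7.5):
    """Classifier-free guidance: amplify the conditional direction by w."""
    eps_cond = model(x_t, t, cond_emb)      # ε(x_t, t, c)
    eps_uncond = model(x_t, t, uncond_emb)  # ε(x_t, t, ∅)
    # ε̂ = ε_∅ + w · (ε_c − ε_∅)  ==  (1 − w) · ε_∅ + w · ε_c
    return eps_uncond + w * (eps_cond - eps_uncond)
```

The guided ε̂ then replaces the plain noise prediction in whatever sampler is used (DDPM, DDIM, etc.).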

Application to Negative Prompts

While not a direct contribution of this paper, the formula above is how negative prompts are implemented.

The “empty string” in the unconditional prediction ε(x_t, t, ∅) is replaced with another text condition (the negative prompt):

ε̂ = ε(x_t, t, c_negative) + w × [ε(x_t, t, c_positive) − ε(x_t, t, c_negative)]

In other words:

  • c_positive = what you want generated (the main prompt)
  • c_negative = what you don’t want (the negative prompt)

The model generates images by “moving away from the negative prompt direction, toward the positive prompt direction.”

This is why specifying (worst quality, low quality:1.4) in negative prompts improves output quality.
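
In code, the only change from the sketch above is which embedding feeds the second branch; a hedged sketch with the same placeholder names:

```python
import torch

@torch.no_grad()
def guided_eps_negative(model, x_t, t, pos_emb, neg_emb, w=7.5):
    """Negative prompting: the ∅ branch is replaced by the negative-prompt embedding."""
    eps_pos = model(x_t, t, pos_emb)   # ε(x_t, t, c_positive)
    eps_neg = model(x_t, t, neg_emb)   # ε(x_t, t, c_negative)
    # Move away from the negative-prompt direction, toward the positive-prompt direction
    return eps_neg + w * (eps_pos - eps_neg)
```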

Relationship to z-image-turbo

Turbo/distilled models like z-image-turbo bake guidance into the model itself through distillation, running at a default CFG of 1.0. Unlike standard Stable Diffusion models that require CFG around 7.5, they follow prompts without inference-time CFG amplification.

However, negative prompts do not work at CFG=1.0. Substituting w=1.0 into the formula above cancels out the negative prompt term, leaving only the positive prompt. If you need negative prompts for quality control, use a standard Stable Diffusion model that supports CFG > 1.0.
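
Substituting w = 1 into the negative-prompt formula makes the cancellation explicit:

ε̂ = ε(x_t, t, c_negative) + 1 × [ε(x_t, t, c_positive) − ε(x_t, t, c_negative)] = ε(x_t, t, c_positive)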

How was it validated?

Evaluation Metrics

  • FID (Fréchet Inception Distance): Image quality (lower is better)
  • IS (Inception Score): Quality and diversity balance (higher is better)

Key Results

| Condition | FID | Notes |
| --- | --- | --- |
| CFG w = 1.0 (no guidance) | High | Weak text adherence |
| CFG w = 3.0 | Improved | Starts to balance |
| CFG w = 7.5 | Best quality | On COCO dataset |
| CFG w = 15.0 | Slight degradation | Over-guided |

The method achieves generation quality comparable to or better than Classifier Guidance, without the cost of training and running a separate classifier.

Are there limitations?

Trade-offs

  • Quality vs. diversity: Raising the guidance scale improves quality but reduces variety in generated images
  • Optimal w is dataset-dependent: The best guidance scale varies by image type

Limitations

  • Guidance scale selection is empirical (no theoretical derivation of the optimal value)
  • The training-time dropout probability p_uncond requires tuning
  • Artifacts from excessive guidance (color saturation, unnatural textures)

Computational Cost

Inference requires two forward passes (conditional + unconditional), roughly doubling compute cost. Subsequent research proposed distillation methods to address this.
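
In practice the two passes are usually issued as one doubled batch: the total compute is the same, but only a single model call is made per step. A sketch with the same placeholder names as above (this batching trick is a common implementation convention, not part of the paper):

```python
import torch

@torch.no_grad()
def guided_eps_batched(model, x_t, t, cond_emb, uncond_emb, w=7.5):
    """Compute conditional and unconditional predictions in one doubled batch."""
    x_in = torch.cat([x_t, x_t], dim=0)
    t_in = torch.cat([t, t], dim=0)
    emb_in = torch.cat([cond_emb, uncond_emb], dim=0)
    eps_cond, eps_uncond = model(x_in, t_in, emb_in).chunk(2, dim=0)
    return eps_uncond + w * (eps_cond - eps_uncond)
```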

Related Papers

| Paper | Relevance |
| --- | --- |
| Diffusion Models Beat GANs on Image Synthesis (Dhariwal & Nichol, 2021) | The original Classifier Guidance paper and direct predecessor |
| CLIP (Radford et al., 2021) | Used for text condition embedding → CLIP Paper Summary |
| Latent Diffusion Models (Rombach et al., 2022) | The foundation of Stable Diffusion, which uses this method → LDM Paper Summary |
| Progressive Distillation for Fast Sampling of Diffusion Models | Distillation approach to CFG's compute cost problem |
| SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis | Improved LDM; includes practical CFG usage |

Impact on AI image generation

Classifier-Free Diffusion Guidance has become a standard component in modern text-to-image models. Stable Diffusion, DALL-E 2, Imagen, Midjourney — all use this method. The “guidance scale” and “negative prompt” that users interact with daily are grounded in this paper’s theory.