[Paper Summary] Classifier-Free Diffusion Guidance | The Theory Behind Negative Prompts

Paper Info

  • Title: Classifier-Free Diffusion Guidance
  • Authors: Jonathan Ho, Tim Salimans (Google Brain)
  • Published: 2022
  • arXiv: 2207.12598

To understand how negative prompts work in AI image generation, this paper is essential. It provides the theoretical foundation behind the concepts discussed in Prompt Basics and the Negative Prompt Complete Guide.

What is it?

This paper introduces a method for controlling how closely a conditional diffusion model, such as a text-to-image model, follows its conditioning signal (e.g., a text prompt).

Previous methods (Classifier Guidance) required a separately trained image classifier. This method achieves equivalent or better guidance without any classifier.

In a nutshell: the model is trained to make both a “text-conditioned” prediction and a “text-ignoring” prediction, and the difference between the two is then amplified to produce outputs that follow the prompt more faithfully.

What makes it better than prior work?

Classifier Guidance (prior method)

Proposed by Dhariwal & Nichol, 2021:

  • Requires training a separate image classifier alongside the diffusion model
  • Uses classifier gradients to correct noise predictions
  • Generation quality is limited by classifier quality

Classifier-Free Guidance (this paper)

  • No classifier needed — one diffusion model does it all
  • Simple to implement (just randomly drop text conditioning during training)
  • Works with any type of conditioning, not just text
  • Adopted as the standard technique in Stable Diffusion, DALL-E 2, Midjourney, and other major services

What’s the core idea?

Training: Random Condition Dropping

During training, with probability p_uncond (e.g., 10–20%), the text condition is replaced with an empty string (∅). This gives one model two prediction capabilities:

  • Conditional prediction ε(x_t, t, c): noise prediction conditioned on text c
  • Unconditional prediction ε(x_t, t, ∅): noise prediction without text
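
The training change is small enough to sketch in a few lines. Below is a minimal PyTorch-style sketch, assuming placeholder components model, text_encoder, null_embedding, and a precomputed alphas_cumprod noise schedule (these names are illustrative, not the paper's code):

```python
import torch
import torch.nn.functional as F

def cfg_training_step(model, text_encoder, null_embedding, alphas_cumprod,
                      x0, captions, p_uncond=0.1):
    """One denoising training step with random condition dropping."""
    b = x0.shape[0]
    t = torch.randint(0, len(alphas_cumprod), (b,), device=x0.device)

    # Forward diffusion: x_t = sqrt(a_bar) * x0 + sqrt(1 - a_bar) * noise
    noise = torch.randn_like(x0)
    a_bar = alphas_cumprod[t].view(b, 1, 1, 1)
    x_t = a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * noise

    # Text condition c, replaced by the null condition ∅ with probability p_uncond
    cond = text_encoder(captions)
    drop = torch.rand(b, device=x0.device) < p_uncond
    cond[drop] = null_embedding

    # A single network learns both ε(x_t, t, c) and ε(x_t, t, ∅)
    eps_pred = model(x_t, t, cond)
    return F.mse_loss(eps_pred, noise)
```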

Inference: Control via Guidance Scale

At inference, a guidance scale w is used to combine the two predictions:

ε̂ = ε(x_t, t, ∅) + w × [ε(x_t, t, c) − ε(x_t, t, ∅)]

Expanding, this is equivalent to:

ε̂ = (1 − w) × ε(x_t, t, ∅) + w × ε(x_t, t, c)
| Guidance scale w | Effect |
| --- | --- |
| w = 0 | Unconditional only (text completely ignored) |
| w = 1 | Standard conditional prediction (no amplification) |
| w > 1 | Stronger text adherence (practical range) |
| w = 7.5 | Stable Diffusion default |
| w ≫ 1 | Over-amplified (image quality degrades) |

Intuition: The difference vector between “text-following” and “text-ignoring” directions is amplified by w. Higher w = more faithful to text, but too high = unnatural results.
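
At sampling time, the combination itself is one line. A minimal sketch, assuming a noise-prediction network model and precomputed text and empty-prompt embeddings (placeholder names, not the paper's code):

```python
import torch

@torch.no_grad()
def guided_eps(model, x_t, t, cond_emb, uncond_emb, w=7.5):
    """Classifier-free guidance: amplify the conditional direction by w."""
    eps_cond = model(x_t, t, cond_emb)      # ε(x_t, t, c)
    eps_uncond = model(x_t, t, uncond_emb)  # ε(x_t, t, ∅)
    # ε̂ = ε_∅ + w · (ε_c − ε_∅)  ==  (1 − w) · ε_∅ + w · ε_c
    return eps_uncond + w * (eps_cond - eps_uncond)
```

The guided ε̂ then replaces the plain noise prediction in whatever sampler is used (DDPM, DDIM, etc.).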

Application to Negative Prompts

While not a direct contribution of this paper, the formula above is how negative prompts are implemented.

The “empty string” in the unconditional prediction ε(x_t, t, ∅) is replaced with another text condition (the negative prompt):

ε̂ = ε(x_t, t, c_negative) + w × [ε(x_t, t, c_positive) − ε(x_t, t, c_negative)]

In other words:

  • c_positive = what you want generated (the main prompt)
  • c_negative = what you don’t want (the negative prompt)

The model generates images by “moving away from the negative prompt direction, toward the positive prompt direction.”

This is why specifying (worst quality, low quality:1.4) in negative prompts improves output quality.
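
In code, the only change from the sketch above is which embedding feeds the second branch; a hedged sketch with the same placeholder names:

```python
import torch

@torch.no_grad()
def guided_eps_negative(model, x_t, t, pos_emb, neg_emb, w=7.5):
    """Negative prompting: the ∅ branch is replaced by the negative-prompt embedding."""
    eps_pos = model(x_t, t, pos_emb)   # ε(x_t, t, c_positive)
    eps_neg = model(x_t, t, neg_emb)   # ε(x_t, t, c_negative)
    # Move away from the negative-prompt direction, toward the positive-prompt direction
    return eps_neg + w * (eps_pos - eps_neg)
```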

Relationship to z-image-turbo

Turbo/distilled models like z-image-turbo bake guidance into the model itself through distillation, running at a default CFG of 1.0. Unlike standard Stable Diffusion models that require CFG around 7.5, they follow prompts without inference-time CFG amplification.

However, negative prompts do not work at CFG=1.0. Substituting w=1.0 into the formula above cancels out the negative prompt term, leaving only the positive prompt. If you need negative prompts for quality control, use a standard Stable Diffusion model that supports CFG > 1.0.
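
Substituting w = 1 into the negative-prompt formula makes the cancellation explicit:

ε̂ = ε(x_t, t, c_negative) + 1 × [ε(x_t, t, c_positive) − ε(x_t, t, c_negative)] = ε(x_t, t, c_positive)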

How was it validated?

Evaluation Metrics

  • FID (Fréchet Inception Distance): Image quality (lower is better)
  • IS (Inception Score): Quality and diversity balance (higher is better)

Key Results

| Condition | FID | Notes |
| --- | --- | --- |
| CFG w = 1.0 (no guidance) | High | Weak text adherence |
| CFG w = 3.0 | Improved | Starts to balance |
| CFG w = 7.5 | Best quality | On COCO dataset |
| CFG w = 15.0 | Slight degradation | Over-guided |

The method achieves generation quality comparable to or better than Classifier Guidance, without the cost of training and running a separate classifier.

Are there limitations?

Trade-offs

  • Quality vs. diversity: Raising the guidance scale improves quality but reduces variety in generated images
  • Optimal w is dataset-dependent: The best guidance scale varies by image type

Limitations

  • Guidance scale selection is empirical (no theoretical derivation of the optimal value)
  • The training-time dropout probability p_uncond requires tuning
  • Artifacts from excessive guidance (color saturation, unnatural textures)

Computational Cost

Inference requires two forward passes (conditional + unconditional), roughly doubling compute cost. Subsequent research proposed distillation methods to address this.
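
In practice the two passes are usually issued as one doubled batch: the total compute is the same, but only a single model call is made per step. A sketch with the same placeholder names as above (this batching trick is a common implementation convention, not part of the paper):

```python
import torch

@torch.no_grad()
def guided_eps_batched(model, x_t, t, cond_emb, uncond_emb, w=7.5):
    """Compute conditional and unconditional predictions in one doubled batch."""
    x_in = torch.cat([x_t, x_t], dim=0)
    t_in = torch.cat([t, t], dim=0)
    emb_in = torch.cat([cond_emb, uncond_emb], dim=0)
    eps_cond, eps_uncond = model(x_in, t_in, emb_in).chunk(2, dim=0)
    return eps_uncond + w * (eps_cond - eps_uncond)
```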

Related Papers

| Paper | Relevance |
| --- | --- |
| Diffusion Models Beat GANs on Image Synthesis (Dhariwal & Nichol, 2021) | The original Classifier Guidance paper and direct predecessor |
| CLIP (Radford et al., 2021) | Used for text condition embedding → CLIP Paper Summary |
| Latent Diffusion Models (Rombach et al., 2022) | The foundation of Stable Diffusion, which uses this method → LDM Paper Summary |
| Progressive Distillation for Fast Sampling of Diffusion Models | Distillation approach to CFG's compute cost problem |
| SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis | Improved LDM; includes practical CFG usage |

Impact on AI image generation

Classifier-Free Diffusion Guidance has become a standard component in modern text-to-image models. Stable Diffusion, DALL-E 2, Imagen, Midjourney — all use this method. The “guidance scale” and “negative prompt” that users interact with daily are grounded in this paper’s theory.