The Rules of AI Image Generation Prompts | Word Order, Emphasis Syntax, and Negative Prompt Basics

Have you ever felt like “I can’t get the image I want” with AI image generation?

The truth is, prompts follow clear rules. Understanding these rules alone can have a major impact on the quality of generated images.

This article covers the basic prompt rules common to Stable Diffusion-based models, including z-image-turbo.

Prompt Word Order Rules

Different positions in a prompt have different levels of influence. This comes from how CLIP (the text encoder) processes prompts.

The Beginning Is Most Important

Elements written at the start of a prompt are most strongly reflected in the generated image. I actually ran experiments on z-image-turbo using the same seed (seed=42) and the same elements, only changing word order.

Experiment 1: Swapping the order of “portrait” and “cafe”

[Comparison images: A (portrait first) | B (cafe first)]

A: portrait first
portrait of a Japanese woman, smiling, cafe background, natural lighting, 85mm lens

B: cafe first
cafe background, natural lighting, smiling, portrait of a Japanese woman, 85mm lens

Result: Portrait-first (A) gives a head-and-shoulders composition centered on the subject. Cafe-first (B) pulls back slightly and shows the subject from about the knees up. The leading element influences the overall composition of the image.
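
When setting up a word-order comparison like this one, it helps to confirm that the two prompts really contain the same elements and differ only in order. The following is a minimal sketch in Python; the helper names split_elements and same_elements are made up for this illustration and are not part of any library.

```python
def split_elements(prompt: str) -> list[str]:
    """Split a comma-separated prompt into trimmed, non-empty elements."""
    return [part.strip() for part in prompt.split(",") if part.strip()]


def same_elements(prompt_a: str, prompt_b: str) -> bool:
    """Return True if both prompts contain the same elements, ignoring order."""
    return sorted(split_elements(prompt_a)) == sorted(split_elements(prompt_b))


prompt_a = "portrait of a Japanese woman, smiling, cafe background, natural lighting, 85mm lens"
prompt_b = "cafe background, natural lighting, smiling, portrait of a Japanese woman, 85mm lens"

print(same_elements(prompt_a, prompt_b))  # True: only the order differs
print(split_elements(prompt_b)[0])        # leading element of prompt B
```

If the check returns False, the comparison mixes a word-order change with a change of content, and the result is harder to interpret.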

Experiment 2: Changing the leading style keyword

This comparison changes only the leading style keyword, with the seed fixed at 42.

[Comparison images: A (photorealistic first) | B (anime illustration first)]

A: photorealistic first
photorealistic portrait of a Japanese woman, detailed skin texture, natural lighting, 85mm lens, professional photography

B: anime illustration first
anime illustration of a Japanese woman, detailed skin texture, natural lighting, 85mm lens, professional photography

Result: Changing the leading style keyword transformed the image from a photo with realistic skin texture to an anime-style illustration. The remaining elements (detailed skin texture, 85mm lens, etc.) are identical, but the choice of style keyword placed first determines the overall direction of the image. Note that this experiment is not a simple word-order swap but an actual keyword substitution — please interpret it as demonstrating the magnitude of style keyword influence.

About the Influence of the End

Due to CLIP’s positional encoding, elements at the end also carry some influence. Middle portions tend to have relatively weaker influence. However, this effect has not been experimentally verified in this article — it is presented as a generally discussed tendency.

Beginning (most important) → Middle (weaker) → End (some influence)

Therefore, the prompt structure should be:

  1. Beginning: Subject/theme (what to generate)
  2. Middle: Supplementary elements (outfit, pose, props, etc.)
  3. End: Quality/technical settings (camera, lighting, image quality instructions)

Example prompt with word order in mind:
portrait of a beautiful Japanese woman in her 20s, long black hair, white blouse, sitting in a modern cafe, warm afternoon sunlight, shallow depth of field, 85mm lens, professional photography

In this example:

  • Beginning: portrait of a beautiful Japanese woman in her 20s (subject)
  • Middle: long black hair, white blouse, sitting in a modern cafe (supplementary)
  • End: warm afternoon sunlight, shallow depth of field, 85mm lens, professional photography (lighting and quality)
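
The subject → supplementary → quality ordering can be enforced by assembling the prompt from three separate groups. Below is a rough sketch, assuming comma-separated prompts; build_prompt is a hypothetical helper, and placing the lighting phrase in the last group follows the structure listed earlier (lighting belongs to the end).

```python
def build_prompt(subject: str, supplementary: list[str], quality: list[str]) -> str:
    """Join prompt parts in the order subject -> supplementary -> quality."""
    parts = [subject, *supplementary, *quality]
    # Drop empty entries so optional groups do not leave stray commas.
    return ", ".join(p.strip() for p in parts if p.strip())


prompt = build_prompt(
    subject="portrait of a beautiful Japanese woman in her 20s",
    supplementary=["long black hair", "white blouse", "sitting in a modern cafe"],
    quality=[
        "warm afternoon sunlight",
        "shallow depth of field",
        "85mm lens",
        "professional photography",
    ],
)
print(prompt)
```

Keeping the groups separate makes it easy to swap the subject while leaving the rest unchanged, or to test what happens when a group is moved.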

CLIP’s 75-Token Limit

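In most Stable Diffusion-based models, the CLIP text encoder has a 77-token context window, which leaves 75 tokens for the prompt once the start and end tokens are counted. Many UIs handle longer prompts by splitting them into further 75-token chunks.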

  • The first chunk (tokens 1–75) has the strongest influence
  • Very long prompts may see the latter half have weaker effects
  • Keep important elements within the first 75 tokens for best results

In English, 1 word ≈ 1–2 tokens. 75 tokens is roughly equivalent to 40–60 words.
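
The chunking idea can be illustrated with plain Python. This is a simplified sketch: split_into_chunks is a hypothetical helper that cuts strictly every 75 tokens, while real UIs may pad chunks or prefer to break at a nearby comma. The token ids here are stand-ins, not the output of a real tokenizer.

```python
def split_into_chunks(token_ids: list[int], chunk_size: int = 75) -> list[list[int]]:
    """Split a flat list of token ids into consecutive chunks of at most chunk_size."""
    return [token_ids[i:i + chunk_size] for i in range(0, len(token_ids), chunk_size)]


# Stand-in ids for a 180-token prompt; real ids would come from a tokenizer.
token_ids = list(range(180))
chunks = split_into_chunks(token_ids)
print([len(chunk) for chunk in chunks])  # [75, 75, 30]
```

Anything that lands in the second or third chunk is outside the first 75 tokens, which is why the important elements belong near the start.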

Concrete examples of token counting

The CLIP tokenizer (BPE method) maps common English words to 1 token each, while uncommon words or compound words are split into subwords. “Word count” and token count do not match, so be careful.

Input | Token split | Token count
photo | photo | 1
woman | woman | 1
yukata | yuk + ata | 2
bokeh | bo + keh | 2
vignette | vig + nette | 2
rumpled | ru + mp + led | 3
close-up | close + - + up | 3
20yo | 2 + 0 + yo | 3
, (comma) | , | 1
. (period) | . | 1

Technical terms and English words derived from Japanese (yukata, bokeh, etc.) tend to be split into subwords, making actual token counts 1.3–1.5 times the word count. You can measure accurately with Python’s transformers library:

from transformers import CLIPTokenizer

# Load the tokenizer that ships with the CLIP ViT-L/14 checkpoint
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
encoded = tokenizer("your prompt here")
# input_ids includes the start (BOS) and end (EOS) tokens, so subtract 2
print(len(encoded["input_ids"]) - 2)

Emphasis Syntax

Many image generation UIs allow you to numerically adjust the influence of specific elements using (element:weight) syntax.

Basic Syntax: (element:weight)

(smiling:1.4)     → intends to emphasize "smiling" influence by 1.4x
(background:0.7)  → intends to suppress "background" influence to 0.7x

  • Default weight: 1.0 (when nothing is specified)
  • Emphasis: values greater than 1.0
  • Suppression: values less than 1.0
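
To make the syntax concrete, here is a minimal sketch of how a prompt could be read into (element, weight) pairs. It is an illustration only, not the parser of any particular UI: parse_weights and WEIGHT_PATTERN are hypothetical names, and nested parentheses or escaped characters are not handled.

```python
import re

# Matches "(text:number)", e.g. "(smiling:1.4)". Nested parentheses are not handled.
WEIGHT_PATTERN = re.compile(r"\(([^():]+):([0-9]*\.?[0-9]+)\)")


def parse_weights(prompt: str) -> list[tuple[str, float]]:
    """Return (element, weight) pairs; elements without the syntax get weight 1.0."""
    result = []
    for part in prompt.split(","):
        part = part.strip()
        if not part:
            continue
        match = WEIGHT_PATTERN.fullmatch(part)
        if match:
            result.append((match.group(1).strip(), float(match.group(2))))
        else:
            result.append((part, 1.0))
    return result


print(parse_weights("portrait of a Japanese woman, (smiling:1.4), (background:0.7)"))
```

Elements written without parentheses come back with the default weight of 1.0, matching the list above.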

Commonly Cited Weight Value Reference

Value | Intended effect
0.5–0.7 | Significantly weaken
0.8–0.9 | Slightly weaken
1.0 | Default
1.1–1.3 | Slightly emphasize
1.4–1.5 | Strongly emphasize
1.6+ | Excessive emphasis (risk of image breakdown)

Experiment: Weight Emphasis Effects in z-image-turbo

Experiment 2-A: Comparing different weights for smiling

Only the weight of smiling was varied; the seed was the same in both runs (seed=42).

[Comparison images: (smiling:1.0) | (smiling:1.4)]

smiling:1.0 (default)
portrait of a Japanese woman, (smiling:1.0), cafe background, natural lighting, 85mm lens

smiling:1.4 (emphasized)
portrait of a Japanese woman, (smiling:1.4), cafe background, natural lighting, 85mm lens

Result: No visible difference was confirmed.

Experiment 2-B: Additional verification across 5 categories × 3 seeds

Weight emphasis effects were also tested not just for smiling, but for composition, lighting, style, and subject attributes. Each category compared unweighted (equivalent to 1.0) against (element:1.4) across 3 seeds (seed=42, 7295072554507705269, 4517457392071889496).

Category | Parameter | 1.0 vs 1.4 difference
Expression | smiling | No difference (3/3 seeds)
Composition | from below | No difference (3/3 seeds)
Lighting | strong backlighting | No difference (3/3 seeds)
Style | film grain | No difference (3/3 seeds)
Subject attribute | freckles | No difference (3/3 seeds)

For detailed comparison images, see Weight Syntax Category Verification.

Result: In z-image-turbo, no change in attribute strength/weakness due to weight values was confirmed across all 5 categories.
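
For anyone who wants to repeat this kind of check, the run matrix can be generated programmatically. The sketch below only builds the list of prompts and seeds; it does not call any image generator. The base prompt in TEMPLATE is an assumption borrowed from the smiling experiment, since the prompts used for the other categories are not shown here, and build_runs is a hypothetical helper.

```python
from itertools import product

SEEDS = [42, 7295072554507705269, 4517457392071889496]
PARAMETERS = {
    "Expression": "smiling",
    "Composition": "from below",
    "Lighting": "strong backlighting",
    "Style": "film grain",
    "Subject attribute": "freckles",
}
# Assumed base prompt; only the prompt for "smiling" is given in the article.
TEMPLATE = "portrait of a Japanese woman, {element}, cafe background, natural lighting, 85mm lens"


def build_runs(weight: float = 1.4) -> list[dict]:
    """Return one run per (category, seed, variant) combination."""
    runs = []
    for (category, element), seed in product(PARAMETERS.items(), SEEDS):
        variants = {"unweighted": element, "weighted": f"({element}:{weight})"}
        for variant, text in variants.items():
            runs.append({
                "category": category,
                "seed": seed,
                "variant": variant,
                "prompt": TEMPLATE.format(element=text),
            })
    return runs


runs = build_runs()
print(len(runs))  # 5 categories x 3 seeds x 2 variants = 30
print(runs[1]["prompt"])
```

Each pair of runs with the same category and seed can then be rendered and compared side by side.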

Note: Even with a fixed seed, smiling and (smiling:1.4) do produce differences in composition, outfit, and face. This is not the effect of the weight value. It is a side effect of the added parentheses, colon, and number changing the token sequence as a whole. "No difference" in the table refers to the strength of the targeted attribute.

Practical takeaway: To change output in z-image-turbo, adjust word order and element selection (including or excluding an element, placing it at the start or end) rather than fine-tuning weight values.

Handling in Other Models

The above results are for z-image-turbo (a distilled model with CFG=1.0). Models with CFG greater than 1.0 (Stable Diffusion 1.5, SDXL, etc.) may have functional weight syntax. Check the documentation for the model you’re using.

Nested Parentheses for Emphasis

Some UIs support nested parentheses for emphasis:

(smiling)     → 1.1x
((smiling))   → 1.21x (1.1 × 1.1)
(((smiling))) → 1.331x (1.1 × 1.1 × 1.1)

In z-image-turbo, no effect has been confirmed for this method either.
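
The multiplication behind nested parentheses is simple to reproduce. This is a small sketch assuming the 1.1x-per-layer convention shown above; nested_weight and paren_depth are hypothetical helpers, and paren_depth assumes the whole string is a single wrapped element.

```python
def nested_weight(depth: int, base: float = 1.1) -> float:
    """Weight implied by `depth` layers of parentheses, each multiplying by `base`."""
    if depth < 0:
        raise ValueError("depth must be non-negative")
    return base ** depth


def paren_depth(token: str) -> int:
    """Count layers of surrounding parentheses, e.g. '((smiling))' -> 2."""
    depth = 0
    while token.startswith("(") and token.endswith(")"):
        token = token[1:-1]
        depth += 1
    return depth


for text in ["smiling", "(smiling)", "((smiling))", "(((smiling)))"]:
    print(text, round(nested_weight(paren_depth(text)), 3))
```

The results are rounded because repeated floating-point multiplication does not give exactly 1.21 or 1.331.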

About Negative Prompts

Negative prompts are a mechanism for specifying elements you don’t want generated, based on Classifier-Free Guidance (CFG).

Important: Negative prompts do not function in z-image-turbo. z-image-turbo is a distilled model operating at CFG=1.0, so the negative prompt mechanism does not work. For improving image quality in z-image-turbo, optimizing positive prompts is effective. See Prompt Best Practices for details.
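
The arithmetic behind this can be shown with the standard CFG combination, in which the negative prompt takes the place of the unconditional prediction. The sketch below uses toy numbers instead of real model outputs, and cfg_combine is a hypothetical helper; whether a given z-image-turbo pipeline evaluates the negative prompt at all is not covered here.

```python
def cfg_combine(cond: list[float], negative: list[float], scale: float) -> list[float]:
    """Classifier-free guidance: negative + scale * (cond - negative), element-wise."""
    return [n + scale * (c - n) for c, n in zip(cond, negative)]


# Toy noise predictions; real ones are tensors produced by the model.
cond = [0.5, -1.0, 2.0]
negative_a = [0.25, 0.0, 1.0]
negative_b = [4.0, -2.0, 0.5]

# At scale 1.0 the negative term cancels, so both negatives give the same result.
print(cfg_combine(cond, negative_a, 1.0))
print(cfg_combine(cond, negative_b, 1.0))
# At a higher scale the negative prediction changes the output.
print(cfg_combine(cond, negative_a, 7.0))
```

With a scale of 1.0 the expression reduces to the conditional prediction alone, so whatever is written in the negative prompt has no way to influence the result.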

For details on negative prompts in models with CFG > 1.0 (standard Stable Diffusion models, etc.), see Negative Prompt Guide.

z-image-turbo is a model known for fast generation. A prompt for it can follow this basic template:

[subject description], [supplementary description]

Since z-image-turbo produces realistic output by default, quality keywords like RAW photo or photorealistic are unnecessary (see verification results).

Parameter | Recommended value | Description
Steps | 8 | z-image-turbo can produce high-quality output with fewer steps
Sampler | euler | Fast and stable
CFG | 1.0 | Fixed. Negative prompts do not function with this setting
Size | 1024x1024 / 1280x720 | Standard to widescreen
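
The recommended values can be kept in one place and checked before a run. The sketch below is only an illustration: the key names (steps, sampler, cfg, size) and the check_settings helper are hypothetical and would need to be mapped to whatever your UI or workflow actually expects.

```python
# Recommended values from the table; key names are hypothetical.
RECOMMENDED = {"steps": 8, "sampler": "euler", "cfg": 1.0}
RECOMMENDED_SIZES = {(1024, 1024), (1280, 720)}


def check_settings(settings: dict) -> list[str]:
    """Return warnings for values that differ from the recommended ones."""
    warnings = []
    for key, expected in RECOMMENDED.items():
        if settings.get(key) != expected:
            warnings.append(f"{key}: expected {expected}, got {settings.get(key)}")
    if tuple(settings.get("size", ())) not in RECOMMENDED_SIZES:
        warnings.append(f"size: {settings.get('size')} is not a recommended size")
    return warnings


print(check_settings({"steps": 8, "sampler": "euler", "cfg": 1.0, "size": (1024, 1024)}))
print(check_settings({"steps": 20, "sampler": "euler", "cfg": 7.0, "size": (512, 512)}))
```

An empty list means the settings match the table; anything else lists the values worth double-checking.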

A ComfyUI workflow for z-image-turbo (with optimal parameter settings) is available in this article.

Summary

The three basic rules of prompts:

  1. Word order: The beginning is most important, the end also matters. Write in the order: subject → supplementary → quality
  2. Emphasis syntax: In models and UIs that support it, emphasize important elements with (element:1.3); 1.2–1.4 is the practical range. In z-image-turbo, no effect of weight values was confirmed
  3. Negative prompts: Do not function in z-image-turbo (due to CFG=1.0). Improve quality by optimizing positive prompts
