The Rules of AI Image Generation Prompts | Word Order, Emphasis Syntax, and Negative Prompt Basics

Have you ever struggled to get the image you want from AI image generation?

The truth is, prompts follow clear rules. Understanding these rules alone can have a major impact on the quality of generated images.

This article covers the basic prompt rules common to Stable Diffusion-based models, including z-image-turbo.

Prompt Word Order Rules

Different positions in a prompt have different levels of influence. This comes from how CLIP (the text encoder) processes prompts.

The Beginning Is Most Important

Elements written at the start of a prompt are most strongly reflected in the generated image. I actually ran experiments on z-image-turbo using the same seed (seed=42) and the same elements, only changing word order.

Experiment 1: Swapping the order of “portrait” and “cafe”

[Image comparison: portrait first (A) vs. cafe first (B)]

A: portrait first

portrait of a Japanese woman, smiling, cafe background, natural lighting, 85mm lens

B: cafe first

cafe background, natural lighting, smiling, portrait of a Japanese woman, 85mm lens

Result: Portrait-first (A) gives a subject-centered bust shot (head and shoulders). Cafe-first (B) pulls back slightly and shows the subject from about the knees up. The leading element influences the overall composition of the image.
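To rerun this kind of comparison, it helps to build both prompts from one shared element list so that only the order differs. The following is a minimal sketch; `build_prompt` and the index lists are illustrative names, not part of any tool used in the article.

```python
# Elements from Experiment 1, listed in the order used by prompt A.
ELEMENTS = [
    "portrait of a Japanese woman",
    "smiling",
    "cafe background",
    "natural lighting",
    "85mm lens",
]


def build_prompt(elements, order):
    """Join the elements in the given index order, separated by ', '."""
    return ", ".join(elements[i] for i in order)


prompt_a = build_prompt(ELEMENTS, [0, 1, 2, 3, 4])  # portrait first
prompt_b = build_prompt(ELEMENTS, [2, 3, 1, 0, 4])  # cafe first

print(prompt_a)
print(prompt_b)
```

Because both prompts come from the same list, any difference in the output can be attributed to order (and seed handling), not to an accidental change in wording.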

Experiment 2: Changing the leading style keyword

This comparison changes only the leading style keyword (seed=42). Because the keyword itself is swapped, it tests keyword selection rather than word order alone.

[Image comparison: photorealistic first (A) vs. anime illustration first (B)]

A: photorealistic first

photorealistic portrait of a Japanese woman, detailed skin texture, natural lighting, 85mm lens, professional photography

B: anime illustration first

anime illustration of a Japanese woman, detailed skin texture, natural lighting, 85mm lens, professional photography

Result: Changing the leading style keyword turned a photo with realistic skin texture into an anime-style illustration, even though the remaining elements (detailed skin texture, 85mm lens, etc.) are identical. Because this is a keyword substitution rather than a simple word-order swap, read it as showing how strongly the leading style keyword steers the overall direction of the image.

About the Influence of the End

Due to CLIP’s positional encoding, elements at the end are also said to carry some influence, while the middle portion tends to have relatively weaker influence. This article has not verified the effect experimentally; it is presented here as a generally discussed tendency.

Beginning (most important) → Middle (weaker) → End (some influence)

Therefore, the prompt structure should be:

Beginning: Subject/theme (what to generate)
Middle: Supplementary elements (outfit, pose, props, etc.)
End: Quality/technical settings (camera, lighting, image quality instructions)

Example prompt with word order in mind

portrait of a beautiful Japanese woman in her 20s, long black hair, white blouse, sitting in a modern cafe, warm afternoon sunlight, shallow depth of field, 85mm lens, professional photography

In this example:

Beginning: portrait of a beautiful Japanese woman in her 20s (subject)
Middle: long black hair, white blouse, sitting in a modern cafe (supplementary)
End: warm afternoon sunlight, shallow depth of field, 85mm lens, professional photography (lighting, camera, quality)
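The beginning → middle → end structure can be sketched as a small helper that keeps the three groups separate until the last moment. `assemble_prompt` is a hypothetical name, and placing "warm afternoon sunlight" in the end group is an assumption based on lighting being listed under End above.

```python
def assemble_prompt(subject, supplementary, quality):
    """Concatenate the three groups in beginning -> middle -> end order."""
    parts = [subject, *supplementary, *quality]
    # Drop empty strings so an unused group does not leave stray commas.
    return ", ".join(p.strip() for p in parts if p.strip())


prompt = assemble_prompt(
    subject="portrait of a beautiful Japanese woman in her 20s",
    supplementary=["long black hair", "white blouse", "sitting in a modern cafe"],
    # Lighting is treated as an End element here (assumption).
    quality=[
        "warm afternoon sunlight",
        "shallow depth of field",
        "85mm lens",
        "professional photography",
    ],
)
print(prompt)
```

Keeping the groups as separate arguments makes it easy to swap the subject while reusing the same supplementary and quality elements.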

CLIP’s 75-Token Limit

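CLIP’s text encoder accepts 77 tokens at a time, of which 75 are available for the prompt once the start and end tokens are counted. In most Stable Diffusion-based setups, a prompt longer than 75 tokens is split, and the overflow goes into the next 75-token chunk.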

The first chunk (tokens 1–75) has the strongest influence
In very long prompts, the latter half may have a weaker effect
Keep important elements within the first 75 tokens for best results

In English, 1 word ≈ 1–2 tokens. 75 tokens is roughly equivalent to 40–60 words.
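The chunking idea can be illustrated with plain list slicing. This is only a sketch of the concept: `split_into_chunks` is a hypothetical helper, and real UIs may also pad each chunk or prefer to break at commas, which is not modeled here.

```python
def split_into_chunks(token_ids, chunk_size=75):
    """Split a flat list of token IDs into consecutive chunks of at most chunk_size."""
    return [
        token_ids[i:i + chunk_size]
        for i in range(0, len(token_ids), chunk_size)
    ]


# Stand-in token IDs for a 160-token prompt (the values are placeholders).
fake_token_ids = list(range(160))
chunks = split_into_chunks(fake_token_ids)
print([len(chunk) for chunk in chunks])
```

Anything past index 74 lands in a later chunk, which is why the important elements belong near the front.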

Concrete examples of token counting

The CLIP tokenizer (BPE method) maps common English words to 1 token each, while uncommon words or compound words are split into subwords. “Word count” and token count do not match, so be careful.

Input	Token split	Token count
`photo`	`photo`	1
`woman`	`woman`	1
`yukata`	`yuk` + `ata`	2
`bokeh`	`bo` + `keh`	2
`vignette`	`vig` + `nette`	2
`rumpled`	`ru` + `mp` + `led`	3
`close-up`	`close` + `-` + `up`	3
`20yo`	`2` + `0` + `yo`	3
`,` (comma)	`,`	1
`.` (period)	`.`	1

Technical terms and English words derived from Japanese (yukata, bokeh, etc.) tend to be split into subwords, making actual token counts 1.3–1.5 times the word count. You can measure accurately with Python’s transformers library:

from transformers import CLIPTokenizer
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
tokens = tokenizer("your prompt here")
print(len(tokens["input_ids"]) - 2)  # Token count excluding BOS/EOS
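When the tokenizer is not installed, the 1.3–1.5 ratio mentioned above gives a rough range. The sketch below applies that ratio to a whitespace word count; `estimate_token_range` is a hypothetical helper, and the result is only an estimate that ignores how commas and rare words are actually split.

```python
def estimate_token_range(prompt):
    """Return a (low, high) token estimate from the whitespace word count."""
    words = len(prompt.split())
    # Integer arithmetic for 1.3x and 1.5x, rounded up.
    low = -(-words * 13 // 10)
    high = -(-words * 15 // 10)
    return low, high


low, high = estimate_token_range(
    "portrait of a Japanese woman, smiling, cafe background, natural lighting, 85mm lens"
)
print(f"estimated tokens: {low} to {high}")
print("likely within one chunk" if high <= 75 else "may exceed 75 tokens")
```

For anything close to the limit, measure with the real tokenizer instead of relying on this estimate.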

Emphasis Syntax

Many image generation UIs allow you to numerically adjust the influence of specific elements using (element:weight) syntax.

Basic Syntax: `(element:weight)`

(smiling:1.4)     → intends to emphasize "smiling" influence by 1.4x
(background:0.7)  → intends to suppress "background" influence to 0.7x

Default weight: 1.0 (when nothing is specified)
Emphasis: values greater than 1.0
Suppression: values less than 1.0
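To make the syntax concrete, here is a minimal sketch of how a prompt could be read into (text, weight) pairs. It is not the parser of any particular UI: `parse_weights` is a hypothetical function, it only handles flat comma-separated elements, and real implementations also deal with nesting and escaping.

```python
import re

# Matches a whole element of the form "(text:number)", e.g. "(smiling:1.4)".
WEIGHT_PATTERN = re.compile(r"\(([^():]+):([0-9]*\.?[0-9]+)\)")


def parse_weights(prompt):
    """Return a list of (text, weight) pairs, one per comma-separated element."""
    result = []
    for element in prompt.split(","):
        element = element.strip()
        if not element:
            continue
        match = WEIGHT_PATTERN.fullmatch(element)
        if match:
            result.append((match.group(1).strip(), float(match.group(2))))
        else:
            result.append((element, 1.0))  # default weight when nothing is specified
    return result


print(parse_weights("portrait of a Japanese woman, (smiling:1.4), (background:0.7)"))
```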

Commonly Cited Weight Value Reference

Value	Intended effect
0.5–0.7	Significantly weaken
0.8–0.9	Slightly weaken
1.0	Default
1.1–1.3	Slightly emphasize
1.4–1.5	Strongly emphasize
1.6+	Excessive emphasis (risk of image breakdown)

Experiment: Weight Emphasis Effects in z-image-turbo

Experiment 2-A: Comparing different weights for smiling

Only the weight of smiling was changed; the seed was fixed (seed=42).

[Image comparison: (smiling:1.0) vs. (smiling:1.4)]

smiling:1.0 (default)

portrait of a Japanese woman, (smiling:1.0), cafe background, natural lighting, 85mm lens

smiling:1.4 (emphasized)

portrait of a Japanese woman, (smiling:1.4), cafe background, natural lighting, 85mm lens

Result: No visible difference was confirmed.

Experiment 2-B: Additional verification across 5 categories × 3 seeds

Weight emphasis effects were also tested not just for smiling, but for composition, lighting, style, and subject attributes. Each category compared unweighted (equivalent to 1.0) against (element:1.4) across 3 seeds (seed=42, 7295072554507705269, 4517457392071889496).

Category	Parameter	1.0 vs 1.4 difference
Expression	smiling	No difference (3/3 seeds)
Composition	from below	No difference (3/3 seeds)
Lighting	strong backlighting	No difference (3/3 seeds)
Style	film grain	No difference (3/3 seeds)
Subject attribute	freckles	No difference (3/3 seeds)
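A 5 categories × 3 seeds grid like this one can be enumerated programmatically. The sketch below is one way to do it; the `BASE` template is a hypothetical assumption modeled on the smiling prompt, since the article gives the full prompt only for that case, and `build_cases` is an illustrative name.

```python
from itertools import product

SEEDS = [42, 7295072554507705269, 4517457392071889496]
PARAMETERS = {
    "Expression": "smiling",
    "Composition": "from below",
    "Lighting": "strong backlighting",
    "Style": "film grain",
    "Subject attribute": "freckles",
}
# Hypothetical base prompt (assumption): only the smiling prompt appears in the article.
BASE = "portrait of a Japanese woman, {element}, cafe background, natural lighting, 85mm lens"


def build_cases(weight=1.4):
    """Return one unweighted/weighted prompt pair per (category, seed) combination."""
    cases = []
    for (category, parameter), seed in product(PARAMETERS.items(), SEEDS):
        cases.append({
            "category": category,
            "seed": seed,
            "unweighted": BASE.format(element=parameter),
            "weighted": BASE.format(element=f"({parameter}:{weight})"),
        })
    return cases


cases = build_cases()
print(len(cases), "prompt pairs")
```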

For detailed comparison images, see Weight Syntax Category Verification.

Result: In z-image-turbo, no change in attribute strength/weakness due to weight values was confirmed across all 5 categories.

Note: Even with a fixed seed, the unweighted smiling and (smiling:1.4) produce images that differ in composition, outfit, and face. This is not an effect of the weight value. It is a side effect of the added parentheses, colon, and number changing the token sequence as a whole.
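The side effect described in the note can be illustrated by counting the extra symbols the weight syntax adds. The splitter below is a crude stand-in, not the CLIP tokenizer, so the exact counts from a real tokenizer may differ; `crude_tokens` is a hypothetical helper.

```python
import re


def crude_tokens(text):
    """Split into letter runs, single digits, and single punctuation marks.

    This is NOT the CLIP tokenizer; it only shows that extra symbols
    lengthen and shift the sequence.
    """
    return re.findall(r"[a-z]+|[0-9]|[^\sa-z0-9]", text.lower())


plain = "portrait of a Japanese woman, smiling, cafe background, natural lighting, 85mm lens"
weighted = "portrait of a Japanese woman, (smiling:1.4), cafe background, natural lighting, 85mm lens"

extra = len(crude_tokens(weighted)) - len(crude_tokens(plain))
print(crude_tokens("(smiling:1.4)"))
print("extra symbols when the syntax is passed through as text:", extra)
```

If a model receives these symbols as ordinary text, every element after them shifts position, which is one plausible reason the images differ even though the weight itself has no effect.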

Practical takeaway: To change output in z-image-turbo, adjust word order and element selection (including or excluding an element, placing it at the start or end) rather than fine-tuning weight values.

Handling in Other Models

The above results are for z-image-turbo (a distilled model with CFG=1.0). Models with CFG greater than 1.0 (Stable Diffusion 1.5, SDXL, etc.) may have functional weight syntax. Check the documentation for the model you’re using.

Nested Parentheses for Emphasis

Some UIs support nested parentheses for emphasis:

(smiling)     → 1.1x
((smiling))   → 1.21x (1.1 × 1.1)
(((smiling))) → 1.331x (1.1 × 1.1 × 1.1)
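The multipliers above follow from raising 1.1 to the nesting depth. A minimal sketch, assuming the 1.1 factor per level shown in the list (`count_nesting` and `nested_weight` are hypothetical helper names):

```python
def count_nesting(element):
    """Strip matching outer parentheses and return (inner_text, depth)."""
    depth = 0
    while element.startswith("(") and element.endswith(")"):
        element = element[1:-1]
        depth += 1
    return element, depth


def nested_weight(depth, base=1.1):
    """Weight implied by `depth` levels of parentheses: base ** depth."""
    return base ** depth


for example in ["smiling", "(smiling)", "((smiling))", "(((smiling)))"]:
    text, depth = count_nesting(example)
    # Rounded for display because of floating-point representation.
    print(example, "->", round(nested_weight(depth), 3))
```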

In z-image-turbo, no effect has been confirmed for this method either.

About Negative Prompts

Negative prompts are a mechanism for specifying elements you don’t want generated, based on Classifier-Free Guidance (CFG).

Important: Negative prompts do not function in z-image-turbo. It is a distilled model that runs at CFG=1.0, where the guidance term that a negative prompt relies on has no effect. To improve image quality in z-image-turbo, optimize the positive prompt instead. See Prompt Best Practices for details.
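Why CFG=1.0 disables negative prompts can be seen from the standard guidance formula, in which the negative prompt takes the place of the unconditional input: output = negative + scale × (positive − negative). The sketch below uses toy numbers in place of real model predictions; the variable names and values are illustrative only.

```python
def cfg_combine(positive, negative, scale):
    """Classifier-free guidance, element-wise: negative + scale * (positive - negative)."""
    return [n + scale * (p - n) for p, n in zip(positive, negative)]


positive = [0.5, -1.0, 2.0]      # toy prediction for the positive prompt
negative_a = [0.25, 0.5, -1.5]   # toy prediction for one negative prompt
negative_b = [4.0, 0.0, -8.0]    # toy prediction for a different negative prompt

# At scale 1.0 the negative term cancels, so both results equal `positive`.
print(cfg_combine(positive, negative_a, 1.0))
print(cfg_combine(positive, negative_b, 1.0))

# At a scale above 1.0 the negative prediction changes the result.
print(cfg_combine(positive, negative_a, 7.5))
print(cfg_combine(positive, negative_b, 7.5))
```

With scale = 1.0 the expression reduces to the positive prediction alone, so whatever is written in the negative prompt cannot influence the image.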

For details on negative prompts in models with CFG > 1.0 (standard Stable Diffusion models, etc.), see Negative Prompt Guide.

Recommended Settings for z-image-turbo

z-image-turbo is a model known for fast generation.

Recommended Prompt Structure

[subject description], [supplementary description]

Since z-image-turbo produces realistic output by default, quality keywords like RAW photo or photorealistic are unnecessary (see verification results).

Recommended Parameters

Parameter	Recommended value	Description
Steps	8	z-image-turbo can produce high-quality output with fewer steps
Sampler	euler	Fast and stable
CFG	1.0	Fixed. Negative prompts do not function with this setting
Size	1024x1024 / 1280x720	Standard to widescreen
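For scripted use, the table can be captured as a plain settings dictionary. The key names below are hypothetical and do not follow the schema of ComfyUI or any other tool; map them onto whatever your UI or API expects.

```python
# Recommended values from the table above; key names are hypothetical.
Z_IMAGE_TURBO_SETTINGS = {
    "steps": 8,
    "sampler": "euler",
    "cfg": 1.0,
    "width": 1024,
    "height": 1024,  # use 1280x720 instead for widescreen
}


def negative_prompt_has_effect(settings):
    """Per the table above, negative prompts only matter when CFG is above 1.0."""
    return settings["cfg"] > 1.0


print(negative_prompt_has_effect(Z_IMAGE_TURBO_SETTINGS))
```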

A ComfyUI workflow for z-image-turbo (with optimal parameter settings) is available in this article.

Summary

The three basic rules of prompts:

Word order: The beginning is most important, and the end also matters. Write in the order: subject → supplementary → quality
Emphasis syntax: (element:weight) adjusts emphasis in UIs and models that support it, where 1.2–1.4 is the practical range. In z-image-turbo, no effect was confirmed
Negative prompts: Do not function in z-image-turbo (due to CFG=1.0). Improve quality by optimizing positive prompts

With these rules understood, proceed to the next steps:

If you want to practice → See actual prompts in Prompt Examples Collection
If you want to design your own prompts → Read How to Think About Prompt Design
If you want verified prompt knowledge → Go to Prompt Best Practices
If you want to set up your environment → Try it in your browser with ConoHa AI Canvas Getting Started Guide

References

The theoretical background behind this article’s claims, explained using the Ochiai Method:

Paper Breakdown: CLIP — The CLIP model that vectorizes prompts. The origin of the 75-token limit (original paper)
Paper Breakdown: Latent Diffusion Models — The foundation of Stable Diffusion. The diffusion process in latent space (original paper)
Paper Breakdown: Classifier-Free Diffusion Guidance — The theoretical basis for negative prompts (original paper)

External links:

ComfyUI Official Repository — Node-based Stable Diffusion UI
z-image-turbo Official Site — Official documentation for the z-image-turbo model
