[Verified] Image Generation Prompt Best Practices

This article aggregates results from the blog’s individual verification articles, where images were actually compared and examined. Only experimentally substantiated findings are presented here — not “commonly cited techniques.”

Target Model

The findings in this article were verified in the following environment. Results may not apply to other models or parameters.

Item	Value
Model	z-image-turbo (6B parameters, photorealistic distilled model)
Inference steps	8
Sampler	euler
Scheduler	ddim_uniform
CFG	1.0 (guidance built into the model)
Image size	1024×1024
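
To make it easier to tell whether a later run still matches the verified environment, the settings in the table can be kept in one place. The following is only a minimal sketch: the class name, field names, and helper are hypothetical and do not correspond to any particular front end's API.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class GenerationSettings:
    """Settings from the table above (field names are illustrative)."""
    model: str = "z-image-turbo"
    steps: int = 8
    sampler: str = "euler"
    scheduler: str = "ddim_uniform"
    cfg: float = 1.0
    width: int = 1024
    height: int = 1024


BASELINE = GenerationSettings()


def differs_from_baseline(settings):
    """Return the names of fields that differ from the verified environment."""
    return [
        f.name
        for f in fields(settings)
        if getattr(settings, f.name) != getattr(BASELINE, f.name)
    ]
```

If `differs_from_baseline` returns a non-empty list for a run, the findings below may not carry over to it.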

Verified Effective Elements

1. Scene Description Tags Are the Primary Driver of the Image

Specific scene description tags such as `small cafe window seat, natural overcast daylight through glass, sitting, looking out window` are the dominant factor controlling composition, lighting, and atmosphere.

Even when the entire opening natural-language sentence (`A candid iPhone snapshot of an actress in her everyday life`) was deleted, the image showed no notable change as long as the scene description tags remained.

Basis: Profession Prompt Verification Experiment 2, Group E
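
The comparison described above (delete one element, keep everything else, compare the images) can be prepared with a small helper that produces one prompt per removed tag. This is a sketch under assumptions: the function name is hypothetical, and rendering each variant with a fixed seed is left to whatever generation tool is in use.

```python
def ablation_variants(tags):
    """Return (removed_tag, prompt) pairs, each prompt omitting exactly one tag."""
    variants = []
    for i, removed in enumerate(tags):
        remaining = tags[:i] + tags[i + 1:]
        variants.append((removed, ", ".join(remaining)))
    return variants


scene_tags = [
    "small cafe window seat",
    "natural overcast daylight through glass",
    "sitting",
    "looking out window",
]

for removed, prompt in ablation_variants(scene_tags):
    print(f"without {removed!r}: {prompt}")
```

Comparing each variant against the full prompt at the same seed shows which tags actually change the image.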

2. Leading Style Keywords Determine the Overall Direction of the Image

Placing a style keyword such as `photorealistic` or `anime illustration` at the beginning completely changes the overall direction of the image. The leading subject specification (`portrait` vs `cafe`) also affects how tight or wide the framing is.

Basis: Prompt Basics Experiments 1 and 3
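
One way to keep this ordering consistent is to assemble prompts from separate parts, with style and subject always first. A minimal sketch; `build_prompt` and its parameters are hypothetical names used only for illustration.

```python
def build_prompt(style, subject, scene_tags):
    """Join style, subject, and scene tags in that order, skipping empty parts."""
    parts = [style, subject, *scene_tags]
    return ", ".join(p.strip() for p in parts if p and p.strip())


prompt = build_prompt(
    "anime illustration",
    "portrait",
    ["small cafe window seat", "natural overcast daylight through glass"],
)
print(prompt)
```

Keeping the style and subject as separate arguments makes it easy to swap only the leading keyword between runs.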

3. Lighting Descriptions Have a High Effect

Lighting specifications such as `golden hour warm light through window` or `backlit by moonlight` dramatically change the atmosphere of the image.

Fluorescent white light → warm diagonal golden-hour light (preset-verify-05)
Front lighting → backlit silhouette + rim light (preset-verify-04)

In both cases, the change from one step to the next was large and clearly visible.

Basis: Library Emo Composition, Moonlit Seaside

4. Specific Pose Specifications Also Contribute to Natural Hand Depiction

Specifying a pose that includes hand position, such as `chin resting on hands`, not only reproduces that pose but also makes the fingers look more natural. Conversely, removing the pose specification leaves the hands “not knowing what to do.”

Basis: God Prompt Ablation Study Test 2-C

5. `actress` / `model` Controls the Face Direction

Using `actress` or `model` pushes the face in a more striking, glamorous direction, presumably because of the actress and model headshots in CLIP’s training data. If you don’t need that direction, `a woman` is sufficient.

Basis: Profession Prompt Verification Experiment 1

6. Environmental Description Adds Immediacy

Environmental elements such as `wet pavement reflections` directly contribute to the immediacy of street photography. The difference between the steps with and without pavement reflections was striking.

Basis: Rainy Tokyo Neon Street

Elements With No Confirmed Effect (Can Be Removed to Save Tokens)

The following elements have been confirmed through experiments to produce no notable change in the image in z-image-turbo. They can be deleted to save tokens.

Quality Keywords

Element	Tokens saved	Basis
`coherent anatomy, correct hands and fingers`	7	Coherent Anatomy Verification, God Prompt Ablation
`RAW photo`	2	Prompt Optimization 10 Themes (tested only with other elements present; standalone effect unverified)
`photorealistic`	1	Same. z-image-turbo is photorealistic by default
`natural skin texture`	3	Same

Redundant Modifiers

Element	Tokens saved	Basis
`in her everyday life`	4	Profession Prompt Verification Group D. Redundant with subsequent scene description
Entire opening natural language sentence	5–10	Same article, Group E. Scene description tags are sufficient
Double-specified overlapping meanings	Variable	God Prompt Ablation Tests 1-A, 1-E
Elements implied by a superordinate concept (e.g., `paper lantern warm light` when `summer festival` is present)	4	Same article, Test 1-B

Equipment Keywords

Element	Tokens saved	Basis
Camera model names (`shot on Canon EOS R5`, etc.)	5–6	Bikini Prompt Iterative Improvement
`iPhone` (for candid snapshot feel)	1	Profession Prompt Verification Group B

Note on Emphasis Syntax `(element:weight)`

In z-image-turbo, weight syntax such as `(element:1.4)` has not been observed to strengthen or weaken the targeted attribute. This was checked across 5 categories (expression, composition, lighting, style, and subject attributes) with 3 seeds each, and no visible difference between 1.0 and 1.4 appeared in any case.

However, since the parentheses, colon, and number in the weight syntax change the token sequence, the overall image changes even with a fixed seed. This is a side effect of token sequence change, not the effect of the weight value.

Basis: Prompt Basics Experiment 2, Weight Syntax Category Verification
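
Since the weight value itself showed no effect, an existing prompt can be cleaned up by replacing each `(element:weight)` group with the bare element. The sketch below assumes the simple, non-nested form shown above; the function name is hypothetical. Note that removing the syntax changes the token sequence again, so the image will differ from the weighted version even at the same seed.

```python
import re

# Matches "(some text:1.4)" without nesting; captures the text before the colon.
WEIGHT_PATTERN = re.compile(r"\(([^():]+):\s*\d+(?:\.\d+)?\)")


def strip_weights(prompt):
    """Replace each (element:weight) group with the bare element text."""
    return WEIGHT_PATTERN.sub(lambda m: m.group(1).strip(), prompt)


print(strip_weights("1girl, (gentle smile:1.4), small cafe window seat"))
```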

Practical Token Optimization

CLIP processes a prompt in chunks of up to 77 tokens each (75 usable tokens plus the start and end tokens). The second chunk has weaker influence, so staying within 75 tokens is ideal.
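
To get a feel for where the chunk boundary falls, a prompt can be split into groups of 75 tokens. The sketch below is an approximation under stated assumptions: `rough_tokens` is a crude stand-in that treats each word and each punctuation mark as one token, whereas a real CLIP tokenizer can split a single word into several tokens, so the true count may be higher. The fixed-size split is also a simplification, and actual front ends may place the boundary differently.

```python
import re


def rough_tokens(prompt):
    """Crude stand-in for a tokenizer: words and punctuation marks, one token each."""
    return re.findall(r"[A-Za-z0-9]+|[^\sA-Za-z0-9]", prompt)


def split_into_chunks(tokens, chunk_size=75):
    """Split a token list into consecutive chunks of at most chunk_size tokens."""
    return [tokens[i:i + chunk_size] for i in range(0, len(tokens), chunk_size)]


prompt = "1girl, small cafe window seat, sitting, looking out window"
chunks = split_into_chunks(rough_tokens(prompt))
print(len(chunks), [len(c) for c in chunks])
```

If the estimate is already close to 75, it is worth checking the count with the tokenizer your tool actually uses.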

Optimization Priority

1. Remove unnecessary quality keywords (`coherent anatomy`, `RAW photo`, etc.)
2. Remove redundant modifiers (elements implied by superordinate concepts, double specifications)
3. Compress natural-language sentences into tag sequences (`A candid snapshot of an actress` → `actress`)
4. Remove equipment keywords (camera model names)
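
The mechanical parts of this list (dropping quality keywords, exact duplicate tags, and camera model names) can be sketched as a small filter over a comma-separated prompt. This is an illustration, not a complete optimizer: the function and constant names are hypothetical, the tag list is taken from the tables above, and only tags starting with `shot on` are treated as camera keywords. Compressing sentences and spotting elements implied by a superordinate concept still need human judgment.

```python
import re

# Quality keywords listed above as producing no notable change in z-image-turbo.
NO_EFFECT_TAGS = {
    "coherent anatomy",
    "correct hands and fingers",
    "raw photo",
    "photorealistic",
    "natural skin texture",
}

# Assumption: camera keywords appear as tags beginning with "shot on".
CAMERA_PATTERN = re.compile(r"^shot on\b", re.IGNORECASE)


def optimize_tags(prompt):
    """Drop no-effect tags, camera tags, and exact duplicates; keep the original order."""
    kept = []
    seen = set()
    for raw in prompt.split(","):
        tag = raw.strip()
        key = tag.lower()
        if not tag or key in NO_EFFECT_TAGS or key in seen:
            continue
        if CAMERA_PATTERN.match(tag):
            continue
        seen.add(key)
        kept.append(tag)
    return ", ".join(kept)


print(optimize_tags(
    "RAW photo, 1girl, small cafe window seat, shot on Canon EOS R5, "
    "coherent anatomy, correct hands and fingers, sitting"
))
```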

Example: Optimizing a Cafe Snapshot

Before optimization (35 words):

A candid iPhone snapshot of an actress in her everyday life. 1girl, 22yo japanese woman, small cafe window seat, natural overcast daylight through glass, beige oversized knit sweater, sitting, looking out window, gentle natural expression.

After optimization (24 words):

1girl, 22yo japanese actress, small cafe window seat, natural overcast daylight through glass, beige oversized knit sweater, sitting, looking out window, gentle natural expression.

Deleted elements: `A candid iPhone snapshot of` and `in her everyday life`, both shown experimentally to have no notable effect. The remaining word `actress` replaces `woman` in the tag sequence.

Verification Article Index

A list of verification articles that form the basis of this article.

Article	Verification target
Prompt Basics	Word order, emphasis syntax, style keywords
CLIP Chunk Split Verification	75-token boundary, priority of conflicting instructions
Coherent Anatomy Verification	Effect of hand/finger quality keywords
Profession Prompt Verification	Profession words, per-element effect of opening sentences
God Prompt Ablation Study	Per-element necessity of 3 “god prompts”
Prompt Optimization 10 Themes	Quality keywords, glamour expressions
Bikini Prompt Iterative Improvement	Camera model names, effect of incremental element addition
Seed Variation Baseline	Range of seed variation with identical prompts
Tag Sequence vs Natural Language	Output differences by prompt format
Attribute Leak Verification	Effect of color-object separation/adjacency

Summary

Principles for writing prompts in z-image-turbo:

Focus on scene description tags: composition, environment, pose, and lighting are the primary drivers of the image
Put style and subject first: word order affects composition
Quality/equipment keywords can be omitted: z-image-turbo is photorealistic by default
Stay within 75 tokens: the second chunk has weaker influence
Avoid redundant modifiers: elements implied by superordinate concepts are unnecessary
Keep color and object adjacent: write the color directly before the object, as in `red dress`. Separating them risks the color being dropped
The gap between tag sequences and natural language is small: there is no significant difference in how well major attributes are reproduced, so choose based on preference
Specified attributes are stable; unspecified attributes are randomized: explicitly include in the prompt every element you want to control
