[Verified] Image Generation Prompt Best Practices

This article aggregates results from the blog’s individual verification articles, where images were actually compared and examined. Only experimentally substantiated findings are presented here — not “commonly cited techniques.”

Target Model

The knowledge in this article was verified in the following environment. Results may not necessarily apply to other models or parameters.

| Item | Value |
| --- | --- |
| Model | z-image-turbo (6B parameters, photorealistic distilled model) |
| Inference steps | 8 |
| Sampler | euler |
| Scheduler | ddim_uniform |
| CFG | 1.0 (guidance built into the model) |
| Image size | 1024×1024 |
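
To keep every comparison run on identical settings, the environment above can be written down once as a plain mapping. This is only a sketch: the key names are hypothetical and are not tied to any particular tool's API; only the values come from the table.

```python
# Hypothetical key names; the values are the ones listed in the table above.
VERIFICATION_SETTINGS = {
    "model": "z-image-turbo",
    "steps": 8,
    "sampler": "euler",
    "scheduler": "ddim_uniform",
    "cfg": 1.0,
    "width": 1024,
    "height": 1024,
}


def describe(settings: dict) -> str:
    """Return a one-line summary to log next to each generated image."""
    return ", ".join(f"{key}={value}" for key, value in settings.items())


print(describe(VERIFICATION_SETTINGS))
```

Logging this line with each image makes it easier to spot a comparison that accidentally used a different sampler or step count.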

Verified Effective Elements

1. Scene Description Tags Are the Primary Driver of the Image

Specific scene description tags like “small cafe window seat, natural overcast daylight through glass, sitting, looking out window” are the dominant factor controlling composition, lighting, and atmosphere.

Even when the entire opening natural language sentence (“A candid iPhone snapshot of an actress in her everyday life”) was deleted wholesale, the image showed no notable change as long as the scene description tags remained.

Basis: Profession Prompt Verification Experiment 2, Group E
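
A comparison of this kind can be set up by generating one prompt variant per removed element and rendering each with the same seed. The helper below is a minimal sketch of that idea, not the harness used in the verification articles; the function name is hypothetical.

```python
def ablation_variants(tags: list[str]) -> list[tuple[str, str]]:
    """Return (removed_tag, prompt_without_that_tag) pairs, one per tag."""
    variants = []
    for i, removed in enumerate(tags):
        remaining = tags[:i] + tags[i + 1:]
        variants.append((removed, ", ".join(remaining)))
    return variants


tags = [
    "A candid iPhone snapshot of an actress in her everyday life",
    "small cafe window seat",
    "natural overcast daylight through glass",
    "sitting",
    "looking out window",
]
for removed, prompt in ablation_variants(tags):
    print(f"without [{removed}]: {prompt}")
```

Each variant would then be rendered with a fixed seed and compared against the full prompt. An element whose removal leaves the image essentially unchanged is a candidate for deletion.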

2. Leading Style Keywords Determine the Overall Direction of the Image

Placing style keywords like “photorealistic” or “anime illustration” at the beginning completely changes the overall direction of the image. The leading subject specification (“portrait” vs “cafe”) also affects how close or wide the composition is.

Basis: Prompt Basics Experiments 1 and 3

3. Lighting Descriptions Have a High Effect

Lighting specifications like “golden hour warm light through window” or “backlit by moonlight” dramatically change the atmosphere of the image.

  • Fluorescent white light → warm diagonal golden-hour light (preset-verify-05)
  • Front lighting → backlit silhouette + rim light (preset-verify-04)

In both cases, the difference between the images before and after the change was very large, and the effect was clear.

Basis: Library Emo Composition, Moonlit Seaside

4. Specific Pose Specifications Also Contribute to Natural Hand Depiction

Specifying a pose that includes hand position, like “chin resting on hands”, not only reproduces that pose but also makes finger depiction more natural. Conversely, removing the pose specification leaves the hands looking as if they do not know what to do.

Basis: God Prompt Ablation Study Test 2-C

5. actress / model Controls the Face Direction

Using “actress” or “model” pushes the face in a more striking, glamorous direction due to the influence of actress and model headshots in CLIP’s training data. If you don’t need a specific direction, “a woman” is sufficient.

Basis: Profession Prompt Verification Experiment 1

6. Environmental Description Adds Immediacy

Environmental elements like “wet pavement reflections” directly contribute to the immediacy of street photography. The difference between the images with and without pavement reflections was striking.

Basis: Rainy Tokyo Neon Street

Rejected Elements (Can Be Removed to Save Tokens)

The following elements have been confirmed through experiments to produce no notable change in the image in z-image-turbo. They can be deleted to save tokens.

Quality Keywords

| Element | Tokens saved | Basis |
| --- | --- | --- |
| coherent anatomy, correct hands and fingers | 7 | Coherent Anatomy Verification, God Prompt Ablation |
| RAW photo | 2 | Prompt Optimization 10 Themes (Note: when other elements are present; standalone effect unverified) |
| photorealistic | 1 | Prompt Optimization 10 Themes. z-image-turbo is photorealistic by default |
| natural skin texture | 3 | Prompt Optimization 10 Themes |

Redundant Modifiers

| Element | Tokens saved | Basis |
| --- | --- | --- |
| in her everyday life | 4 | Profession Prompt Verification Group D. Redundant with subsequent scene description |
| Entire opening natural language sentence | 5–10 | Profession Prompt Verification Group E. Scene description tags are sufficient |
| Double-specified overlapping meanings | Variable | God Prompt Ablation Tests 1-A, 1-E |
| Elements implied by a superordinate concept (e.g., paper lantern warm light when summer festival is present) | 4 | God Prompt Ablation Test 1-B |

Equipment Keywords

| Element | Tokens saved | Basis |
| --- | --- | --- |
| Camera model names (shot on Canon EOS R5, etc.) | 5–6 | Bikini Prompt Iterative Improvement |
| iPhone (for candid snapshot feel) | 1 | Profession Prompt Verification Group B |

Note on Emphasis Syntax (element:weight)

In z-image-turbo, no change in attribute strength/weakness from weight syntax like (element:1.4) has been confirmed. Verified across 5 categories × 3 seeds — expression, composition, lighting, style, and subject attributes — with no visible difference between 1.0 and 1.4 in any case.

However, since the parentheses, colon, and number in the weight syntax change the token sequence, the overall image changes even with a fixed seed. This is a side effect of token sequence change, not the effect of the weight value.

Basis: Prompt Basics Experiment 2, Weight Syntax Category Verification
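
Since the weight value itself showed no effect, one practical option is to strip the syntax from existing prompts before reuse. The following is a minimal sketch under the assumption that weights are written as (text:number) without nested parentheses; the function name is hypothetical.

```python
import re

# Matches "(text:1.4)" style emphasis; nested parentheses are not handled.
WEIGHT_PATTERN = re.compile(r"\(([^():]+):\s*\d+(?:\.\d+)?\)")


def strip_weights(prompt: str) -> str:
    """Replace every (element:weight) group with the bare element text."""
    return WEIGHT_PATTERN.sub(lambda m: m.group(1).strip(), prompt)


print(strip_weights("1girl, (gentle smile:1.4), (backlit by moonlight:1.2), beach"))
# prints: 1girl, gentle smile, backlit by moonlight, beach
```

Because removing the parentheses and numbers changes the token sequence, the stripped prompt should be expected to produce a different image from the weighted one even with the same seed.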

Practical Token Optimization

CLIP processes a prompt in chunks of up to 77 tokens each (75 usable tokens plus the start and end tokens). The second chunk has weaker influence, so keeping the prompt within 75 tokens is ideal.
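
A rough budget check can flag prompts that are likely to spill into a second chunk. The sketch below is an approximation only: it counts one token per word or punctuation mark, whereas a real tokenizer uses byte-pair encoding and may split a word into several tokens, so the estimate is a lower bound. The greedy tag-level packing is also an assumption, not a description of how any specific UI splits chunks.

```python
import re

CHUNK_LIMIT = 75  # usable tokens per chunk, per the text above


def approx_token_count(text: str) -> int:
    """Rough estimate: one token per word or punctuation mark (lower bound)."""
    return len(re.findall(r"\w+|[^\w\s]", text))


def split_into_chunks(tags: list[str], limit: int = CHUNK_LIMIT) -> list[list[str]]:
    """Greedily pack whole tags into chunks of at most `limit` estimated tokens."""
    chunks, current, used = [], [], 0
    for tag in tags:
        cost = approx_token_count(tag) + 1  # +1 for the separating comma
        if current and used + cost > limit:
            chunks.append(current)
            current, used = [], 0
        current.append(tag)
        used += cost
    if current:
        chunks.append(current)
    return chunks


prompt = "1girl, 22yo japanese actress, small cafe window seat, sitting, looking out window"
tags = [t.strip() for t in prompt.split(",")]
print(len(split_into_chunks(tags)), "chunk(s)")
```

For an exact count, the estimate function would need to be replaced with the tokenizer that the generation pipeline actually uses.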

Optimization Priority

  1. First, remove unnecessary quality keywords (coherent anatomy, RAW photo, etc.)
  2. Remove redundant modifiers (elements implied by superordinate concepts, double specifications)
  3. Compress natural language sentences into tag sequences (A candid snapshot of an actress → actress)
  4. Remove equipment keywords (camera model names)
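
The mechanical parts of this priority list (steps 1 and 4, plus exact duplicates from step 2) can be sketched as a simple filter. Deciding what is implied by a superordinate concept and compressing sentences into tags still needs human judgment. The names below are hypothetical, and the keyword set only contains the elements listed in the tables above.

```python
# Keywords this article lists as producing no notable change in z-image-turbo.
NO_EFFECT_TAGS = {
    "coherent anatomy",
    "correct hands and fingers",
    "raw photo",
    "photorealistic",
    "natural skin texture",
}
EQUIPMENT_PREFIXES = ("shot on ",)  # e.g. "shot on Canon EOS R5"


def prune_tags(tags: list[str]) -> list[str]:
    """Drop no-effect quality tags, equipment tags, and exact duplicates."""
    kept, seen = [], set()
    for tag in tags:
        key = tag.strip().lower()
        if not key or key in NO_EFFECT_TAGS or key.startswith(EQUIPMENT_PREFIXES):
            continue
        if key in seen:
            continue
        seen.add(key)
        kept.append(tag.strip())
    return kept


# Illustrative input prompt, not taken from the verification articles.
prompt = ("RAW photo, photorealistic, 1girl, summer festival, yukata, "
          "coherent anatomy, shot on Canon EOS R5, yukata")
print(", ".join(prune_tags(prompt.split(","))))
# prints: 1girl, summer festival, yukata
```

The filter is order-preserving, so the leading subject and scene tags stay where they were.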

Example: Optimizing a Cafe Snapshot

Before optimization (35 words):

A candid iPhone snapshot of an actress in her everyday life. 1girl, 22yo japanese woman, small cafe window seat, natural overcast daylight through glass, beige oversized knit sweater, sitting, looking out window, gentle natural expression.

After optimization (24 words):

1girl, 22yo japanese actress, small cafe window seat, natural overcast daylight through glass, beige oversized knit sweater, sitting, looking out window, gentle natural expression.

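Deleted elements: “A candid iPhone snapshot of” and “in her everyday life”, both of which produced no notable change in the experiments. The profession word “actress” is kept by substituting it for “woman” in the tag sequence.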

Verification Article Index

A list of verification articles that form the basis of this article.

| Article | Verification target |
| --- | --- |
| Prompt Basics | Word order, emphasis syntax, style keywords |
| CLIP Chunk Split Verification | 75-token boundary, priority of conflicting instructions |
| Coherent Anatomy Verification | Effect of hand/finger quality keywords |
| Profession Prompt Verification | Profession words, per-element effect of opening sentences |
| God Prompt Ablation Study | Per-element necessity of 3 “god prompts” |
| Prompt Optimization 10 Themes | Quality keywords, glamour expressions |
| Bikini Prompt Iterative Improvement | Camera model names, effect of incremental element addition |
| Seed Variation Baseline | Range of seed variation with identical prompts |
| Tag Sequence vs Natural Language | Output differences by prompt format |
| Attribute Leak Verification | Effect of color-object separation/adjacency |

Summary

Principles for writing prompts in z-image-turbo:

  1. Focus on scene description tags: composition, environment, pose, and lighting are the primary drivers of the image
  2. Put style and subject first: word order affects composition
  3. Quality/equipment keywords can be omitted: z-image-turbo is photorealistic by default
  4. Aim for within 75 tokens: the second chunk has weaker influence
  5. Avoid redundant modifiers: elements implied by superordinate concepts are unnecessary
  6. Keep color and object adjacent: write the color directly before its object, as in “red dress”. Separating them risks the color disappearing
  7. The gap between tag sequences and natural language is small: there is no significant difference in major attribute reproduction, so choose based on preference
  8. Specified attributes are stable; unspecified attributes are randomized: explicitly include in the prompt every element you want to control
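
Several of these principles (style and subject first, color kept next to its object, every controlled attribute stated explicitly) can be enforced by assembling prompts from structured fields rather than free text. The class and function below are a hypothetical sketch of that approach, not part of any existing tool.

```python
from dataclasses import dataclass, field


@dataclass
class PromptSpec:
    """Hypothetical container for the attributes you want to control."""
    subject: str
    style: str = ""  # optional leading style keyword
    scene: list[str] = field(default_factory=list)
    lighting: str = ""
    clothing: list[tuple[str, str]] = field(default_factory=list)  # (color, item)
    pose: str = ""


def build_prompt(spec: PromptSpec) -> str:
    """Join tags with style and subject first and each color next to its item."""
    tags = [spec.style, spec.subject, *spec.scene, spec.lighting]
    tags += [f"{color} {item}" for color, item in spec.clothing]
    tags.append(spec.pose)
    return ", ".join(tag for tag in tags if tag)


spec = PromptSpec(
    subject="1girl, 22yo japanese actress",
    scene=["small cafe window seat"],
    lighting="natural overcast daylight through glass",
    clothing=[("beige", "oversized knit sweater")],
    pose="sitting, looking out window",
)
print(build_prompt(spec))
```

Fields left empty are simply omitted, which makes it visible at a glance which attributes are being left to seed variation.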