What Does CLIP's '75-Token Chunk Split' Actually Mean? Does the 2nd Chunk Really Get Weaker?

Conclusions

What Happens with Chunk Splitting

The 2nd chunk is not “ignored” but “weakened”: Experiment 1 confirmed that wisteria, bridges, lanterns, and other 2nd-chunk elements were partially reflected
2nd chunk elements are unstable: different elements appear with each generation. Put anything you need reliably reflected in the 1st chunk
With contradictory instructions, the one at the front wins overwhelmingly: in the red vs. black hair experiment, the color placed first was dominant and the color at the end was nearly ignored. The later color is occasionally reflected in part, but not reliably

Practical Guidelines

Position	Use
1st chunk (tokens 1–75)	All elements you want definitely reflected. Subject, location, outfit, pose, lighting, style
2nd chunk (token 76+)	Supplementary elements that are nice to have but not essential. Quality keywords, background details, etc.
Don’t include	Ineffective elements like `coherent anatomy`, contradictory instructions

Recommendation: Keep Important Elements Within 75 Tokens

The “god prompts” for the summer festival polaroid (48 tokens) and the café snapshot (42 tokens) are stable because every element fits in one chunk. Images are still generated when a prompt exceeds 75 tokens, but the overflow elements are reflected less reliably.
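To check which elements of a prompt would land in the first chunk, something like the following sketch may help. It is only an approximation: `rough_token_count` is a crude stand-in that counts words and punctuation marks, whereas the real CLIP tokenizer uses BPE and may split rare words into several tokens. `partition_elements` is a hypothetical helper name, and a counter built on the actual tokenizer can be passed in as `count_tokens`.

```python
import re


def rough_token_count(text):
    """Crude stand-in for the CLIP tokenizer: counts words and punctuation marks."""
    return len(re.findall(r"\w+|[^\w\s]", text))


def partition_elements(prompt, budget=75, count_tokens=rough_token_count):
    """Return (elements that fit within the budget, elements that overflow)."""
    elements = [e.strip() for e in prompt.split(",") if e.strip()]
    first, overflow, used = [], [], 0
    for element in elements:
        # +1 for the separating comma (slightly conservative for the last element)
        cost = count_tokens(element) + 1
        if not overflow and used + cost <= budget:
            first.append(element)
            used += cost
        else:
            # Once one element overflows, everything after it overflows too
            overflow.append(element)
    return first, overflow


# Tiny budget so the split is easy to follow by hand
first, overflow = partition_elements(
    "a woman, portrait, soft smile, wooden bridge in background", budget=8
)
print(first)
print(overflow)
```

With the default budget of 75, anything returned in `overflow` is a candidate for trimming or for moving behind the essential elements.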

Difference from LLMs

This constraint is specific to CLIP’s architecture. LLMs like ChatGPT and Claude handle context windows of 128K to 1M tokens in a single pass, with no chunk splitting. Prompts feel short in Stable Diffusion because of CLIP’s 75-token limit.

Flux.1 features a T5 text encoder (512-token capable) in addition to CLIP, improving handling of longer prompts.

In Prompt Basics, I wrote “exceeding 75 tokens splits into the next chunk, which has weaker influence.” However, that article did not sufficiently explain what a chunk is, why the next chunk gets weaker, or whether a weaker chunk still matters.

This article explains the mechanism and verifies it experimentally.

What Is a Chunk?

CLIP’s text encoder accepts at most 77 tokens per input. Two of those are the BOS and EOS tokens, so 75 remain for the prompt. The limit was fixed when CLIP was trained and comes from the model’s architecture: the Transformer’s positional encoding covers only 77 positions.

When a prompt exceeding 75 tokens is input, Stable Diffusion implementations (ComfyUI, A1111, etc.) process it as follows:

Prompt: [A, B, C, D, E, F, G, ...] (say 100 tokens)

Chunk 1: [BOS, A, B, C, ... token 75, EOS]  ← Input to CLIP independently
Chunk 2: [BOS, token 76, ... token 100, padding..., EOS]  ← Input to CLIP independently

→ The two CLIP outputs are concatenated and passed to the U-Net/DiT

In other words, a chunk is a 75-token block that CLIP processes in one pass. Anything beyond 75 tokens forms a 2nd chunk, which is input to CLIP separately.
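The splitting step can be sketched in plain Python. This is a simplified illustration, not the actual ComfyUI or A1111 code: `split_into_chunks` is a hypothetical helper, the prompt token ids are placeholder integers, 49406 and 49407 are the BOS/EOS ids of the standard CLIP vocabulary, and padding with the EOS id is an assumption about one common convention.

```python
BOS_ID = 49406   # start-of-text id in the standard CLIP vocabulary
EOS_ID = 49407   # end-of-text id; also used as padding here (an assumption)
CHUNK_SIZE = 75  # prompt tokens per chunk; 77 once BOS and EOS are added


def split_into_chunks(token_ids, chunk_size=CHUNK_SIZE):
    """Split prompt token ids into blocks of chunk_size + 2 tokens each."""
    chunks = []
    for start in range(0, len(token_ids), chunk_size):
        body = token_ids[start:start + chunk_size]
        padding = [EOS_ID] * (chunk_size - len(body))  # fill a short last chunk
        chunks.append([BOS_ID] + body + padding + [EOS_ID])
    return chunks


prompt_ids = list(range(1, 101))  # stand-in for a 100-token prompt
chunks = split_into_chunks(prompt_ids)
print(len(chunks), [len(c) for c in chunks])  # 2 [77, 77]
```

Each 77-token block would then be encoded by CLIP on its own, and the per-chunk outputs joined before they reach the U-Net/DiT.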

Why Is the 2nd Chunk “Weaker”?

The following is an estimated mechanism and has not been directly confirmed by experiment.

Reason 1: Positional disadvantage in Cross-Attention

The LDM’s U-Net (or a DiT Transformer) reads the CLIP output through Cross-Attention. The hypothesis is that chunk 2, which sits after chunk 1 in the concatenated sequence, receives relatively less attention than chunk 1.

Reason 2: Context break between chunks

CLIP processes each chunk independently. This means the context of chunk 1 (“a woman in a flower garden”) is not carried over to chunk 2 (“there’s a bridge, there are butterflies”). Chunk 2 elements are interpreted without context, so consistency with chunk 1’s subject and scene is not guaranteed.
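The context break can be expressed as a simple rule: two prompt tokens see each other inside CLIP only if they fall in the same 75-token block. A minimal sketch, assuming 1-based prompt token positions and the plain fixed-size split described above (`chunk_index` and `share_context` are hypothetical helper names):

```python
CHUNK_SIZE = 75  # prompt tokens per CLIP pass


def chunk_index(position, chunk_size=CHUNK_SIZE):
    """Zero-based chunk number for a 1-based prompt token position."""
    return (position - 1) // chunk_size


def share_context(pos_a, pos_b, chunk_size=CHUNK_SIZE):
    """True if both tokens are encoded in the same CLIP pass."""
    return chunk_index(pos_a, chunk_size) == chunk_index(pos_b, chunk_size)


print(share_context(10, 75))  # True: both are in chunk 1
print(share_context(75, 76))  # False: neighbours on opposite sides of the split
```

Tokens 75 and 76 are adjacent in the prompt, yet CLIP never encodes them together, which is why a phrase that straddles the boundary can lose its meaning.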

Reason 3: Overall composition decided in early diffusion steps

The overall composition, color tone, and main subjects are largely decided in the early diffusion steps. If chunk 1 is referenced most strongly at that stage, the overall structure is already set by the time chunk 2 has an effect, so chunk 2 only adds details.

Is This CLIP-Specific? Differences from LLMs

CLIP’s chunk splitting is fundamentally different from LLM (e.g., ChatGPT) context limits.

	CLIP chunk splitting	LLM context window
Split mechanism	Splits the prompt into separate 75-token blocks	Processes the entire window at once within the limit
Context continuity	None (each chunk is independent)	Yes (all tokens within the limit cross-reference each other)
Handling of overflow	Processed separately as 2nd chunk	Truncated or error
Position influence	Front is strongest; later tokens tend to be weaker	Generally uniform (though recency bias exists)

LLMs can process long contexts such as 128K tokens in one pass, whereas CLIP has an extremely short window of only 75 tokens, and any overflow is processed separately, without the context of the preceding chunk.

Experiment: Are 2nd Chunk Elements Reflected?

Experiment 1: Effect by Prompt Length

Comparing short, medium, and long prompts on the same theme (woman in a flower garden).

Short (~15 tokens, 1/5 of one chunk)

a 20yo japanese woman, portrait, soft smile, natural light, 85mm lens

Result 1	Result 2	Result 3

Result: A simple portrait. The background is a wall or a street, and the clothing is a T-shirt or a one-piece dress. Since the prompt mentions none of them, no flower garden, white dress, or cherry blossoms appear.

Medium (~75 tokens, exactly 1 chunk)

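a 20yo japanese woman, portrait, soft smile, natural light, 85mm lens, standing in a flower garden, wearing a white summer dress, long black hair, cherry blossoms falling, warm afternoon sunlight, shallow depth of field, gentle breeze blowing hair, looking at camera, delicate gold necklace, professional photography, photorealistic, detailed skin texture, magazine quality, elegant pose, spring atmosphere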

Result 1	Result 2

Result: The flower garden, white dress, cherry blossoms, and necklace are all reflected. Because the prompt fits in one chunk, every element takes effect.

Long (~150 tokens, 2 chunks)

Adding to the medium prompt: birds flying in the sky, distant mountains with snow caps, a small stream flowing nearby, wooden bridge in background, moss covered stone lantern, wisteria hanging from pergola, butterflies around flowers, dappled sunlight through leaves, mist in the valley below

a 20yo japanese woman, portrait, soft smile, natural light, 85mm lens, standing in a flower garden, wearing a white summer dress, long black hair, cherry blossoms falling, warm afternoon sunlight, shallow depth of field, gentle breeze blowing hair, looking at camera, delicate gold necklace, professional photography, photorealistic, detailed skin texture, magazine quality, elegant pose, spring atmosphere, birds flying in the sky, distant mountains with snow caps, a small stream flowing nearby, wooden bridge in background, moss covered stone lantern, wisteria hanging from pergola, butterflies around flowers, dappled sunlight through leaves, mist in the valley below

Result 1	Result 2	Result 3

Result: Elements placed in the 2nd chunk are partially reflected. Wisteria, wooden bridge, stone lantern, birds, butterflies, and mountains appear in the frame. However, not all elements appear every time, and different elements appear per image.

Experiment 1 Summary

Length	Chunks	1st chunk elements	2nd chunk elements
Short (~15 tokens)	1	Fully reflected	—
Medium (~75 tokens)	1	Fully reflected	—
Long (~150 tokens)	2	Fully reflected	Partially reflected

The 2nd chunk is not “ignored” — it is “partially reflected.” However, stability is low and which elements appear varies per generation.

Experiment 2: What Happens with Contradictory Instructions by Position?

Testing which instruction wins when the contradictory `red hair` and `long black hair` are both present, with `red hair` placed at the front vs. at the end.

Pattern A: `red hair` at front

red hair, a 20yo japanese woman, portrait, soft smile, natural light, 85mm lens, standing in a flower garden, wearing a white summer dress, long black hair, cherry blossoms falling

Result 1	Result 2	Result 3

Result: 2 out of 3 images show red hair winning (chunk-09, chunk-10). The remaining 1 (chunk-11) has red at the roots and black at the tips, mixing both instructions. Overall the front red hair is dominant, but the rear long black hair is not completely ignored.

Note: Contradictory instructions produced a gradient with red at the roots and black at the tips. This was an unintended color combination, but it holds up visually.

Pattern B: `red hair` at end

a 20yo japanese woman, portrait, soft smile, natural light, 85mm lens, standing in a flower garden, wearing a white summer dress, long black hair, cherry blossoms falling, red hair

Result 1	Result 2	Result 3

Result: 1 out of 3 (chunk-12) showed a pink-to-red gradient at the tips, but the other 2 (chunk-13, chunk-14) were nearly pure black. `red hair` at the end tends to be nearly ignored, and the earlier `long black hair` is overwhelmingly dominant.

Experiment 2 Summary

Position	Result	Interpretation
`red hair` at front	Red hair dominant (red wins in 2/3, 1 has red-black mix)	Front element is strongest (as per word order rule)
`red hair` at end	Black hair nearly wins (pure black in 2/3, only 1 has red at tips)	End instructions tend to be nearly ignored. Even when reflected, only partially
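The two patterns differ only in where the conflicting element is placed, so both prompts can be generated from one base list. A small sketch for reproducing the comparison; `BASE_ELEMENTS` and `build_prompt` are hypothetical names, and the element strings are taken from the prompts shown above:

```python
BASE_ELEMENTS = [
    "a 20yo japanese woman", "portrait", "soft smile", "natural light",
    "85mm lens", "standing in a flower garden",
    "wearing a white summer dress", "long black hair",
    "cherry blossoms falling",
]


def build_prompt(base, extra, position):
    """Place `extra` at the front or the end of a comma-separated prompt."""
    if position == "front":
        elements = [extra] + list(base)
    elif position == "end":
        elements = list(base) + [extra]
    else:
        raise ValueError("position must be 'front' or 'end'")
    return ", ".join(elements)


pattern_a = build_prompt(BASE_ELEMENTS, "red hair", "front")  # Pattern A
pattern_b = build_prompt(BASE_ELEMENTS, "red hair", "end")    # Pattern B
print(pattern_a)
print(pattern_b)
```

Keeping the base list fixed means any difference between the two sets of images can be attributed to position rather than to wording.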

Note: Both patterns fit within a single chunk, so this test measures word order inside a chunk rather than chunk splitting itself. It is consistent with the principle of “placing important elements within the first 75 tokens”: put critical elements early in the first chunk and treat the second chunk as supplementary.

Related Articles

The Rules of AI Image Generation Prompts | Word Order, Emphasis Syntax, and Negative Prompt Basics

[Paper Summary] CLIP | The AI Foundation Linking Text and Images

[Paper Summary] Latent Diffusion Models (Stable Diffusion) | The Core Technology of AI Image Generation

3 "God Prompts" That Never Miss | Ablation-Verified Minimal Versions