Conclusions
What Happens with Chunk Splitting
- The 2nd chunk is not “ignored” but “weakened”: Experiment 1 confirmed that wisteria, bridges, lanterns, and other 2nd-chunk elements were partially reflected.
- 2nd chunk elements are unstable: different elements appear with each generation. Put anything you need in the 1st chunk so it is reflected reliably.
- With contradictory instructions, the one at the front wins overwhelmingly: in the red vs. black hair experiment, the color placed first was dominant and the color at the end was nearly ignored. The trailing color still showed up partially in some images, just not reliably.
Practical Guidelines
| Position | Use |
|---|---|
| 1st chunk (tokens 1–75) | All elements you want definitely reflected. Subject, location, outfit, pose, lighting, style |
| 2nd chunk (token 76+) | Supplementary elements that are nice to have but not essential. Quality keywords, background details, etc. |
| Don’t include | Elements that have no effect (e.g., coherent anatomy) and contradictory instructions |
Recommendation: Keep Important Elements Within 75 Tokens
The god prompts summer festival polaroid (48 tokens) and café snapshot (42 tokens) are stable because all of their elements fit in one chunk. Images are still generated when a prompt exceeds 75 tokens, but the overflow elements are reflected less reliably.
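As a rough way to check this budget before generating, the sketch below assigns each comma-separated phrase of a prompt to a chunk number. The helper names (`count_tokens_whitespace`, `assign_chunks`) are hypothetical, and the whitespace word count is only a stand-in for CLIP's real BPE tokenizer, so the actual token counts will differ. Swap in a real tokenizer's counting function for accurate numbers.

```python
from typing import Callable, List, Tuple

CHUNK_TOKENS = 75  # usable tokens per CLIP chunk


def count_tokens_whitespace(text: str) -> int:
    """Hypothetical stand-in tokenizer: counts whitespace-separated words.

    Assumption: real CLIP BPE token counts are usually higher than this.
    """
    return len(text.split())


def assign_chunks(
    prompt: str,
    count_tokens: Callable[[str], int] = count_tokens_whitespace,
) -> List[Tuple[str, int]]:
    """Return (phrase, chunk_number) for each comma-separated phrase.

    A phrase is assigned to the chunk that contains its last token, so a
    phrase straddling the 75-token boundary counts as overflow.
    """
    placements = []
    used = 0  # tokens consumed so far, including comma tokens
    for phrase in (p.strip() for p in prompt.split(",")):
        if not phrase:
            continue
        n = count_tokens(phrase)
        last_index = used + n - 1
        placements.append((phrase, last_index // CHUNK_TOKENS + 1))
        used += n + 1  # +1 approximates the comma, which is its own token
    return placements


if __name__ == "__main__":
    prompt = "a woman in a flower garden, white dress, cherry blossoms, necklace"
    for phrase, chunk in assign_chunks(prompt):
        print(f"chunk {chunk}: {phrase}")
```

Any phrase reported with a chunk number of 2 or higher is a candidate for moving forward or dropping.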
Difference from LLMs
This constraint is specific to CLIP’s architecture. LLMs like ChatGPT and Claude process 128K to 1M tokens at once with no chunk splitting. The reason prompts feel short in Stable Diffusion is CLIP’s 75-token limit.
Flux.1 features a T5 text encoder (512-token capable) in addition to CLIP, improving handling of longer prompts.
In Prompt Basics, I wrote “exceeding 75 tokens splits into the next chunk, which has weaker influence.” However, I did not sufficiently explain what a chunk is, why it gets weaker, or whether the weaker chunk still matters.
This article explains the mechanism and verifies it experimentally.
What Is a Chunk?
CLIP’s text encoder is designed to process at most 77 tokens of input text (including the BOS/EOS tokens, so effectively 75 tokens). This limit was fixed when CLIP was trained and comes from the model’s architecture: the text Transformer’s positional embeddings cover only 77 positions.
When a prompt exceeding 75 tokens is input, Stable Diffusion implementations (ComfyUI, A1111, etc.) process it as follows:
Prompt: [A, B, C, D, E, F, G, ...] (say 100 tokens)
Chunk 1: [BOS, A, B, C, ... token 75, EOS] ← Input to CLIP independently
Chunk 2: [BOS, token 76, ... token 100, padding..., EOS] ← Input to CLIP independently
→ Two output vectors combined and passed to U-Net/DiT
In other words, a chunk = a 75-token block that CLIP processes in one pass. Anything beyond 75 tokens forms a 2nd chunk, input to CLIP separately.
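The splitting described above can be sketched in a few lines of Python. This is a simplified illustration, not the actual ComfyUI or A1111 code: the function name `split_into_chunks` is hypothetical, padding with the EOS ID is an assumption about one common convention, and real implementations also handle details such as emphasis syntax that are ignored here. The BOS/EOS IDs are those of CLIP's standard vocabulary.

```python
from typing import List

BOS_ID = 49406  # <|startoftext|> in CLIP's vocabulary
EOS_ID = 49407  # <|endoftext|> in CLIP's vocabulary
CHUNK_TOKENS = 75  # usable tokens per chunk (77 minus BOS and EOS)


def split_into_chunks(token_ids: List[int]) -> List[List[int]]:
    """Split prompt token IDs into 77-token inputs for CLIP.

    Each chunk is wrapped as [BOS, up to 75 tokens, padding, EOS].
    Assumption: padding reuses the EOS ID.
    """
    chunks = []
    # max(..., 1) makes an empty prompt still produce one (all-padding) chunk
    for start in range(0, max(len(token_ids), 1), CHUNK_TOKENS):
        body = token_ids[start:start + CHUNK_TOKENS]
        padding = [EOS_ID] * (CHUNK_TOKENS - len(body))
        chunks.append([BOS_ID] + body + padding + [EOS_ID])
    return chunks


prompt_ids = list(range(1, 101))  # 100 placeholder token IDs, not real words
chunks = split_into_chunks(prompt_ids)
print(len(chunks), [len(c) for c in chunks])  # 2 [77, 77]
```

With 100 tokens, the first chunk carries tokens 1 to 75 and the second carries tokens 76 to 100 followed by 50 padding positions.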
Why Is the 2nd Chunk “Weaker”?
The following is an estimated mechanism and has not been directly confirmed by experiment.
Reason 1: Positional disadvantage in Cross-Attention
LDM’s U-Net (or a DiT Transformer) reads the CLIP output through Cross-Attention. After concatenation, chunk 1 sits at the front of the conditioning sequence and chunk 2 sits behind it, so chunk 2 is assumed to receive relatively less attention.
Reason 2: Context break between chunks
CLIP processes each chunk independently. This means the context of chunk 1 (“a woman in a flower garden”) is not carried over to chunk 2 (“there’s a bridge, there are butterflies”). Chunk 2 elements are interpreted without context, so consistency with chunk 1’s subject and scene is not guaranteed.
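The independence of chunks can be illustrated with a toy encoder. Everything here is an assumption for illustration: `toy_encode` is not CLIP, it only mimics the property that tokens influence each other inside a chunk (via a shared mean) and never across chunks. The point is that chunk 2's output stays identical no matter what chunk 1 contains.

```python
from typing import List

Vector = List[float]


def toy_encode(chunk: List[int]) -> List[Vector]:
    """Toy stand-in for a text encoder (not CLIP).

    Every output row mixes in the mean of the whole chunk, so tokens
    'see' each other, but only within the same chunk.
    """
    mean = sum(chunk) / len(chunk)
    return [[float(tok), float(pos), mean] for pos, tok in enumerate(chunk)]


def encode_prompt(chunks: List[List[int]]) -> List[Vector]:
    """Encode each chunk on its own, then concatenate along the sequence axis."""
    out: List[Vector] = []
    for chunk in chunks:
        out.extend(toy_encode(chunk))
    return out


chunk2 = [7, 8, 9]
a = encode_prompt([[1, 2, 3], chunk2])
b = encode_prompt([[100, 200, 300], chunk2])

# Chunk 1 differs, yet the rows produced for chunk 2 are identical.
print(a[3:] == b[3:])  # True
print(a[:3] == b[:3])  # False
```

In a single-pass encoder over all six tokens, changing the first three would change the last three rows as well, which is the context continuity that chunking gives up.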
Reason 3: Overall composition decided in early diffusion steps
The overall composition, color tone, and main subjects are decided in the early diffusion steps. Chunk 1 is most strongly referenced at this point. Even if chunk 2 is referenced in later steps, the overall structure is already set, so it only adds details.
Is This CLIP-Specific? Differences from LLMs
CLIP’s chunk splitting is fundamentally different from LLM (e.g., ChatGPT) context limits.
| | CLIP chunk splitting | LLM context window |
|---|---|---|
| Split mechanism | Physically cuts at 75 tokens | Processes the entire window at once within the limit |
| Context continuity | None (each chunk is independent) | Yes (all tokens within the limit cross-reference each other) |
| Handling of overflow | Processed separately as 2nd chunk | Truncated or error |
| Position influence | Front is strongest, end is also strong | Generally uniform (though Recency Bias exists) |
LLMs can process long contexts like 128K tokens all at once, but CLIP has an extremely short window of only 75 tokens, with overflow processed separately without context.
Experiment: Are 2nd Chunk Elements Reflected?
Experiment 1: Effect by Prompt Length
Comparing short, medium, and long prompts on the same theme (woman in a flower garden).
Short (~15 tokens, 1/5 of one chunk)
| Result 1 | Result 2 | Result 3 |
|---|---|---|
| ![]() | ![]() | ![]() |
Result: Simple portrait. Background is a wall or street, clothing is a T-shirt or one-piece. No flower garden, white dress, or cherry blossom elements appear.
Medium (~75 tokens, exactly 1 chunk)
| Result 1 | Result 2 |
|---|---|
| ![]() | ![]() |
Result: Flower garden + white dress + cherry blossoms + necklace all reflected. Fitting in one chunk means all elements are fully effective.
Long (~150 tokens, 2 chunks)
Adding to the medium prompt: birds flying in the sky, distant mountains with snow caps, a small stream flowing nearby, wooden bridge in background, moss covered stone lantern, wisteria hanging from pergola, butterflies around flowers, dappled sunlight through leaves, mist in the valley below
| Result 1 | Result 2 | Result 3 |
|---|---|---|
| ![]() | ![]() | ![]() |
Result: Elements placed in the 2nd chunk are partially reflected. Wisteria, wooden bridge, stone lantern, birds, butterflies, and mountains appear in the frame. However, not all elements appear every time, and different elements appear per image.
Experiment 1 Summary
| Length | Chunks | 1st chunk elements | 2nd chunk elements |
|---|---|---|---|
| Short (15 tokens) | 1 | Fully reflected | — |
| Medium (75 tokens) | 1 | Fully reflected | — |
| Long (150 tokens) | 2 | Fully reflected | Partially reflected |
The 2nd chunk is not “ignored” — it is “partially reflected.” However, stability is low and which elements appear varies per generation.
Experiment 2: What Happens with Contradictory Instructions by Position?
Testing which instruction wins when the contradictory instructions red hair and long black hair are placed at the front vs. the end of the prompt.
Pattern A: red hair at front
| Result 1 | Result 2 | Result 3 |
|---|---|---|
| ![]() | ![]() | ![]() |
Result: 2 out of 3 images show red hair winning (chunk-09, chunk-10). The remaining 1 (chunk-11) has red at the roots and black at the tips, mixing both instructions. Overall the front red hair is dominant, but the rear long black hair is not completely ignored.
Lab Director comment: Getting a gradient with red at the roots and black at the tips from contradictory instructions — it’s like accidentally creating a design color, weirdly cool.
Pattern B: red hair at end
| Result 1 | Result 2 | Result 3 |
|---|---|---|
| ![]() | ![]() | ![]() |
Result: 1 out of 3 (chunk-12) showed pink to red gradient at the tips, but the other 2 (chunk-13, chunk-14) were nearly pure black. The end red hair has a tendency to be nearly ignored, with the front-positioned long black hair overwhelmingly dominant.
Experiment 2 Summary
| Position | Result | Interpretation |
|---|---|---|
| red hair at front | Red hair dominant (red wins in 2/3, 1 has red-black mix) | Front element is strongest (as per word order rule) |
| red hair at end | Black hair nearly wins (pure black in 2/3, only 1 has red at tips) | End instructions tend to be nearly ignored. Even when reflected, only partially |
Lab Director comment: So this round’s lesson is “put important elements in the first 75 tokens.” Treat the 2nd chunk as a backup and put everything critical in the 1st chunk for reliable results.





















