What Does CLIP's '75-Token Chunk Split' Actually Mean? Does the 2nd Chunk Really Get Weaker?


Conclusions

What Happens with Chunk Splitting

  1. The 2nd chunk is not “ignored” but “weakened”: Experiment 1 confirmed that wisteria, bridges, lanterns, and other 2nd-chunk elements were partially reflected
  2. 2nd chunk elements are unstable: different elements appear with each generation. Put whatever must appear in the 1st chunk
  3. With contradictory instructions, the one at the front wins overwhelmingly: in the red vs. black hair experiment, the color placed first was dominant and the color at the end was nearly ignored. The later color does occasionally show up partially, just not reliably

Practical Guidelines

Position | Use
1st chunk (tokens 1–75) | All elements you want definitely reflected. Subject, location, outfit, pose, lighting, style
2nd chunk (token 76+) | Supplementary elements that are nice to have but not essential. Quality keywords, background details, etc.
Don’t include | Ineffective elements like coherent anatomy, contradictory instructions

Recommendation: Keep Important Elements Within 75 Tokens

The “god prompts” summer festival polaroid (48 tokens) and café snapshot (42 tokens) are stable because all of their elements fit in one chunk. Images are still generated when a prompt exceeds 75 tokens, but the overflow elements are reflected unreliably.
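A quick way to sanity-check the 75-token budget is sketched below. This is only a rough estimate under a stated assumption: it counts words and punctuation marks with a regular expression, whereas CLIP uses a BPE tokenizer that can split a single word into several tokens. Treat the result as a lower bound. The names `rough_token_count` and `chunks_needed` are hypothetical helpers, not part of any library.

```python
import re

CHUNK_SIZE = 75  # usable prompt tokens per CLIP chunk


def rough_token_count(prompt: str) -> int:
    """Count words and punctuation marks as a lower-bound token estimate.

    Assumption: one token per word or punctuation mark. CLIP's real BPE
    tokenizer may split words further, so the true count can be higher.
    """
    return len(re.findall(r"\w+|[^\w\s]", prompt))


def chunks_needed(prompt: str) -> int:
    """Estimate how many 75-token chunks the prompt would occupy."""
    count = rough_token_count(prompt)
    return max(1, -(-count // CHUNK_SIZE))  # ceiling division, at least 1


short_prompt = (
    "a 20yo japanese woman, portrait, soft smile, natural light, 85mm lens"
)
print(rough_token_count(short_prompt), chunks_needed(short_prompt))
```

For an exact count, the prompt would need to go through the actual CLIP tokenizer used by the generation tool.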

Difference from LLMs

This constraint is specific to CLIP’s architecture. LLMs like ChatGPT and Claude process 128K to 1M tokens at once with no chunk splitting. The reason prompts feel short in Stable Diffusion is CLIP’s 75-token limit.

Flux.1 features a T5 text encoder (512-token capable) in addition to CLIP, improving handling of longer prompts.

In Prompt Basics, I wrote “exceeding 75 tokens splits into the next chunk, which has weaker influence.” However, I did not sufficiently explain what a chunk is, why it gets weaker, or whether it still matters once weakened.

This article explains the mechanism and verifies it experimentally.

What Is a Chunk?

CLIP’s text encoder processes at most 77 tokens per input. Two of those slots are taken by the BOS and EOS tokens, leaving 75 for the prompt itself. The limit was fixed when CLIP was trained and comes from the model’s architecture: its Transformer has positional embeddings for only 77 positions.

When a prompt exceeding 75 tokens is input, Stable Diffusion implementations (ComfyUI, A1111, etc.) process it as follows:

Prompt: [A, B, C, D, E, F, G, ...] (say 100 tokens)

Chunk 1: [BOS, A, B, C, ... token 75, EOS]  ← Input to CLIP independently
Chunk 2: [BOS, token 76, ... token 100, padding..., EOS]  ← Input to CLIP independently

→ The two outputs (one embedding per token position in each chunk) are concatenated and passed to U-Net/DiT

In other words, a chunk = a 75-token block that CLIP processes in one pass. Anything beyond 75 tokens forms a 2nd chunk, input to CLIP separately.
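The split described above can be sketched in a few lines of Python. This is a minimal illustration, not the code of ComfyUI or A1111: it assumes the prompt is already tokenized, pads short chunks with the EOS token, and cuts strictly every 75 tokens (real implementations may choose break points differently). The ids 49406 and 49407 are the BOS and EOS ids of the CLIP vocabulary; `split_into_chunks` is a hypothetical helper name.

```python
from typing import List

BOS_ID = 49406  # <|startoftext|> in the CLIP vocabulary
EOS_ID = 49407  # <|endoftext|>; also used as padding here (assumption)
CHUNK_SIZE = 75  # prompt tokens per chunk, excluding BOS and EOS


def split_into_chunks(
    token_ids: List[int], chunk_size: int = CHUNK_SIZE
) -> List[List[int]]:
    """Split token ids into blocks of BOS + chunk_size tokens + EOS."""
    chunks = []
    # max(..., 1) ensures an empty prompt still yields one padded chunk
    for start in range(0, max(len(token_ids), 1), chunk_size):
        body = token_ids[start:start + chunk_size]
        padding = [EOS_ID] * (chunk_size - len(body))
        chunks.append([BOS_ID] + body + padding + [EOS_ID])
    return chunks


# Stand-in ids 1..100 play the role of a 100-token prompt.
prompt_ids = list(range(1, 101))
chunks = split_into_chunks(prompt_ids)
print(len(chunks), [len(c) for c in chunks])
```

Every chunk comes out at exactly 77 ids, so each one fits CLIP's input length regardless of how long the original prompt was.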

Why Is the 2nd Chunk “Weaker”?

The following is an estimated mechanism and has not been directly confirmed by experiment.

Reason 1: Positional disadvantage in Cross-Attention

LDM’s U-Net (or DiT Transformer) reads CLIP’s output through Cross-Attention. After concatenation, chunk 1’s embeddings sit at the front of the sequence and chunk 2’s come after them. The hypothesis is that this later position gives chunk 2 relatively less attention.

Reason 2: Context break between chunks

CLIP processes each chunk independently. This means the context of chunk 1 (“a woman in a flower garden”) is not carried over to chunk 2 (“there’s a bridge, there are butterflies”). Chunk 2 elements are interpreted without context, so consistency with chunk 1’s subject and scene is not guaranteed.
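The context break can be illustrated with a toy sketch. The `toy_encoder` below is a hypothetical stand-in for CLIP, not the real model: it only mimics the property that every output vector depends on the whole chunk it belongs to. The point is the data flow in `encode_prompt` (also a hypothetical name): each chunk is encoded by a separate call, and the outputs are simply joined end to end.

```python
from typing import Callable, List, Sequence

Vector = List[float]
Encoder = Callable[[Sequence[int]], List[Vector]]


def toy_encoder(chunk: Sequence[int]) -> List[Vector]:
    """Hypothetical stand-in for CLIP: one small vector per token.

    The second component depends on the whole chunk, mimicking
    attention inside a single chunk.
    """
    total = float(sum(chunk))
    return [[float(token), total] for token in chunk]


def encode_prompt(
    chunks: Sequence[Sequence[int]], encode_chunk: Encoder
) -> List[Vector]:
    """Encode each chunk separately, then concatenate along the token axis."""
    conditioning: List[Vector] = []
    for chunk in chunks:
        # Each call sees only its own chunk; nothing from earlier
        # chunks is passed in.
        conditioning.extend(encode_chunk(chunk))
    return conditioning


chunk_1a = [1] * 77   # e.g. "a woman in a flower garden"
chunk_1b = [2] * 77   # a different first chunk
chunk_2 = [3] * 77    # e.g. "there's a bridge, there are butterflies"

out_a = encode_prompt([chunk_1a, chunk_2], toy_encoder)
out_b = encode_prompt([chunk_1b, chunk_2], toy_encoder)
print(len(out_a), out_a[77:] == out_b[77:])
```

Changing chunk 1 leaves the encoding of chunk 2 untouched, which is the sense in which chunk 2 is interpreted without chunk 1's context.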

Reason 3: Overall composition decided in early diffusion steps

The overall composition, color tone, and main subjects are decided in the early diffusion steps. Chunk 1 is most strongly referenced at this point. Even if chunk 2 is referenced in later steps, the overall structure is already set, so it only adds details.

Is This CLIP-Specific? Differences from LLMs

CLIP’s chunk splitting is fundamentally different from LLM (e.g., ChatGPT) context limits.

Aspect | CLIP chunk splitting | LLM context window
Split mechanism | Physically cuts at 75 tokens | Processes the entire window at once within the limit
Context continuity | None (each chunk is independent) | Yes (all tokens within the limit cross-reference each other)
Handling of overflow | Processed separately as 2nd chunk | Truncated or error
Position influence | Front is strongest, end is also strong | Generally uniform (though Recency Bias exists)

LLMs can process long contexts like 128K tokens all at once, but CLIP has an extremely short window of only 75 tokens, with overflow processed separately without context.

Experiment: Are 2nd Chunk Elements Reflected?

Experiment 1: Effect by Prompt Length

Comparing short, medium, and long prompts on the same theme (woman in a flower garden).

Short (~15 tokens, 1/5 of one chunk)

Short (~15 tokens)
a 20yo japanese woman, portrait, soft smile, natural light, 85mm lens
[Images: Result 1, Result 2, Result 3 (short1, short2, short3)]

Result: Simple portrait. The background is a wall or street, and the clothing is a T-shirt or a dress. No flower garden, white dress, or cherry blossom elements appear.

Medium (~75 tokens, exactly 1 chunk)

Medium (~75 tokens = 1 chunk)
a 20yo japanese woman, portrait, soft smile, natural light, 85mm lens, standing in a flower garden, wearing a white summer dress, long black hair, cherry blossoms falling, warm afternoon sunlight, shallow depth of field, gentle breeze blowing hair, looking at camera, delicate gold necklace, professional photography, photorealistic, detailed skin texture, magazine quality, elegant pose, spring atmosphere
[Images: Result 1, Result 2 (medium1, medium2)]

Result: Flower garden + white dress + cherry blossoms + necklace all reflected. Fitting in one chunk means all elements are fully effective.

Long (~150 tokens, 2 chunks)

Adding to the medium prompt: birds flying in the sky, distant mountains with snow caps, a small stream flowing nearby, wooden bridge in background, moss covered stone lantern, wisteria hanging from pergola, butterflies around flowers, dappled sunlight through leaves, mist in the valley below

Long (~150 tokens = 2 chunks)
a 20yo japanese woman, portrait, soft smile, natural light, 85mm lens, standing in a flower garden, wearing a white summer dress, long black hair, cherry blossoms falling, warm afternoon sunlight, shallow depth of field, gentle breeze blowing hair, looking at camera, delicate gold necklace, professional photography, photorealistic, detailed skin texture, magazine quality, elegant pose, spring atmosphere, birds flying in the sky, distant mountains with snow caps, a small stream flowing nearby, wooden bridge in background, moss covered stone lantern, wisteria hanging from pergola, butterflies around flowers, dappled sunlight through leaves, mist in the valley below
[Images: Result 1, Result 2, Result 3 (long1, long2, long3)]

Result: Elements placed in the 2nd chunk are partially reflected. Wisteria, wooden bridge, stone lantern, birds, butterflies, and mountains appear in the frame. However, not all elements appear every time, and different elements appear per image.

Experiment 1 Summary

Length | Chunks | 1st chunk elements | 2nd chunk elements
Short (15 tokens) | 1 | Fully reflected | -
Medium (75 tokens) | 1 | Fully reflected | -
Long (150 tokens) | 2 | Fully reflected | Partially reflected

The 2nd chunk is not “ignored” — it is “partially reflected.” However, stability is low and which elements appear varies per generation.

Experiment 2: What Happens with Contradictory Instructions by Position?

This experiment tests which instruction wins when the contradictory instructions red hair and long black hair are given, with red hair placed at the front vs. at the end.

Pattern A: red hair at front

red hair, a 20yo japanese woman, portrait, soft smile, natural light, 85mm lens, standing in a flower garden, wearing a white summer dress, long black hair, cherry blossoms falling
[Images: Result 1, Result 2, Result 3 (A1, A2, A3)]

Result: 2 out of 3 images show red hair winning (chunk-09, chunk-10). The remaining 1 (chunk-11) has red at the roots and black at the tips, mixing both instructions. Overall the front red hair is dominant, but the rear long black hair is not completely ignored.

Lab Director comment: Getting a gradient with red at the roots and black at the tips from contradictory instructions — it’s like accidentally creating a design color, weirdly cool.

Pattern B: red hair at end

a 20yo japanese woman, portrait, soft smile, natural light, 85mm lens, standing in a flower garden, wearing a white summer dress, long black hair, cherry blossoms falling, red hair
[Images: Result 1, Result 2, Result 3 (B1, B2, B3)]

Result: 1 out of 3 (chunk-12) showed a pink-to-red gradient at the tips, but the other 2 (chunk-13, chunk-14) were nearly pure black. The red hair at the end tends to be nearly ignored, and the long black hair placed ahead of it is overwhelmingly dominant.

Experiment 2 Summary

Position | Result | Interpretation
red hair at front | Red hair dominant (red wins in 2/3, 1 has red-black mix) | Front element is strongest (as per word order rule)
red hair at end | Black hair nearly wins (pure black in 2/3, only 1 has red at tips) | End instructions tend to be nearly ignored. Even when reflected, only partially

Lab Director comment: So the lesson this time is “put important elements in the first 75 tokens.” Treat the 2nd chunk as a backup and put everything critical in the 1st chunk for reliable results.