[Paper Summary] CLIP | The AI Foundation Linking Text and Images

Paper Info

  • Title: Learning Transferable Visual Models From Natural Language Supervision
  • Authors: Alec Radford, Jong Wook Kim, Chris Hallacy, et al. (OpenAI)
  • Published: ICML 2021
  • arXiv: 2103.00020

When you write a prompt for AI image generation, how does plain text end up steering the image that comes out? The answer lies in CLIP. The “75-token limit” mentioned in Prompt Basics also traces back to CLIP.

What is it?

CLIP is a model that learns the semantic correspondence between text and images.

  • It can judge that the text “a photo of a cat” and an actual cat photo are “similar”
  • It can select the image that best matches “a sunset over the ocean” from a large collection

It was trained with contrastive learning on 400 million text-image pairs (the WIT-400M dataset) collected from the internet.

In AI image generation (Stable Diffusion etc.), the text encoder portion of CLIP is used to convert prompts into vectors.

What makes it better than prior work?

Prior approach (ImageNet pre-training)

  • Classification trained on a fixed set of 1,000 categories (dogs, cats, cars, etc.)
  • New categories require additional training data and fine-tuning
  • Cannot handle concepts not in the category set (e.g., “a smiling woman lit by a sunset”)

CLIP’s approach

  • Flexible concept representation via natural language — any text can describe an image
  • Zero-shot transfer — can recognize unseen categories using only text descriptions
  • Zero-shot ImageNet accuracy on par with a fully supervised ResNet-50 — achieved without using any ImageNet training examples
  • Competitive performance on 30+ visual benchmarks (OCR, video action recognition, etc.)

What’s the core idea?

Contrastive Learning

CLIP’s key innovation is acquiring text-image correspondences through contrastive learning.

For a batch of N text-image pairs:

  • Positives: Matching text-image pairs → maximize similarity
  • Negatives: Non-matching pairs (N²-N pairs) → minimize similarity

This is optimized symmetrically (both text→image and image→text directions).
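In code the loss is compact. Below is a minimal PyTorch sketch (not the paper's exact implementation), assuming `image_features` and `text_features` are the two encoders' outputs for one batch of N matching pairs and `logit_scale` is the learned temperature:

```python
# Minimal sketch of CLIP-style symmetric contrastive loss.
# image_features / text_features: (N, 512) encoder outputs for matching pairs.
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_features, text_features, logit_scale):
    # L2-normalize so the dot product equals cosine similarity
    image_features = F.normalize(image_features, dim=-1)
    text_features = F.normalize(text_features, dim=-1)

    # N x N similarity matrix: entry (i, j) compares image i with text j
    logits_per_image = logit_scale * image_features @ text_features.t()
    logits_per_text = logits_per_image.t()

    # The matching pair for row i sits on the diagonal (column i)
    labels = torch.arange(image_features.size(0), device=image_features.device)

    # Symmetric cross-entropy: image->text and text->image directions
    loss_i = F.cross_entropy(logits_per_image, labels)
    loss_t = F.cross_entropy(logits_per_text, labels)
    return (loss_i + loss_t) / 2
```

Pulling the diagonal entries up and all off-diagonal entries down is what forces matching text and images to land close together in the shared embedding space.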

Dual Encoder Architecture

| Component | Role | Output |
| --- | --- | --- |
| Text Encoder | Converts text to a vector | Text embedding (512 dimensions) |
| Image Encoder | Converts an image to a vector | Image embedding (512 dimensions) |

  • Text encoder: Transformer (GPT-based)
  • Image encoder: Vision Transformer (ViT) or ResNet

Both outputs are mapped to a shared vector space and compared using cosine similarity.
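A minimal inference sketch using OpenAI's reference `clip` package (the file name and candidate captions are placeholders; the Hugging Face `transformers` CLIP classes work equivalently):

```python
# Sketch: encode one image and two captions, compare them by cosine similarity.
import clip
import torch
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)  # 512-d embeddings

image = preprocess(Image.open("cat.jpg")).unsqueeze(0).to(device)
text = clip.tokenize(["a photo of a cat", "a photo of a dog"]).to(device)

with torch.no_grad():
    image_emb = model.encode_image(image)  # shape (1, 512)
    text_emb = model.encode_text(text)     # shape (2, 512)

    # Cosine similarity in the shared space: higher = better text-image match
    image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)
    text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
    similarity = image_emb @ text_emb.T

print(similarity)
```

Because both embeddings live in the same space, the same comparison works in the other direction: given one text query, you can rank an entire folder of images by similarity.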

WIT-400M Dataset

  • 400 million text-image pairs collected from the internet
  • More than 300× the scale of ImageNet’s 1.28 million images
  • Covers diverse concepts (photos, illustrations, charts, memes, etc.)

The Origin of the 75-Token Limit

CLIP’s text encoder processes a maximum of 77 tokens (including BOS/EOS tokens). The practical limit is 75 tokens.

In Stable Diffusion models, prompts exceeding 75 tokens are split into chunks. This is the technical basis for the rule stated in Prompt Basics: “the first 75 tokens carry the most weight.”
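You can check the limit directly with the CLIP tokenizer that Stable Diffusion 1.x uses (a sketch with the Hugging Face `transformers` library; the model id and prompt are illustrative):

```python
# Sketch: counting CLIP tokens for a prompt.
from transformers import CLIPTokenizer

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
print(tokenizer.model_max_length)  # 77 -> BOS + 75 content tokens + EOS

prompt = "a photo of a cat sitting on a windowsill at sunset"
ids = tokenizer(prompt).input_ids
# ids[0] is the BOS token, ids[-1] is the EOS token; the rest count toward 75
print(len(ids) - 2, "content tokens")
```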

Role in AI Image Generation

In text-to-image models like Stable Diffusion, CLIP is used as follows:

User's prompt (text)
    ↓
CLIP text encoder → text embedding vector
    ↓
Cross-Attention layers (inside the U-Net) condition the diffusion process
    ↓
Generated image

In other words, CLIP acts as a bridge between prompts and images.
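A sketch of this flow with the Hugging Face `diffusers` API (the model id and prompt are illustrative, and `diffusers` normally performs the encoding step internally):

```python
# Sketch: make the CLIP text-encoding step explicit in a Stable Diffusion run.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

prompt = "a sunset over the ocean, professional photography"

# Step 1: CLIP tokenizer + text encoder turn the prompt into embeddings
tokens = pipe.tokenizer(
    prompt,
    padding="max_length",
    max_length=pipe.tokenizer.model_max_length,  # 77
    truncation=True,
    return_tensors="pt",
).to("cuda")
with torch.no_grad():
    text_embeddings = pipe.text_encoder(tokens.input_ids)[0]  # (1, 77, 768)

# Step 2: the U-Net attends to these embeddings via cross-attention at every
# denoising step; passing prompt_embeds skips the pipeline's internal encoding
image = pipe(prompt_embeds=text_embeddings).images[0]
image.save("out.png")
```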

How was it validated?

Zero-Shot Evaluation

Image classification was performed using only text prompts, with no additional training data.

| Benchmark | CLIP zero-shot | Supervised ResNet-50 |
| --- | --- | --- |
| ImageNet | 76.2% | 76.1% |
| ImageNet-V2 | 70.1% | 63.3% |
| ImageNet-Sketch | 60.2% | 24.8% |

Particularly noteworthy is robustness to distribution shift. On ImageNet-V2 (a re-collected test set) and ImageNet-Sketch (sketch-style images), CLIP far outperforms the supervised model.
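In code, the zero-shot procedure is: embed one prompt per class (the paper's “a photo of a {category}” template), embed the image, and pick the class with the highest similarity. A minimal sketch with the reference `clip` package (class names and image path are illustrative):

```python
# Sketch: zero-shot classification from text prompts only, no training data.
import clip
import torch
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

class_names = ["cat", "dog", "car", "airplane"]
# The prompt template noticeably improves accuracy over bare class names
prompts = clip.tokenize([f"a photo of a {c}" for c in class_names]).to(device)
image = preprocess(Image.open("test.jpg")).unsqueeze(0).to(device)

with torch.no_grad():
    # logits_per_image: cosine similarities scaled by the learned temperature
    logits_per_image, _ = model(image, prompts)
    probs = logits_per_image.softmax(dim=-1)

print(class_names[probs.argmax().item()], probs)
```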

30+ Benchmarks

Evaluated across diverse tasks including OCR, satellite image recognition, video action recognition, and geolocalization. Achieves performance competitive with existing specialized models on many tasks.

Are there limitations?

Limitations

  • Task-dependent performance variance: Still lags behind specialized models on fine-grained classification (e.g., distinguishing flower species)
  • Prompt engineering required: Results vary greatly by how text is written (templates like “a photo of a {category}” are effective)
  • Bias in training data: Data collected from the internet contains societal biases
  • Weak on abstract concepts: Poor at handling quantity expressions like “an image with exactly 3 objects”

Implications for Prompt Writing

Since CLIP’s training data is web-collected text-image pairs, it tends to work well with expressions commonly found on the web (photo captions, product descriptions, etc.).

This is one reason why camera terms like “professional photography” or “85mm lens” are effective in prompts — CLIP’s training data likely contains many photography-related captions.

Related Papers

| Paper | Relevance |
| --- | --- |
| ALIGN (Jia et al., 2021) | Google’s similar vision-language pre-training; scaling with noisy data |
| Vision Transformer (ViT) (Dosovitskiy et al., 2020) | Architecture used for CLIP’s image encoder |
| Latent Diffusion Models (Rombach et al., 2022) | Applied CLIP text conditioning to image generation → LDM Paper Summary |
| Classifier-Free Diffusion Guidance (Ho & Salimans, 2022) | Guidance technique applied to CLIP-conditioned generation → CFG Paper Summary |
| OpenCLIP | Open-source reimplementation of CLIP, trained on LAION-5B |

Impact on AI image generation

CLIP is the foundation for every model that generates images from text. Stable Diffusion 1.x uses the text encoder from CLIP ViT-L/14; SD 2.x and SDXL use OpenCLIP text encoders (SDXL pairs OpenCLIP with the original CLIP ViT-L/14).

Prompt phrasing affects image quality so strongly because the result depends on how CLIP vectorizes the text. Understanding CLIP is therefore the starting point for effective prompt design.