Paper Info
- Title: Learning Transferable Visual Models From Natural Language Supervision
- Authors: Alec Radford, Jong Wook Kim, Chris Hallacy, et al. (OpenAI)
- Published: ICML 2021
- arXiv: 2103.00020
When you write a prompt for AI image generation, how does plain text end up steering the image? The answer lies in CLIP. The “75-token limit” mentioned in Prompt Basics also traces back to CLIP.
What is it?
CLIP is a model that learns the semantic correspondence between text and images.
- It can judge that the text “a photo of a cat” and an actual cat photo are “similar”
- It can select the image that best matches “a sunset over the ocean” from a large collection
It was trained with contrastive learning on 400 million text-image pairs (the WIT-400M dataset) collected from the internet.
In AI image generation (Stable Diffusion etc.), the text encoder portion of CLIP is used to convert prompts into vectors.
What makes it better than prior work?
Prior approach (ImageNet pre-training)
- Classification trained on a fixed set of 1,000 categories (dogs, cats, cars, etc.)
- New categories require additional training data and fine-tuning
- Cannot handle concepts not in the category set (e.g., “a smiling woman lit by a sunset”)
CLIP’s approach
- Flexible concept representation via natural language — any text can describe an image
- Zero-shot transfer — can recognize unseen categories using only text descriptions
- Matches the ImageNet accuracy of a fully supervised ResNet-50, zero-shot — achieved without using any of ImageNet’s training examples
- Competitive performance on 30+ visual benchmarks (OCR, video action recognition, etc.)
What’s the core idea?
Contrastive Learning
CLIP’s key innovation is acquiring text-image correspondences through contrastive learning.
For a batch of N text-image pairs:
- Positives: Matching text-image pairs → maximize similarity
- Negatives: Non-matching pairs (N²-N pairs) → minimize similarity
This is optimized symmetrically (both text→image and image→text directions).
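The symmetric objective above can be sketched in a few lines of numpy, in the spirit of the pseudocode in the paper. The temperature value and the toy embeddings are illustrative assumptions, not trained parameters:

```python
import numpy as np

def clip_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric contrastive (InfoNCE) loss over a batch of N pairs.

    image_emb, text_emb: (N, d) arrays; row i of each is a matching pair.
    """
    # L2-normalize so dot products become cosine similarities
    image_emb = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    text_emb = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)

    # (N, N) similarity matrix: the diagonal holds the N positives,
    # the N^2 - N off-diagonal entries are the negatives
    logits = image_emb @ text_emb.T / temperature

    # Cross-entropy with the diagonal as the target, in both directions
    def cross_entropy(l):
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_probs))

    loss_i2t = cross_entropy(logits)    # image -> text direction
    loss_t2i = cross_entropy(logits.T)  # text -> image direction
    return (loss_i2t + loss_t2i) / 2
```

With perfectly matched pairs the loss approaches zero; with mismatched pairs it grows, which is exactly the signal that pulls matching embeddings together.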
Dual Encoder Architecture
| Component | Role | Output |
|---|---|---|
| Text Encoder | Converts text to a vector | Text embedding (512 dimensions for ViT-B variants; 768 for ViT-L/14) |
| Image Encoder | Converts image to a vector | Image embedding (same dimensionality, in the shared space) |
- Text encoder: Transformer (GPT-based)
- Image encoder: Vision Transformer (ViT) or ResNet
Both outputs are mapped to a shared vector space and compared using cosine similarity.
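The mapping into the shared space is done by a learned linear projection per encoder. A minimal sketch with stand-in features and random (untrained) projection matrices — the native widths shown (768 for the ViT-B image tower, 512 for the text tower) follow the paper, but the feature vectors here are placeholders:

```python
import numpy as np

rng = np.random.default_rng(0)

# Learned projections into the shared 512-d space (random stand-ins here)
W_img = rng.normal(size=(768, 512)) * 0.02
W_txt = rng.normal(size=(512, 512)) * 0.02

img_feat = rng.normal(size=768)  # stand-in for image encoder output
txt_feat = rng.normal(size=512)  # stand-in for text encoder output

def to_shared(feat, W):
    v = feat @ W                   # project into the shared space
    return v / np.linalg.norm(v)   # L2-normalize

# After normalization, the dot product IS the cosine similarity
cos_sim = to_shared(img_feat, W_img) @ to_shared(txt_feat, W_txt)
```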
WIT-400M Dataset
- 400 million text-image pairs collected from the internet
- More than 300× the scale of ImageNet’s 1.28 million images
- Covers diverse concepts (photos, illustrations, charts, memes, etc.)
The Origin of the 75-Token Limit
CLIP’s text encoder processes a maximum of 77 tokens (including BOS/EOS tokens). The practical limit is 75 tokens.
In Stable Diffusion models, prompts exceeding 75 tokens are split into chunks. This is the technical basis for the rule stated in Prompt Basics: “the first 75 tokens carry the most weight.”
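The chunking can be sketched in plain Python. The BOS/EOS ids shown are CLIP’s actual special-token ids; the padding id and exact chunk handling vary by implementation, so treat this as an assumption-laden sketch rather than any frontend’s real code:

```python
BOS, EOS = 49406, 49407  # CLIP's start/end-of-text token ids
PAD = 0                  # padding id varies by implementation; 0 used here

def to_chunks(token_ids, chunk_len=75, total_len=77):
    """Split prompt tokens into 75-token chunks, each framed by BOS/EOS
    and padded to CLIP's fixed length of 77."""
    chunks = []
    for i in range(0, max(len(token_ids), 1), chunk_len):
        body = token_ids[i:i + chunk_len]
        chunk = [BOS] + body + [EOS] + [PAD] * (chunk_len - len(body))
        chunks.append(chunk)
    return chunks
```

A 100-token prompt thus becomes two 77-token sequences, each encoded separately — which is why content near a chunk boundary can lose coherence.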
Role in AI Image Generation
In text-to-image models like Stable Diffusion, CLIP is used as follows:
User's prompt (text)
↓
CLIP text encoder → text embedding vector
↓
Cross-Attention layers (inside the U-Net) condition the diffusion process
↓
Generated image
In other words, CLIP acts as a bridge between prompts and images.
How was it validated?
Zero-Shot Evaluation
Image classification was performed using only text prompts, with no additional training data.
| Benchmark | CLIP zero-shot | Supervised ResNet-50 |
|---|---|---|
| ImageNet | 76.2% | 76.1% |
| ImageNet-V2 | 70.1% | 63.3% |
| ImageNet-Sketch | 60.2% | 24.8% |
Particularly noteworthy is robustness to distribution shift. On ImageNet-V2 (slightly different conditions) and ImageNet-Sketch (sketch images), CLIP far outperforms the supervised model.
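The zero-shot procedure itself is simple: embed one text prompt per class (e.g., “a photo of a {class}”), then pick the class whose embedding is most similar to the image embedding. A minimal sketch with stand-in embeddings:

```python
import numpy as np

def zero_shot_classify(image_emb, class_text_embs):
    """Return the index of the class text most similar to the image.

    image_emb: (d,) embedding of one image.
    class_text_embs: (C, d) embeddings of prompts like "a photo of a {class}".
    """
    image_emb = image_emb / np.linalg.norm(image_emb)
    t = class_text_embs / np.linalg.norm(class_text_embs, axis=1, keepdims=True)
    sims = t @ image_emb  # cosine similarity per class
    return int(np.argmax(sims))

# Toy check: the "image" embedding is built to lie near class 1,
# so classification should pick index 1.
rng = np.random.default_rng(0)
classes = rng.normal(size=(3, 8))
image = classes[1] + 0.01 * rng.normal(size=8)
assert zero_shot_classify(image, classes) == 1
```

No classifier head is trained; swapping in a new category only requires writing a new text prompt.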
30+ Benchmarks
Evaluated across diverse tasks including OCR, satellite image recognition, video action recognition, and geolocalization. Achieves performance competitive with existing specialized models on many tasks.
Are there limitations?
Limitations
- Task-dependent performance variance: Still lags behind specialized models on fine-grained classification (e.g., distinguishing flower species)
- Prompt engineering required: Results vary greatly by how text is written (templates like “a photo of a {category}” are effective)
- Bias in training data: Data collected from the internet contains societal biases
- Weak on abstract concepts: Poor at handling quantity expressions like “an image with exactly 3 objects”
Implications for Prompt Writing
Since CLIP’s training data is web-collected text-image pairs, it tends to work well with expressions commonly found on the web (photo captions, product descriptions, etc.).
This is one reason why camera terms like “professional photography” or “85mm lens” are effective in prompts — CLIP’s training data likely contains many photography-related captions.
What to read next
| Paper | Relevance |
|---|---|
| ALIGN (Jia et al., 2021) | Google’s similar vision-language pre-training. Scaling with noisy data |
| Vision Transformer (ViT) (Dosovitskiy et al., 2020) | Architecture used for CLIP’s image encoder |
| Latent Diffusion Models (Rombach et al., 2022) | Applied CLIP to image generation → LDM Paper Summary |
| Classifier-Free Diffusion Guidance (Ho & Salimans, 2022) | Guidance for text-conditioned diffusion; the basis of negative prompts → CFG Paper Summary |
| OpenCLIP | Open-source reimplementation of CLIP, trained on LAION-5B |
Impact on AI image generation
CLIP is the foundation for every model that generates images from text. Stable Diffusion 1.x uses CLIP’s ViT-L/14 text encoder; SD 2.x switched to OpenCLIP, and SDXL combines an OpenCLIP encoder with CLIP’s ViT-L.
The reason prompt phrasing so greatly affects image quality is that everything the generator knows about your intent passes through CLIP’s vectorization of the text. Understanding CLIP is therefore the starting point for effective prompt design.
Related Articles
- Prompt Basics — Word order rules in prompts based on CLIP
- Prompt Design Thinking — Prompt design that leverages CLIP’s characteristics
- CFG Paper Summary — Guidance for text-conditioned diffusion (negative prompts)
- LDM Paper Summary — Model that integrates CLIP as text conditioning