[Paper Summary] CLIP | The AI Foundation Linking Text and Images

Paper Info

  • Title: Learning Transferable Visual Models From Natural Language Supervision
  • Authors: Alec Radford, Jong Wook Kim, Chris Hallacy, et al. (OpenAI)
  • Published: ICML 2021
  • arXiv: 2103.00020

When you write a prompt for AI image generation, how does plain text end up steering the image that comes out? The answer lies in CLIP. The “75-token limit” mentioned in Prompt Basics also traces back to CLIP.

What is it?

CLIP is a model that learns the semantic correspondence between text and images.

  • It can judge that the text “a photo of a cat” and an actual cat photo are “similar”
  • It can select the image that best matches “a sunset over the ocean” from a large collection

It was trained with contrastive learning on 400 million text-image pairs (the WIT-400M dataset) collected from the internet.

In AI image generation (Stable Diffusion etc.), the text encoder portion of CLIP is used to convert prompts into vectors.

What makes it better than prior work?

Prior approach (ImageNet pre-training)

  • Classification trained on a fixed set of 1,000 categories (dogs, cats, cars, etc.)
  • New categories require additional training data and fine-tuning
  • Cannot handle concepts not in the category set (e.g., “a smiling woman lit by a sunset”)

CLIP’s approach

  • Flexible concept representation via natural language — any text can describe an image
  • Zero-shot transfer — can recognize unseen categories using only text descriptions
  • Zero-shot ImageNet accuracy on par with a fully supervised ResNet-50 — achieved without using any ImageNet training examples
  • Competitive performance on 30+ visual benchmarks (OCR, video action recognition, etc.)

What’s the core idea?

Contrastive Learning

CLIP’s key innovation is acquiring text-image correspondences through contrastive learning.

For a batch of N text-image pairs:

  • Positives: Matching text-image pairs → maximize similarity
  • Negatives: Non-matching pairs (N²-N pairs) → minimize similarity

This is optimized symmetrically (both text→image and image→text directions).
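In code the loss is compact. Below is a minimal PyTorch sketch (not the paper's exact implementation), assuming `image_features` and `text_features` are the two encoders' outputs for one batch of N matching pairs and `logit_scale` is the learned temperature:

```python
# Minimal sketch of CLIP-style symmetric contrastive loss.
# image_features / text_features: (N, 512) encoder outputs for matching pairs.
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_features, text_features, logit_scale):
    # L2-normalize so the dot product equals cosine similarity
    image_features = F.normalize(image_features, dim=-1)
    text_features = F.normalize(text_features, dim=-1)

    # N x N similarity matrix: entry (i, j) compares image i with text j
    logits_per_image = logit_scale * image_features @ text_features.t()
    logits_per_text = logits_per_image.t()

    # The matching pair for row i sits on the diagonal (column i)
    labels = torch.arange(image_features.size(0), device=image_features.device)

    # Symmetric cross-entropy: image->text and text->image directions
    loss_i = F.cross_entropy(logits_per_image, labels)
    loss_t = F.cross_entropy(logits_per_text, labels)
    return (loss_i + loss_t) / 2
```

Pulling the diagonal entries up and all off-diagonal entries down is what forces matching text and images to land close together in the shared embedding space.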

Dual Encoder Architecture

| Component | Role | Output |
| --- | --- | --- |
| Text Encoder | Converts text to a vector | Text embedding (512 dimensions) |
| Image Encoder | Converts an image to a vector | Image embedding (512 dimensions) |

  • Text encoder: Transformer (GPT-based)
  • Image encoder: Vision Transformer (ViT) or ResNet

Both outputs are mapped to a shared vector space and compared using cosine similarity.
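A minimal inference sketch using OpenAI's reference `clip` package (the file name and candidate captions are placeholders; the Hugging Face `transformers` CLIP classes work equivalently):

```python
# Sketch: encode one image and two captions, compare them by cosine similarity.
import clip
import torch
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)  # 512-d embeddings

image = preprocess(Image.open("cat.jpg")).unsqueeze(0).to(device)
text = clip.tokenize(["a photo of a cat", "a photo of a dog"]).to(device)

with torch.no_grad():
    image_emb = model.encode_image(image)  # shape (1, 512)
    text_emb = model.encode_text(text)     # shape (2, 512)

    # Cosine similarity in the shared space: higher = better text-image match
    image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)
    text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
    similarity = image_emb @ text_emb.T

print(similarity)
```

Because both embeddings live in the same space, the same comparison works in the other direction: given one text query, you can rank an entire folder of images by similarity.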

WIT-400M Dataset

  • 400 million text-image pairs collected from the internet
  • More than 300× the scale of ImageNet’s 1.28 million images
  • Covers diverse concepts (photos, illustrations, charts, memes, etc.)

The Origin of the 75-Token Limit

CLIP’s text encoder processes a maximum of 77 tokens (including BOS/EOS tokens). The practical limit is 75 tokens.

In Stable Diffusion models, prompts exceeding 75 tokens are split into chunks. This is the technical basis for the rule stated in Prompt Basics: “the first 75 tokens carry the most weight.”
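You can check the limit directly with the CLIP tokenizer that Stable Diffusion 1.x uses (a sketch with the Hugging Face `transformers` library; the model id and prompt are illustrative):

```python
# Sketch: counting CLIP tokens for a prompt.
from transformers import CLIPTokenizer

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
print(tokenizer.model_max_length)  # 77 -> BOS + 75 content tokens + EOS

prompt = "a photo of a cat sitting on a windowsill at sunset"
ids = tokenizer(prompt).input_ids
# ids[0] is the BOS token, ids[-1] is the EOS token; the rest count toward 75
print(len(ids) - 2, "content tokens")
```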

Role in AI Image Generation

In text-to-image models like Stable Diffusion, CLIP is used as follows:

User's prompt (text)
    ↓
CLIP text encoder → text embedding vector
    ↓
Cross-Attention layers (inside the U-Net) condition the diffusion process
    ↓
Generated image

In other words, CLIP acts as a bridge between prompts and images.
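A sketch of this flow with the Hugging Face `diffusers` API (the model id and prompt are illustrative, and `diffusers` normally performs the encoding step internally):

```python
# Sketch: make the CLIP text-encoding step explicit in a Stable Diffusion run.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

prompt = "a sunset over the ocean, professional photography"

# Step 1: CLIP tokenizer + text encoder turn the prompt into embeddings
tokens = pipe.tokenizer(
    prompt,
    padding="max_length",
    max_length=pipe.tokenizer.model_max_length,  # 77
    truncation=True,
    return_tensors="pt",
).to("cuda")
with torch.no_grad():
    text_embeddings = pipe.text_encoder(tokens.input_ids)[0]  # (1, 77, 768)

# Step 2: the U-Net attends to these embeddings via cross-attention at every
# denoising step; passing prompt_embeds skips the pipeline's internal encoding
image = pipe(prompt_embeds=text_embeddings).images[0]
image.save("out.png")
```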

How was it validated?

Zero-Shot Evaluation

Image classification was performed using only text prompts, with no additional training data.

| Benchmark | CLIP zero-shot | Supervised ResNet-50 |
| --- | --- | --- |
| ImageNet | 76.2% | 76.1% |
| ImageNet-V2 | 70.1% | 63.3% |
| ImageNet-Sketch | 60.2% | 24.8% |

Particularly noteworthy is robustness to distribution shift. On ImageNet-V2 (a re-collected test set) and ImageNet-Sketch (sketch-style images), CLIP far outperforms the supervised model.
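In code, the zero-shot procedure is: embed one prompt per class (the paper's “a photo of a {category}” template), embed the image, and pick the class with the highest similarity. A minimal sketch with the reference `clip` package (class names and image path are illustrative):

```python
# Sketch: zero-shot classification from text prompts only, no training data.
import clip
import torch
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

class_names = ["cat", "dog", "car", "airplane"]
# The prompt template noticeably improves accuracy over bare class names
prompts = clip.tokenize([f"a photo of a {c}" for c in class_names]).to(device)
image = preprocess(Image.open("test.jpg")).unsqueeze(0).to(device)

with torch.no_grad():
    # logits_per_image: cosine similarities scaled by the learned temperature
    logits_per_image, _ = model(image, prompts)
    probs = logits_per_image.softmax(dim=-1)

print(class_names[probs.argmax().item()], probs)
```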

30+ Benchmarks

Evaluated across diverse tasks including OCR, satellite image recognition, video action recognition, and geolocalization. Achieves performance competitive with existing specialized models on many tasks.

Are there limitations?

Limitations

  • Task-dependent performance variance: Still lags behind specialized models on fine-grained classification (e.g., distinguishing flower species)
  • Prompt engineering required: Results vary greatly by how text is written (templates like “a photo of a {category}” are effective)
  • Bias in training data: Data collected from the internet contains societal biases
  • Weak on abstract concepts: Poor at handling quantity expressions like “an image with exactly 3 objects”

Implications for Prompt Writing

Since CLIP’s training data is web-collected text-image pairs, it tends to work well with expressions commonly found on the web (photo captions, product descriptions, etc.).

This is one reason why camera terms like “professional photography” or “85mm lens” are effective in prompts — CLIP’s training data likely contains many photography-related captions.

Related Papers

| Paper | Relevance |
| --- | --- |
| ALIGN (Jia et al., 2021) | Google’s similar vision-language pre-training; scaling with noisy data |
| Vision Transformer (ViT) (Dosovitskiy et al., 2020) | Architecture used for CLIP’s image encoder |
| Latent Diffusion Models (Rombach et al., 2022) | Applied CLIP text conditioning to image generation → LDM Paper Summary |
| Classifier-Free Diffusion Guidance (Ho & Salimans, 2022) | Guidance technique applied to CLIP-conditioned generation → CFG Paper Summary |
| OpenCLIP | Open-source reimplementation of CLIP, trained on LAION-5B |

Impact on AI image generation

CLIP is the foundation for every model that generates images from text. Stable Diffusion 1.x uses the text encoder from CLIP ViT-L/14; SD 2.x and SDXL use OpenCLIP text encoders (SDXL pairs OpenCLIP with the original CLIP ViT-L/14).

Prompt phrasing affects image quality so strongly because the result depends on how CLIP vectorizes the text. Understanding CLIP is therefore the starting point for effective prompt design.