Whitened CLIP as a Likelihood Surrogate of Images and Captions

ICML 2025

Abstract

Whitened CLIP is a training-free, invertible linear whitening of CLIP embeddings that gives the features zero mean and identity covariance, making the transformed embeddings approximately standard normal. This yields a simple likelihood surrogate: up to an additive constant, the log-likelihood is proportional to the negative squared Euclidean norm in the whitened space. The paper validates the Gaussianity with statistical tests and demonstrates applications for both images and captions, including identifying generated images with artifacts, ranking distribution shifts, and measuring text complexity.

WCLIP teaser figure

Key Insights

Whitening

We introduce Whitened CLIP: an invertible, training-free linear whitening of CLIP embeddings that enforces zero mean and identity covariance.
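
As a rough sketch of the idea (not the paper's code), such a whitening transform can be estimated from a reference set of CLIP embeddings via an eigendecomposition of their covariance. The helper names below are illustrative, NumPy is assumed, and the symmetric (ZCA-style) construction is one of several choices; the paper's exact construction may differ.

import numpy as np

def fit_whitening(embeddings, eps=1e-6):
    # embeddings: (n_samples, d) CLIP features from a reference set.
    mu = embeddings.mean(axis=0)
    cov = np.cov(embeddings, rowvar=False)
    # Symmetric (ZCA-style) whitening; eps guards against near-zero eigenvalues.
    eigvals, eigvecs = np.linalg.eigh(cov)
    W = eigvecs @ np.diag(1.0 / np.sqrt(eigvals + eps)) @ eigvecs.T    # whitening matrix
    W_inv = eigvecs @ np.diag(np.sqrt(eigvals + eps)) @ eigvecs.T      # inverse, so the map is invertible
    return mu, W, W_inv

def whiten(x, mu, W):
    # Map an embedding (or a batch of embeddings) into the whitened space.
    return (x - mu) @ W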

Likelihood Approximation

In the whitened space, embeddings are well approximated by a standard normal distribution, so the likelihood takes a simple form: a higher whitened norm implies a lower likelihood, and vice versa.
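
Concretely, under the standard-normal approximation the log-likelihood of an embedding is a constant minus half its squared whitened norm. A minimal sketch, building on the hypothetical helpers above:

def log_likelihood_surrogate(x, mu, W):
    # log p(x) ~ -0.5 * ||z||^2 - 0.5 * d * log(2*pi), where z is the whitened embedding.
    # The constant log|det W| term from the change of variables is omitted,
    # since it does not affect comparisons or rankings between samples.
    z = whiten(x, mu, W)
    d = z.shape[-1]
    return -0.5 * np.sum(z ** 2, axis=-1) - 0.5 * d * np.log(2.0 * np.pi)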

Statistical Validation

We empirically validate the Gaussian behavior using normality tests (e.g., Anderson–Darling and D’Agostino–Pearson) and report correlation measurements that support the likelihood interpretation.
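
For illustration, per-coordinate tests of this kind can be run with SciPy on the whitened embeddings; this is a generic sketch, not the paper's evaluation protocol.

from scipy import stats

def check_gaussianity(whitened, num_dims=5):
    # whitened: (n_samples, d) embeddings after whitening.
    for j in range(min(num_dims, whitened.shape[1])):
        coord = whitened[:, j]
        ad = stats.anderson(coord, dist='norm')    # Anderson-Darling test
        _, p = stats.normaltest(coord)             # D'Agostino-Pearson test
        print(f"dim {j}: AD statistic = {ad.statistic:.3f}, "
              f"D'Agostino-Pearson p-value = {p:.3f}")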

Applications

The resulting likelihood surrogate works for both images and captions, enabling practical diagnostics such as discovering artifacts and generative biases in image generators, ranking distribution shifts, and measuring text complexity.
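
As a toy usage sketch (variable names are illustrative), a batch of image or caption embeddings can be ranked by the surrogate to surface the least likely, most atypical samples:

# reference_embeddings, test_embeddings: precomputed CLIP features, (n, d) arrays.
mu, W, W_inv = fit_whitening(reference_embeddings)
scores = log_likelihood_surrogate(test_embeddings, mu, W)
ranking = np.argsort(scores)   # lowest-likelihood (most atypical) samples first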

BibTeX

@inproceedings{betser2025whitened,
  title         = {Whitened CLIP as a Likelihood Surrogate of Images and Captions},
  author        = {Betser, Roy and Levi, Meir Yossef and Gilboa, Guy},
  booktitle     = {Proceedings of the 42nd International Conference on Machine Learning (ICML)},
  year          = {2025},
  organization  = {PMLR},
}