Canonicalizing Multimodal Contrastive Representation Learning

* Equal contribution
MIT CSAIL · IIT Kanpur · TU Munich · Aalto University · YaiYai Ltd.




TLDR

We show that the map between any two independently trained multimodal contrastive models can be well approximated by an orthogonal map that is shared across modalities and learnable from only a few data points in a single modality (images or text).


Abstract

As models and data scale, independently trained networks often induce analogous notions of similarity. Yet, similarity-based measures are weaker than precise correspondence maps between distinct models. In this work, we show that the map between any two independently trained multimodal contrastive models (trained on different data, with different architectures and design choices) can be well approximated by a simple orthogonal map that is shared across modalities, i.e., $\tilde f(x) \approx Q f(x)$ and $\tilde g(y) \approx Q g(y)$, where $Q \in O(d)$ for models $(f, g)$ and $(\tilde f, \tilde g)$, images $x$, and text $y$. Further, we show that this map can be learned using only a few data points from a single modality (e.g., images) and transfers to text. Theoretically, we show that the agreement of the multimodal similarity kernel, $\langle f(x), g(y)\rangle \approx \langle \tilde f(x), \tilde g(y)\rangle$, on a small, finite set of points forces a shared orthogonal map $Q$ across modalities. Broadly, this finding enables backward-compatible model upgrades, avoiding costly re-embedding, and has implications for the privacy of learned representations.




Motivation and Intuition


Consider two multimodal contrastive models $\mathcal{M} = (f, g)$ and $\tilde{\mathcal{M}} = (\tilde{f}, \tilde{g})$, trained in complete isolation on different datasets, with different architectures, initializations, and modeling choices. Due to optimization stochasticity and training differences, the embedding spaces of $\mathcal M$ and $\tilde{\mathcal M}$ are a priori incomparable. Even with identical data, jointly rotating both embeddings by any orthogonal matrix leaves the loss and all within-model similarities unchanged. Architectural mismatch, finite-sample noise, and optimization effects further amplify this ambiguity; under distribution shift, the models may not even share the same population optimum. So we ask:

Key Question

Given two independently trained multimodal models, does a systematic geometric relationship exist between their embedding spaces? If so, what is its form, and how does it differ across modalities?


Despite the modality gap [1] and the disjoint supports of the two models, we argue that the alignment problem is indeed solvable because relative geometry is remarkably stable. While the absolute coordinates of the embedding cones [1,2] shift arbitrarily between models, the angular arrangement of the texts with respect to the images remains consistent. Mathematically, this means that the multimodal kernels are approximately preserved across models: $\langle f, g \rangle \approx \langle \tilde{f}, \tilde{g} \rangle$. Strikingly, this preservation of multimodal kernels is a sufficient condition to constrain the functional form of the map between the two models, forcing it to be an isometry.

Cross-modal kernels preserved across models
Figure: (a) Across CLIP variants, the multimodal kernel (relative angles between image and text embeddings) is strongly preserved (dashed lines); (b) CKA on multimodal kernels shows high alignment across models.
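To make this concrete, here is a minimal sketch (random arrays stand in for actual model outputs; in practice these would be, e.g., CLIP embeddings of the same image/text pairs) of how the agreement between two models' multimodal kernels can be measured:

```python
import numpy as np

def cross_modal_kernel(img_emb: np.ndarray, txt_emb: np.ndarray) -> np.ndarray:
    """Cosine similarities between image embeddings (rows) and text embeddings (rows)."""
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    return img @ txt.T

# Random stand-ins for embeddings of the same 128 image/text pairs under two models.
rng = np.random.default_rng(0)
img_A, txt_A = rng.normal(size=(128, 512)), rng.normal(size=(128, 512))
img_B, txt_B = rng.normal(size=(128, 768)), rng.normal(size=(128, 768))

K_A = cross_modal_kernel(img_A, txt_A)  # <f(x_i), g(y_j)> for model A
K_B = cross_modal_kernel(img_B, txt_B)  # <f~(x_i), g~(y_j)> for model B

# If relative geometry is preserved, the two kernels are highly correlated
# (near zero here, because the stand-ins are random).
agreement = np.corrcoef(K_A.ravel(), K_B.ravel())[0, 1]
print(f"Correlation between multimodal kernels: {agreement:.3f}")
```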




Theoretical Insights

Theoretically, we prove the following:

  1. Agreement of Multimodal Kernels. Under mild assumptions on the data curation process of the two models, the induced multimodal kernels can agree up to a constant factor.
  2. Identifiability of the Orthogonal Map. If these multimodal kernels agree on a sufficiently rich but small finite set of anchors across the two models, then there exists a single global orthogonal map that aligns the image representations across models, and the same map simultaneously aligns the text representations across models.
  3. Generalization of the Orthogonal Map. The above result also generalizes to settings where the multimodal kernels agree only approximately.
Theoretical insights
Figure: Theoretically, we show that if the multimodal kernels induced by two contrastive models agree on a sufficiently rich but small finite set of anchors, a single global orthogonal map aligns their representations across both modalities.
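For concreteness, point 2 can be paraphrased as follows (an informal restatement; the precise normalization and richness assumptions on the anchors are those stated in the paper):

$$
\langle f(x_i), g(y_j)\rangle = \langle \tilde f(x_i), \tilde g(y_j)\rangle \;\; \text{for all anchor pairs } (x_i, y_j)
\quad \Longrightarrow \quad
\exists\, Q \in O(d): \;\; \tilde f(x_i) = Q f(x_i) \;\text{ and }\; \tilde g(y_j) = Q g(y_j).
$$

Point 3 replaces the exact equalities with approximate agreement and bounds the resulting alignment error.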





Experimental Findings


We evaluate three independently trained vision-language model pairs: (i) CLIP ViT-B/32 (OpenAI) and CLIP ViT-B/32 trained on LAION-400M; (ii) CLIP ViT-L/14 (OpenAI) and SigLIP; and (iii) CLIP ViT-L/14 (OpenAI) and FLAVA, evaluated on Oxford-IIIT Pets, CIFAR-100, Caltech-101, STL-10, and DTD. We use the orthogonal Procrustes solution to estimate the orthogonal map $\mathcal Q$. In practice, the two models can differ by a constant offset in embedding space due to finite-sample effects, so we fit and apply $\mathcal Q$ on centered embeddings and then re-add the target mean. We evaluate alignment with pointwise image-image and text-text cosine similarity and cross-modal retrieval accuracy, reported below.

Across all experiments, we fit the map $\mathcal Q$ using only images, i.e., $\tilde{f}(x) \approx \mathcal{Q} f(x)$, and test it across both modalities.
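A minimal sketch of this fitting procedure (assuming NumPy/SciPy, precomputed image embeddings from both models with a shared dimension $d$, and illustrative variable names; this is not the paper's code):

```python
import numpy as np
from scipy.linalg import orthogonal_procrustes

def fit_orthogonal_map(src_img: np.ndarray, tgt_img: np.ndarray):
    """Fit Q in O(d) so that centered source embeddings map onto centered target embeddings."""
    mu_src, mu_tgt = src_img.mean(axis=0), tgt_img.mean(axis=0)
    Q, _ = orthogonal_procrustes(src_img - mu_src, tgt_img - mu_tgt)
    return Q, mu_src, mu_tgt

def apply_map(emb: np.ndarray, Q: np.ndarray, mu_src: np.ndarray, mu_tgt: np.ndarray):
    """Map embeddings from the source model's space into the target model's space."""
    return (emb - mu_src) @ Q + mu_tgt
```

The same $\mathcal Q$, fit on images only, is then reused when mapping the source model's text embeddings; the exact centering convention for text follows the paper's protocol, so treat the helpers above as a sketch.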


1. Independently Trained Contrastive Models Differ by an Orthogonal Map Common To Both Modalities


(a) An Orthogonal Map Aligns Different Models. We first observe that a single orthogonal map almost perfectly aligns image embeddings across distinct models: it substantially increases pointwise image-image cosine similarity while retaining high aligned-image-to-text accuracy.


(b) This Map Transfers Across Modalities. The same orthogonal map $\mathcal Q$, fit using only images, sharply improves text alignment, boosting pointwise text-text cosine similarity from near zero to $\sim 0.6 - 0.8$.

Finally, image-to-aligned-text retrieval remains strong, showing that $\mathcal Q$ preserves task-relevant geometry while eliminating any need to compute the second model's text embeddings.



2. Only a Few Data Points Are Needed to Learn the Orthogonal Map

Only a Few Data Points Are Needed to Learn the Orthogonal Map

Theoretically, we proved that if the multimodal kernels induced by two contrastive models agree on a sufficiently rich but small finite set of anchors, then a single global orthogonal map aligns their representations across both modalities. We empirically validate this by fitting $\mathcal Q$ using images from only $N$ classes and evaluating transfer on the remaining unseen classes. Performance on both seen and unseen classes improves quickly with just a few anchor classes and saturates around 10-15 classes, after which additional anchors provide little benefit. Thus, practitioners can recover near-full cross-model transfer by fitting $\mathcal Q$ on a lightweight image-only calibration set, rather than curating large-scale cross-model supervision.
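An illustrative sketch of this protocol (random arrays stand in for class-organized image embeddings; the class counts and split are illustrative, not the paper's exact setup):

```python
import numpy as np
from scipy.linalg import orthogonal_procrustes

rng = np.random.default_rng(0)
n_classes, per_class, d = 30, 20, 512
img_A = rng.normal(size=(n_classes, per_class, d))  # model A image embeddings, by class
img_B = rng.normal(size=(n_classes, per_class, d))  # model B image embeddings, by class

# Fit Q on images from the first N anchor classes only.
N = 10
src, tgt = img_A[:N].reshape(-1, d), img_B[:N].reshape(-1, d)
mu_src, mu_tgt = src.mean(axis=0), tgt.mean(axis=0)
Q, _ = orthogonal_procrustes(src - mu_src, tgt - mu_tgt)

# Evaluate pointwise cosine similarity on the held-out, unseen classes.
unseen_A = img_A[N:].reshape(-1, d)
unseen_B = img_B[N:].reshape(-1, d)
mapped = (unseen_A - mu_src) @ Q + mu_tgt
cos = np.sum(mapped * unseen_B, axis=1) / (
    np.linalg.norm(mapped, axis=1) * np.linalg.norm(unseen_B, axis=1))
print(f"Mean image-image cosine on unseen classes: {cos.mean():.3f}")
```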



3. Alternative Alignment Maps Than The Orthogonal Mapping

Alternative Alignment Maps Than The Orthogonal Mapping

Here, we ablate the alignment design by comparing three maps of increasing expressiveness: (i) an orthogonal map $Q$, (ii) a linear map, and (iii) a non-linear MLP. More expressive maps improve pointwise image-image cosine similarity. However, they transfer poorly to the secondary modality (text in this case) and fail to preserve image–text geometry. In contrast, the orthogonal map consistently performs best on both text-text cosine similarity and geometry-sensitive downstream metrics.
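For reference, the two simpler baselines in this ablation can be fit in closed form; the sketch below (random stand-ins for centered image embeddings) contrasts the orthogonal Procrustes solution with an unconstrained linear least-squares map, while the MLP variant would be trained with gradient descent and is omitted:

```python
import numpy as np
from scipy.linalg import orthogonal_procrustes

rng = np.random.default_rng(0)
src = rng.normal(size=(1000, 512))  # centered image embeddings from model A (stand-in)
tgt = rng.normal(size=(1000, 512))  # centered image embeddings from model B (stand-in)

# (i) Orthogonal map: constrained, preserves inner products and angles by construction.
Q, _ = orthogonal_procrustes(src, tgt)

# (ii) Unconstrained linear map: lower fitting error on images, but it can distort
# angles, which is what degrades transfer to the text modality.
W, *_ = np.linalg.lstsq(src, tgt, rcond=None)

for name, M in [("orthogonal", Q), ("linear", W)]:
    rel_err = np.linalg.norm(src @ M - tgt) / np.linalg.norm(tgt)
    print(f"{name:>10s} map, relative fit error: {rel_err:.3f}")
```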



4. Commuting Diagrams Across Models and Modalities

Both routes (the direct and text-mediated maps) yield highly consistent semantic neighborhoods: images retrieved via the text-mediated route closely match those obtained by direct image alignment. This indicates that $\mathcal Q$ approximately commutes with the cross-modal nearest-neighbor operators, allowing one to move across modalities and models while preserving semantic relationships.

Figure: Commuting diagram between Model A and Model B. Path 1 (direct): the map $\mathcal{Q}$ relates $x_A$ and $\tilde{x}_B$ directly. Path 2 (text-mediated): the composed map $f_B^{-1} \circ \mathcal{Q} \circ f_A$ routes through the text modality.
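The consistency of the two routes can be checked with a simple neighbor-overlap computation. The sketch below uses random stand-ins and a placeholder orthogonal map, and its text-mediated hop is one simple instantiation of the composed path, not the paper's exact procedure:

```python
import numpy as np

def nn_indices(queries: np.ndarray, keys: np.ndarray, k: int = 5) -> np.ndarray:
    """Indices of the k nearest keys (by cosine similarity) for each query."""
    q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    kk = keys / np.linalg.norm(keys, axis=1, keepdims=True)
    return np.argsort(-(q @ kk.T), axis=1)[:, :k]

rng = np.random.default_rng(0)
n, d = 200, 512
img_A, txt_A = rng.normal(size=(n, d)), rng.normal(size=(n, d))  # model A embeddings
img_B, txt_B = rng.normal(size=(n, d)), rng.normal(size=(n, d))  # model B embeddings
Q = np.linalg.qr(rng.normal(size=(d, d)))[0]  # placeholder orthogonal map A -> B

# Path 1 (direct): map A's images with Q, retrieve nearest images in model B's space.
direct = nn_indices(img_A @ Q, img_B)

# Path 2 (text-mediated): hop to each image's nearest text in A, map that text with Q,
# then retrieve nearest images in model B's space.
txt_hop = txt_A[nn_indices(img_A, txt_A, k=1)[:, 0]]
mediated = nn_indices(txt_hop @ Q, img_B)

# Overlap of the retrieved neighbor sets measures how well the diagram commutes.
overlap = np.mean([len(set(a) & set(b)) / len(a) for a, b in zip(direct, mediated)])
print(f"Mean neighbor overlap between the two routes: {overlap:.2f}")
```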




Discussion and Implications


Our results have several practical and scientific implications.

  1. Re-embedding is not necessary. In large embedding systems, switching models typically triggers full re-embedding, which is often infeasible at modern scale (billions of vectors) and costly in both time and compute. We show that a small anchor set suffices to recover the orthogonal map that restores compatibility across models. Since the map preserves inner products, it supports model upgrades without re-encoding while keeping the embedding geometry intact (see the sketch after this list).
  2. Mix-and-Match Models. Models often specialize differently; one might have a stronger vision tower, while another has a stronger or multilingual text tower. Our approach lets practitioners swap and combine towers while preserving image-text geometry.
  3. Privacy and Security. Many deployments cannot retain or share raw text (privacy, licensing, retention), yet they store embeddings. Aligning text representations without accessing text has key implications for governance and security. If embeddings across models and modalities are easily transformable, then stored embeddings may encode more transferable semantic information than anticipated, reinforcing the need to treat embeddings as sensitive artifacts.
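As an illustration of point 1, a minimal sketch of such a backward-compatible upgrade (random arrays stand in for a stored index and for old/new-model embeddings of a small anchor set; names are illustrative):

```python
import numpy as np
from scipy.linalg import orthogonal_procrustes

rng = np.random.default_rng(0)
d = 512
stored = rng.normal(size=(100_000, d))     # existing index built with the old model
anchors_old = rng.normal(size=(2_000, d))  # old-model embeddings of a small anchor set
anchors_new = rng.normal(size=(2_000, d))  # new-model embeddings of the same anchors

# Fit the orthogonal map on the anchors (centered), then migrate the entire index
# with a single matrix multiply instead of re-encoding the raw data.
mu_old, mu_new = anchors_old.mean(axis=0), anchors_new.mean(axis=0)
Q, _ = orthogonal_procrustes(anchors_old - mu_old, anchors_new - mu_new)
migrated = (stored - mu_old) @ Q + mu_new  # now queryable with the new model's encoder
```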




References

  1. Mind the gap: Understanding the modality gap in multi-modal contrastive representation learning. W. Liang, Y. Zhang, Y. Kwon, S. Yeung, and J. Zou, NeurIPS 2022.
  2. Understanding and Fixing the Modality Gap in Vision-Language Models. V. Udandarao, Master’s thesis, University of Cambridge, 2022.




To cite this work, please use the following bibtex:

@article{sharut2026canonicalizing,
    title={Canonicalizing Multimodal Contrastive Representation Learning},
    author={Gupta, Sharut and Kansal, Sanyam and Jegelka, Stefanie and 
        Isola, Phillip and Garg, Vikas},
    journal={arXiv preprint arXiv:2602.17584},
    year={2026}
}