LatentLens Connectors

This repository contains trained MLP connector weights for the LatentLens project.

LatentLens: Revealing Highly Interpretable Visual Tokens in LLMs
Benno Krojer, Shravan Nayak, Oscar Mañas, Vaibhav Adlakha, Desmond Elliott, Siva Reddy, Marius Mosbach.

Resources: Paper | Code | Demo

What are these?

These are the trained connector (MLP projector) weights that map vision encoder outputs to LLM embedding space. The LLM and vision encoder weights are not included — they should be loaded from their original sources (OLMo, LLaMA, Qwen, CLIP, DINOv2, SigLIP).

Available Connectors

| Connector | LLM | Vision Encoder | Size |
|---|---|---|---|
| olmo-vit | OLMo-7B | ViT-L/14-336 (CLIP) | 347 MB |
| olmo-dino | OLMo-7B | DINOv2-L-336 | 347 MB |
| olmo-siglip | OLMo-7B | SigLIP-L | 368 MB |
| llama-vit | LLaMA3-8B | ViT-L/14-336 (CLIP) | 451 MB |
| llama-dino | LLaMA3-8B | DINOv2-L-336 | 451 MB |
| llama-siglip | LLaMA3-8B | SigLIP-L | 479 MB |
| qwen-vit | Qwen2-7B | ViT-L/14-336 (CLIP) | 557 MB |
| qwen-dino | Qwen2-7B | DINOv2-L-336 | 557 MB |
| qwen-siglip | Qwen2-7B | SigLIP-L | 594 MB |
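For scripting against this repo, the naming scheme in the table can be encoded as a small lookup. The folder names come from the table above; the helper function itself is just an illustration, not part of the LatentLens API:

```python
# LLM and vision-encoder families, as listed in the connector table.
LLMS = {"olmo": "OLMo-7B", "llama": "LLaMA3-8B", "qwen": "Qwen2-7B"}
ENCODERS = {"vit": "ViT-L/14-336 (CLIP)", "dino": "DINOv2-L-336", "siglip": "SigLIP-L"}

def connector_file(llm: str, encoder: str) -> str:
    """Return the repo-relative path of a connector's weights (illustrative helper)."""
    if llm not in LLMS or encoder not in ENCODERS:
        raise ValueError(f"unknown combination: {llm}-{encoder}")
    return f"{llm}-{encoder}/connector.pt"

print(connector_file("olmo", "vit"))  # olmo-vit/connector.pt
```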

Usage

```python
import torch
from huggingface_hub import hf_hub_download

# Download a specific connector
connector_path = hf_hub_download(
    repo_id="McGill-NLP/latentlens-connectors",
    filename="olmo-vit/connector.pt"
)

# Load the weights (a PyTorch state dict)
connector_weights = torch.load(connector_path, map_location="cpu")
```

For full usage with the LatentLens library:

```python
from latentlens import LatentLens

model = LatentLens.load("olmo-vit")  # Downloads connector + base models automatically
results = model.analyze("image.jpg")
```

File Structure

Each connector folder contains:

  • connector.pt — Trained MLP weights (PyTorch state dict)
  • config.yaml — Training configuration (for reference)
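The `config.yaml` can be inspected with PyYAML to see what a connector was trained with. The keys shown below are hypothetical stand-ins; the real keys are whatever the training run wrote:

```python
import yaml

# Hypothetical contents of a connector's config.yaml; real field names may differ.
example = """
llm: OLMo-7B
vision_encoder: ViT-L/14-336
training_steps: 12000
effective_batch_size: 32
"""

config = yaml.safe_load(example)
print(config["llm"], config["training_steps"])  # OLMo-7B 12000
```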

Training Details

  • Training data: PixMo-Cap (image captioning)
  • Training: MLP-only (LLM and ViT frozen)
  • Steps: 12,000
  • Effective batch size: 32
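Effective batch size is the per-device batch size times gradient-accumulation steps times the number of devices. The particular split below is an assumption for illustration; only the product of 32 matches the training run:

```python
def effective_batch_size(per_device: int, grad_accum: int, num_devices: int) -> int:
    """Number of examples contributing to each optimizer step."""
    return per_device * grad_accum * num_devices

# One hypothetical configuration that yields the reported effective batch size:
print(effective_batch_size(per_device=4, grad_accum=2, num_devices=4))  # 32
```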

Citation

```bibtex
@article{krojer2026latentlens,
  title={LatentLens: Revealing Highly Interpretable Visual Tokens in LLMs},
  author={Krojer, Benno and Nayak, Shravan and Ma{\~n}as, Oscar and Adlakha, Vaibhav and Elliott, Desmond and Reddy, Siva and Mosbach, Marius},
  journal={arXiv preprint arXiv:2602.00462},
  year={2026}
}
```

License

Apache 2.0 (inherited from Molmo)
