LatentLens Connectors

This repository contains trained MLP connector weights for the LatentLens project.

LatentLens: Revealing Highly Interpretable Visual Tokens in LLMs
Benno Krojer, Shravan Nayak, Oscar Mañas, Vaibhav Adlakha, Desmond Elliott, Siva Reddy, Marius Mosbach.

Resources: Paper | Code | Demo

What are these?

These are the trained connector (MLP projector) weights that map vision encoder outputs to LLM embedding space. The LLM and vision encoder weights are not included — they should be loaded from their original sources (OLMo, LLaMA, Qwen, CLIP, DINOv2, SigLIP).

Available Connectors

| Connector | LLM | Vision Encoder | Size |
|---|---|---|---|
| olmo-vit | OLMo-7B | ViT-L/14-336 (CLIP) | 347 MB |
| olmo-dino | OLMo-7B | DINOv2-L-336 | 347 MB |
| olmo-siglip | OLMo-7B | SigLIP-L | 368 MB |
| llama-vit | LLaMA3-8B | ViT-L/14-336 (CLIP) | 451 MB |
| llama-dino | LLaMA3-8B | DINOv2-L-336 | 451 MB |
| llama-siglip | LLaMA3-8B | SigLIP-L | 479 MB |
| qwen-vit | Qwen2-7B | ViT-L/14-336 (CLIP) | 557 MB |
| qwen-dino | Qwen2-7B | DINOv2-L-336 | 557 MB |
| qwen-siglip | Qwen2-7B | SigLIP-L | 594 MB |
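For scripting against this repo, the naming scheme in the table can be encoded as a small lookup. The folder names come from the table above; the helper function itself is just an illustration, not part of the LatentLens API:

```python
# LLM and vision-encoder families, as listed in the connector table.
LLMS = {"olmo": "OLMo-7B", "llama": "LLaMA3-8B", "qwen": "Qwen2-7B"}
ENCODERS = {"vit": "ViT-L/14-336 (CLIP)", "dino": "DINOv2-L-336", "siglip": "SigLIP-L"}

def connector_file(llm: str, encoder: str) -> str:
    """Return the repo-relative path of a connector's weights (illustrative helper)."""
    if llm not in LLMS or encoder not in ENCODERS:
        raise ValueError(f"unknown combination: {llm}-{encoder}")
    return f"{llm}-{encoder}/connector.pt"

print(connector_file("olmo", "vit"))  # olmo-vit/connector.pt
```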

Usage

```python
import torch
from huggingface_hub import hf_hub_download

# Download a specific connector
connector_path = hf_hub_download(
    repo_id="McGill-NLP/latentlens-connectors",
    filename="olmo-vit/connector.pt"
)

# Load the weights (a PyTorch state dict)
connector_weights = torch.load(connector_path, map_location="cpu")
```

For full usage with the LatentLens library:

```python
from latentlens import LatentLens

model = LatentLens.load("olmo-vit")  # Downloads connector + base models automatically
results = model.analyze("image.jpg")
```

File Structure

Each connector folder contains:

  • connector.pt — Trained MLP weights (PyTorch state dict)
  • config.yaml — Training configuration (for reference)
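The `config.yaml` can be inspected with PyYAML to see what a connector was trained with. The keys shown below are hypothetical stand-ins; the real keys are whatever the training run wrote:

```python
import yaml

# Hypothetical contents of a connector's config.yaml; real field names may differ.
example = """
llm: OLMo-7B
vision_encoder: ViT-L/14-336
training_steps: 12000
effective_batch_size: 32
"""

config = yaml.safe_load(example)
print(config["llm"], config["training_steps"])  # OLMo-7B 12000
```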

Training Details

  • Training data: PixMo-Cap (image captioning)
  • Training: MLP-only (LLM and ViT frozen)
  • Steps: 12,000
  • Effective batch size: 32
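Effective batch size is the per-device batch size times gradient-accumulation steps times the number of devices. The particular split below is an assumption for illustration; only the product of 32 matches the training run:

```python
def effective_batch_size(per_device: int, grad_accum: int, num_devices: int) -> int:
    """Number of examples contributing to each optimizer step."""
    return per_device * grad_accum * num_devices

# One hypothetical configuration that yields the reported effective batch size:
print(effective_batch_size(per_device=4, grad_accum=2, num_devices=4))  # 32
```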

Citation

```bibtex
@article{krojer2026latentlens,
  title={LatentLens: Revealing Highly Interpretable Visual Tokens in LLMs},
  author={Krojer, Benno and Nayak, Shravan and Ma{\~n}as, Oscar and Adlakha, Vaibhav and Elliott, Desmond and Reddy, Siva and Mosbach, Marius},
  journal={arXiv preprint arXiv:2602.00462},
  year={2026}
}
```

License

Apache 2.0 (inherited from Molmo)
