---
license: apache-2.0
library_name: pytorch
pipeline_tag: sentence-similarity
tags:
  - sentence-transformers
  - multimodal
  - embeddings
  - retrieval
  - image-text
  - audio-text
  - text-image-audio
  - tri-encoder
  - semantic-router
  - pytorch
model-index:
  - name: multi-modal-embed-large
    results:
      - task:
          type: sentence-similarity
        dataset:
          name: Internal cached validation set
          type: cached_retrieval_validation
        metrics:
          - name: Eval loss
            type: eval_loss
            value: 0.389702
          - name: Eval top1
            type: eval_top1
            value: 0.861707
---

# multi-modal-embed-large

`multi-modal-embed-large` is the large production multimodal embedding model from the [llm-semantic-router](https://huggingface.co/llm-semantic-router) project. It is designed for routing, retrieval, and cross-modal matching across text, image, and audio rather than for generative chat. The model uses a tri-encoder architecture with separate text, image, and audio towers projected into one shared embedding space.

## Purpose

This release exists to provide a large multimodal embedding model for production systems where inputs may arrive as text, screenshots or images, and audio. It is built for semantic routing, multimodal retrieval, and cross-modal similarity.

## What Is In This Repository

This repository contains the minimum artifacts needed to load and run the exported model:

- `model.pt`: trained weights for the final exported model
- `config.json`: model configuration and encoder names
- `src/hf_st_mm/...`: the Python source package used to construct and run the tri-encoder
- `README.md`: this model card, including usage examples and a validation summary

This is not a generic Hugging Face Transformers checkpoint with a built-in auto-class loader. It is a packaged custom PyTorch model export.

## Advantages And Innovation

Most multimodal models are optimized for generation, captioning, or chat. This model is optimized for embeddings and operational use.

What is different here:

- maps text, image, and audio into one shared semantic space
- supports routing and retrieval instead of text generation
- preserves a strong multilingual text backbone
- uses stronger modality-specific encoders instead of forcing every modality into one monolithic checkpoint
- supports production training and evaluation on cached shard datasets

## Model Overview

This release packages the large routing-grade tri-encoder trained in PyTorch with the server training stack from this project.

Architecture:

- text encoder: `llm-semantic-router/mmbert-embed-32k-2d-matryoshka`
- image encoder: `google/siglip2-so400m-patch14-384`
- audio encoder: `openai/whisper-medium`
- shared embedding dimension: `768`
- max text length: `32768`

Training characteristics:

- objective: cached multiple negatives ranking loss
- training stack: PyTorch + Accelerate
- target hardware: AMD MI300X
- data pipeline: cached tensor shards with sequential shard loading and worker-local prefetch

## How To Use It

### Installation

```bash
pip install torch sentence-transformers transformers accelerate safetensors pillow librosa soundfile huggingface_hub
```

### Python Usage

The simplest way to use the model is to download the repository snapshot, load the packaged source code, and then encode one or more modality-tagged items.
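Before running the full example, it can be worth downloading the snapshot once and checking its top-level contents; the expected entries simply mirror the repository artifacts listed above (extra housekeeping files such as `.gitattributes` may also appear):

```python
import os

from huggingface_hub import snapshot_download

# Download the repository snapshot (or reuse the locally cached copy).
local_dir = snapshot_download(repo_id="llm-semantic-router/multi-modal-embed-large")

# Expect config.json, model.pt, the src/ package, and this README among the entries.
print(sorted(os.listdir(local_dir)))
```

With the snapshot in place, the example below builds the tri-encoder from the values in `config.json`, loads the weights from `model.pt`, and encodes a mixed batch of text, image, and audio items.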
```python
import json
import os
import sys

import torch
import torch.nn.functional as F
from huggingface_hub import snapshot_download

repo_id = "llm-semantic-router/multi-modal-embed-large"
local_dir = snapshot_download(repo_id=repo_id)
sys.path.insert(0, os.path.join(local_dir, "src"))

from hf_st_mm.data import PairItem
from hf_st_mm.model import MultiModalSentenceEmbedder

with open(os.path.join(local_dir, "config.json"), "r", encoding="utf-8") as handle:
    cfg = json.load(handle)

model = MultiModalSentenceEmbedder(
    text_encoder_name=cfg["model"]["text_encoder_name"],
    image_encoder_name=cfg["model"]["image_encoder_name"],
    audio_encoder_name=cfg["model"]["audio_encoder_name"],
    embedding_dim=int(cfg["model"]["embedding_dim"]),
    max_text_length=int(cfg["model"]["max_text_length"]),
)

state_dict = torch.load(os.path.join(local_dir, "model.pt"), map_location="cpu")
model.load_state_dict(state_dict)
model.eval()

items = [
    PairItem(modality="text", value="route this request to the billing team"),
    PairItem(modality="image", value="/path/to/screenshot.png"),
    PairItem(modality="audio", value="/path/to/call.wav"),
]

with torch.no_grad():
    embeddings = model.encode_items(items)

print(embeddings.shape)  # [3, 768]

# Cross-modal similarity between a text query and an audio candidate.
query = PairItem(modality="text", value="refund request for wrong charge")
candidate = PairItem(modality="audio", value="/path/to/refund_call.wav")

with torch.no_grad():
    embs = model.encode_items([query, candidate])

similarity = F.cosine_similarity(embs[0:1], embs[1:2]).item()
print(f"similarity={similarity:.4f}")
```

## Validation Snapshot

At upload time, the final export was evaluated with the repository's tri-encoder evaluator.

- `eval_loss`: `0.389702`
- `eval_top1`: `0.861707`

## Practical Notes

- Text inputs can be provided as raw strings or tokenized features.
- Image and audio inputs can be provided as file paths.
- Cached tensor payloads are supported by the training stack, but the simplest inference path is to use file paths or raw text.
- This release is intended for production retrieval and routing use cases rather than for instruction-following or caption generation; a minimal routing sketch closes this card.

## Limitations

- This is a custom tri-encoder export, not a standard Transformers auto-class package.
- Inference currently relies on the packaged `hf_st_mm` source code.
- The validation metrics reported here come from the repository's cached retrieval validation path, not from a public benchmark leaderboard.

## Training Code

Training and evaluation code live in the server training project that produced this checkpoint.

- trainer: `scripts/train_st_multimodal.py`
- evaluator: `scripts/evaluate_tri_encoder.py`
- model: `src/hf_st_mm/model.py`
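## Routing Sketch

To make the routing use case concrete, the sketch below embeds a small set of route descriptions, embeds an incoming item, and picks the closest route by cosine similarity. The route names and descriptions are illustrative placeholders, `model` and `PairItem` are the same objects loaded in the Python usage example above, and the explicit normalization is a safe default for cosine scoring rather than a documented property of `encode_items`.

```python
import torch
import torch.nn.functional as F

# Assumes `model` and `PairItem` are already loaded as in the Python usage example above.
# The route names and descriptions below are illustrative placeholders, not shipped routes.
routes = {
    "billing": "refunds, invoices, charge disputes, and payment questions",
    "tech_support": "bugs, outages, error messages, and troubleshooting",
    "sales": "pricing, upgrades, and new subscriptions",
}

route_items = [PairItem(modality="text", value=desc) for desc in routes.values()]
query = PairItem(modality="audio", value="/path/to/incoming_call.wav")

with torch.no_grad():
    embs = model.encode_items(route_items + [query])

# Normalize explicitly so the dot product below is a cosine similarity.
route_embs = F.normalize(embs[: len(routes)], dim=-1)
query_emb = F.normalize(embs[-1:], dim=-1)

scores = (query_emb @ route_embs.T).squeeze(0)
best_route = list(routes)[int(scores.argmax())]
print(best_route, [round(float(s), 4) for s in scores])
```

Because text, image, and audio all land in the same 768-dimensional space, the incoming item can just as well be a text string or an image path without changing the scoring code.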