VidEoMT: Your ViT is Secretly Also a Video Segmentation Model
Paper: arXiv:2602.17807
This repository contains the Hugging Face Transformers conversion of the official VidEoMT checkpoint `yt_2019_vit_small_52.8.pth` from tue-mps/VidEoMT.
| Metric | Value |
|---|---|
| AP | 52.8 |
| AR@10 | 62.2 |
| FPS | 294 |
The metrics above are the numbers reported by the authors in the official model zoo.
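As a rough aid for reading the FPS figure, the sketch below converts the reported throughput into a per-frame latency (about 3.4 ms). The 36-frame clip length is a hypothetical value chosen only for illustration, and the hardware, batch size, and precision behind the reported number are not stated here, so the result should not be read as a latency guarantee.

```python
def frame_latency_ms(fps: float) -> float:
    """Convert a throughput in frames per second to milliseconds per frame."""
    return 1000.0 / fps


reported_fps = 294   # FPS value from the table above
clip_frames = 36     # hypothetical clip length, for illustration only

per_frame_ms = frame_latency_ms(reported_fps)
clip_ms = clip_frames * per_frame_ms

print(f"{per_frame_ms:.2f} ms per frame, {clip_ms:.1f} ms for {clip_frames} frames")
```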
To load the video processor and the model:

```python
from transformers import AutoModelForUniversalSegmentation, AutoVideoProcessor

model_id = "tue-mps/videomt-dinov2-small-ytvis2019"
processor = AutoVideoProcessor.from_pretrained(model_id)
model = AutoModelForUniversalSegmentation.from_pretrained(model_id)
```
Use `processor.post_process_instance_segmentation`, `processor.post_process_panoptic_segmentation`, or `processor.post_process_semantic_segmentation`, depending on the target task.
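To give an intuition for what instance-level post-processing typically involves, here is a minimal pure-Python sketch. It is not the library's implementation. It assumes the model emits per-query class logits with a trailing "no object" class, plus per-query mask logits, as mask-classification models such as Mask2Former do. The function name `toy_instance_postprocess`, the 0.5 score threshold, and the toy inputs are all hypothetical, and the real processor methods may use different thresholds, resizing, or overlap handling.

```python
import math


def toy_instance_postprocess(class_logits, mask_logits, score_threshold=0.5):
    """Turn per-query logits into a list of instances (illustrative only).

    class_logits: one list per query, with num_classes + 1 values;
                  the last value is assumed to be the "no object" class.
    mask_logits:  one 2D list (height x width) of mask logits per query.
    """
    instances = []
    for query_classes, query_mask in zip(class_logits, mask_logits):
        # Numerically stable softmax over all classes, including "no object".
        peak = max(query_classes)
        exps = [math.exp(value - peak) for value in query_classes]
        total = sum(exps)
        probs = [e / total for e in exps]

        # Ignore the trailing "no object" entry when picking a label.
        object_probs = probs[:-1]
        score = max(object_probs)
        label = object_probs.index(score)
        if score < score_threshold:
            continue  # low-confidence query, treated as background

        # sigmoid(x) > 0.5 is equivalent to x > 0, so threshold the logits.
        mask = [[value > 0 for value in row] for row in query_mask]
        instances.append({"label": label, "score": score, "mask": mask})
    return instances


# Toy input: two queries, two object classes plus "no object", 2x2 frame.
class_logits = [[4.0, 0.0, 0.0], [0.0, 0.0, 4.0]]
mask_logits = [
    [[3.0, -3.0], [3.0, -3.0]],
    [[-1.0, -1.0], [-1.0, -1.0]],
]
instances = toy_instance_postprocess(class_logits, mask_logits)
print(len(instances), instances[0]["label"])
```

In this toy input the second query is dominated by the "no object" class, so only the first query survives the score threshold.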