---
language:
- en
tags:
- multimodal
- language
- vision
- image-search
- pytorch
license:
- mit
metrics:
- MRR
---
### Model Card: clip-imageclef

### Model Details

[OpenAI CLIP model](https://openai.com/blog/clip/) fine-tuned using image-caption pairs from the [Caption Prediction dataset](https://www.imageclef.org/2017/caption) provided for the ImageCLEF 2017 competition. The model was evaluated before and after fine-tuning using Mean Reciprocal Rank (MRR); MRR@10 was 0.57 before fine-tuning and 0.88 after.

### Model Date

September 6, 2021

### Model Type

The base model is the OpenAI CLIP model. It uses a ViT-B/32 Transformer architecture as an image encoder and a masked self-attention Transformer as a text encoder. These encoders are trained to maximize the similarity of (image, text) pairs via a contrastive loss.
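
The contrastive objective mentioned above can be illustrated with a small, dependency-free sketch. It assumes the commonly described CLIP formulation: a symmetric cross-entropy over an N x N matrix of image-text similarities, with matched pairs on the diagonal. The function names and the toy matrices are illustrative only and are not taken from the fine-tuning code.

```python
import math


def log_softmax(row):
    """Numerically stable log-softmax of a list of floats."""
    m = max(row)
    log_z = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - log_z for x in row]


def clip_contrastive_loss(similarity):
    """Symmetric cross-entropy over an N x N image-text similarity matrix.

    similarity[i][j] is the (temperature-scaled) similarity between image i
    and text j; matched pairs are assumed to sit on the diagonal.
    """
    n = len(similarity)
    # Image-to-text direction: each row should pick out its own column.
    loss_i2t = -sum(log_softmax(similarity[i])[i] for i in range(n)) / n
    # Text-to-image direction: same computation on the transposed matrix.
    columns = [list(col) for col in zip(*similarity)]
    loss_t2i = -sum(log_softmax(columns[j])[j] for j in range(n)) / n
    return (loss_i2t + loss_t2i) / 2


# Toy matrices (made up for illustration).
aligned = [[10.0, 0.0], [0.0, 10.0]]     # matched pairs score highest
misaligned = [[0.0, 10.0], [10.0, 0.0]]  # matched pairs score lowest
print(clip_contrastive_loss(aligned), clip_contrastive_loss(misaligned))
```

The loss is close to zero when each image is most similar to its own caption and grows when the matched pairs score below the mismatched ones, which is the behavior training pushes against.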
### Fine-tuning

The fine-tuning can be reproduced using code from the GitHub repository [elsevierlabs-os/clip-image-search](https://github.com/elsevierlabs-os/clip-image-search#fine-tuning).

### Usage

```python
from transformers import CLIPModel, CLIPProcessor

# Fine-tuned weights; the processor comes from the base CLIP checkpoint.
model = CLIPModel.from_pretrained("sujitpal/clip-imageclef")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
# captions: list of strings; images: list of PIL images (supplied by the caller).
inputs = processor(text=captions, images=images,
                   return_tensors="pt", padding=True)
output = model(**inputs)
```
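
To turn the model output into a ranking, the image-text scores can be converted to probabilities and sorted. The sketch below works on a plain list of scores so it runs without downloading the model; in practice such a row could come from `output.logits_per_image[0].tolist()`. The helper name `rank_captions`, the captions, and the scores are hypothetical.

```python
import math


def rank_captions(logits_row, captions):
    """Return (caption, probability) pairs sorted from best to worst match."""
    m = max(logits_row)
    exps = [math.exp(x - m) for x in logits_row]  # stable softmax
    total = sum(exps)
    probs = [e / total for e in exps]
    return sorted(zip(captions, probs), key=lambda pair: pair[1], reverse=True)


# Hypothetical scores for one image against three candidate captions.
captions = ["caption A", "caption B", "caption C"]
ranking = rank_captions([2.0, 5.0, 1.0], captions)
for caption, prob in ranking:
    print(f"{caption}: {prob:.3f}")
```

For text-to-image search the same helper could be applied to a row of `output.logits_per_text`, ranking images for a given caption instead.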
### Performance

MRR@k before and after fine-tuning:

| Model                       | k=1   | k=3   | k=5   | k=10  | k=20  |
| --------------------------- | ----- | ----- | ----- | ----- | ----- |
| zero-shot CLIP (baseline)   | 0.426 | 0.534 | 0.558 | 0.573 | 0.578 |
| clip-imageclef (this model) | 0.802 | 0.872 | 0.877 | 0.879 | 0.880 |
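
The card does not spell out the evaluation protocol, so the following is only a sketch of the standard MRR@k definition, assuming exactly one relevant item per query. The function name and the toy data are made up for illustration.

```python
def mrr_at_k(ranked_results, relevant_ids, k):
    """Mean Reciprocal Rank, counting only hits within the top k results.

    ranked_results[i] is the ranked list of result ids for query i and
    relevant_ids[i] is the single correct id for that query.
    """
    total = 0.0
    for results, relevant in zip(ranked_results, relevant_ids):
        for rank, result_id in enumerate(results[:k], start=1):
            if result_id == relevant:
                total += 1.0 / rank
                break  # only the first hit counts
    return total / len(ranked_results)


# Toy data: correct item at rank 1, at rank 3, and missing entirely.
ranked = [["a", "b", "c"], ["x", "y", "b"], ["p", "q", "r"]]
relevant = ["a", "b", "z"]
print(mrr_at_k(ranked, relevant, k=1))
print(mrr_at_k(ranked, relevant, k=3))
```

Under this definition the score can only stay the same or rise as k grows, since a larger cutoff admits hits at lower ranks. This matches the left-to-right trend in the table.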