zhoubolei/scene_parse_150
This model is a UPerNet semantic segmentation model with an ALiBi-ViT backbone (a Vision Transformer using Attention with Linear Biases, i.e. ALiBi, for positional information), trained on the ADE20K scene-parsing dataset.
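To illustrate the positional scheme named above: ALiBi replaces learned position embeddings with a per-head linear penalty on attention logits that grows with token distance. Below is a minimal NumPy sketch of the bidirectional variant (the form suited to ViT-style encoders); the head-slope sequence 2^(-8h/H) follows the original ALiBi formulation, and this is an illustration, not the exact code used to train this model.

```python
import numpy as np

def alibi_bias(seq_len, num_heads):
    """Build bidirectional ALiBi biases, one slope per attention head.

    The bias for a query/key pair (i, j) is -slope * |i - j|, so attention
    decays linearly with token distance.  Slopes follow the geometric
    sequence 2^(-8h/H) for heads h = 1..H.
    """
    pos = np.arange(seq_len)
    dist = np.abs(pos[:, None] - pos[None, :])            # |i - j|, shape (L, L)
    slopes = 2.0 ** (-8.0 * np.arange(1, num_heads + 1) / num_heads)
    return -slopes[:, None, None] * dist                  # shape (H, L, L)

bias = alibi_bias(seq_len=4, num_heads=2)
# The bias is added to the raw attention logits before the softmax:
#   attn = softmax(q @ k.T / sqrt(d) + bias[h])
```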
| Metric | Value |
|---|---|
| mIoU | 23.49% |
| mAcc | 32.01% |
| aAcc | 70.84% |
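For reference, the three metrics in the table can all be derived from a single confusion matrix over pixels: aAcc is overall pixel accuracy, mAcc averages per-class accuracy, and mIoU averages per-class intersection-over-union. A small self-contained NumPy sketch (an illustration, not the evaluation code used for this model):

```python
import numpy as np

def segmentation_metrics(pred, gt, num_classes):
    """Compute (aAcc, mAcc, mIoU) from flat integer label arrays."""
    pred = np.asarray(pred).ravel()
    gt = np.asarray(gt).ravel()
    # Confusion matrix: rows = ground truth class, cols = predicted class
    cm = np.bincount(gt * num_classes + pred,
                     minlength=num_classes ** 2).reshape(num_classes, num_classes)
    tp = np.diag(cm)                        # per-class true positives
    aacc = tp.sum() / cm.sum()              # overall pixel accuracy
    with np.errstate(invalid='ignore'):
        acc = tp / cm.sum(axis=1)           # per-class accuracy
        iou = tp / (cm.sum(axis=1) + cm.sum(axis=0) - tp)  # per-class IoU
    return aacc, np.nanmean(acc), np.nanmean(iou)

gt   = np.array([0, 0, 1, 1, 2, 2])
pred = np.array([0, 1, 1, 1, 2, 0])
aacc, macc, miou = segmentation_metrics(pred, gt, num_classes=3)
# aacc = 4/6, macc = 2/3, miou = 0.5 for this toy example
```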
```python
from mmseg.apis import init_model, inference_model

config_file = 'upernet_alibi_vit_tiny_512x512_ade20k.py'
checkpoint_file = 'best_mIoU_iter_40000.pth'

# Initialize the model from the config and trained weights
model = init_model(config_file, checkpoint_file, device='cuda:0')

# Inference on an image
result = inference_model(model, 'demo.jpg')
```
The full training configuration is given in the config file `upernet_alibi_vit_tiny_512x512_ade20k.py`.
If you use this model, please cite:
```bibtex
@misc{rope-vit-segmentation,
  author    = {VLG IITR},
  title     = {UPerNet with ALiBi-ViT for Semantic Segmentation},
  year      = {2026},
  publisher = {Hugging Face},
}
```
This model is released under the Apache 2.0 license.