Model Overview
This model is introduced in the paper:
PixelCraft: A Multi-Agent System for High-Fidelity Visual Reasoning on Structured Images
https://arxiv.org/abs/2509.25185
PixelCraft is a multi-agent framework designed for precise visual reasoning on structured images, with a focus on pixel-level grounding.
Intended Use
The model is intended for structured image understanding and grounding tasks, where accurate localization of visual elements is required to support downstream reasoning.
Inference
The reference inference implementation is provided in the PixelCraft repository.
The grounding and inference logic can be found at:
https://github.com/microsoft/PixelCraft/blob/main/src/tools/grounding.py
Please refer to this script for:
- Model loading
- Input preprocessing
- Grounding and inference execution
- Output formats
Users are expected to follow the provided implementation when running inference with this model.
Citation
If you find this work helpful in your research, please cite our paper:
@article{zhang2025pixelcraft,
title={PixelCraft: A Multi-Agent System for High-Fidelity Visual Reasoning on Structured Images},
author={Zhang, Shuoshuo and Li, Zijian and Zhang, Yizhen and Fu, Jingjing and Song, Lei and Bian, Jiang and Zhang, Jun and Yang, Yujiu and Wang, Rui},
journal={arXiv preprint arXiv:2509.25185},
year={2025}
}
- Downloads last month
- 25
Model tree for zss01/PixelCraft-3B
Base model
Qwen/Qwen2.5-VL-3B-Instruct