Model Overview

This model is introduced in the paper:

PixelCraft: A Multi-Agent System for High-Fidelity Visual Reasoning on Structured Images
https://arxiv.org/abs/2509.25185

PixelCraft is a multi-agent framework designed for precise visual reasoning on structured images, with a focus on pixel-level grounding.

Intended Use

The model is intended for structured image understanding and grounding tasks, where accurate localization of visual elements is required to support downstream reasoning.

Inference

The reference inference implementation is provided in the PixelCraft repository.

The grounding and inference logic can be found at:

https://github.com/microsoft/PixelCraft/blob/main/src/tools/grounding.py

Please refer to this script for:

Model loading
Input preprocessing
Grounding and inference execution
Output formats

Users are expected to follow the provided implementation when running inference with this model.

Citation

If you find this work helpful in your research, please cite our paper:

@article{zhang2025pixelcraft,
  title={PixelCraft: A Multi-Agent System for High-Fidelity Visual Reasoning on Structured Images},
  author={Zhang, Shuoshuo and Li, Zijian and Zhang, Yizhen and Fu, Jingjing and Song, Lei and Bian, Jiang and Zhang, Jun and Yang, Yujiu and Wang, Rui},
  journal={arXiv preprint arXiv:2509.25185},
  year={2025}
}

Downloads last month: 25

Safetensors

Model size

4B params

Tensor type

BF16

Inference Providers NEW

This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for zss01/PixelCraft-3B

Base model

Qwen/Qwen2.5-VL-3B-Instruct

Finetuned

(618)

this model

Quantizations

1 model