Model Overview

This model is introduced in the paper:

PixelCraft: A Multi-Agent System for High-Fidelity Visual Reasoning on Structured Images
https://arxiv.org/abs/2509.25185

PixelCraft is a multi-agent framework designed for precise visual reasoning on structured images, with a focus on pixel-level grounding.

Intended Use

The model is intended for structured image understanding and grounding tasks, where accurate localization of visual elements is required to support downstream reasoning.

Inference

The reference inference implementation, including the grounding logic, is provided in the PixelCraft repository at:

https://github.com/microsoft/PixelCraft/blob/main/src/tools/grounding.py

Please refer to this script for:

  • Model loading
  • Input preprocessing
  • Grounding and inference execution
  • Output formats

Follow the provided implementation when running inference with this model; it is the authoritative reference for the steps listed above.
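The reference script defines the actual preprocessing and output format, so consult it first. Purely as an illustration of the kind of post-processing that pixel-level grounding usually involves, the following is a minimal sketch that maps a bounding box from the resolution the model saw back to pixel coordinates in the original image. Everything in it is an assumption made for illustration: the helper name `rescale_box`, the `(x1, y1, x2, y2)` box layout, and the idea that the input image was resized before inference are hypothetical and are not taken from the PixelCraft repository.

```python
from typing import Tuple

# Hypothetical box layout for this sketch: (x1, y1, x2, y2).
Box = Tuple[float, float, float, float]


def rescale_box(
    box: Box,
    model_size: Tuple[int, int],
    image_size: Tuple[int, int],
) -> Tuple[int, int, int, int]:
    """Map a box from model-input coordinates to original-image pixels.

    model_size and image_size are (width, height) pairs. The result is
    rounded to whole pixels and clamped to the original image bounds.
    """
    model_w, model_h = model_size
    image_w, image_h = image_size
    if model_w <= 0 or model_h <= 0:
        raise ValueError("model_size must be positive in both dimensions")

    # Independent scale factors per axis, since resizing may not
    # preserve the aspect ratio.
    scale_x = image_w / model_w
    scale_y = image_h / model_h

    x1, y1, x2, y2 = box
    # Sort the corners so the output is always (left, top, right, bottom).
    left, right = sorted((x1 * scale_x, x2 * scale_x))
    top, bottom = sorted((y1 * scale_y, y2 * scale_y))

    def clamp(value: float, upper: int) -> int:
        # Round to the nearest pixel, then keep it inside [0, upper].
        return max(0, min(int(round(value)), upper))

    return (
        clamp(left, image_w),
        clamp(top, image_h),
        clamp(right, image_w),
        clamp(bottom, image_h),
    )
```

For example, under these assumptions a box predicted on a 500×500 model input can be mapped onto a 1000×2000 original image with `rescale_box((10, 20, 110, 220), (500, 500), (1000, 2000))`. If the reference script already returns coordinates in original-image pixels, no such step is needed.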

Citation

If you find this work helpful in your research, please cite our paper:

@article{zhang2025pixelcraft,
  title={PixelCraft: A Multi-Agent System for High-Fidelity Visual Reasoning on Structured Images},
  author={Zhang, Shuoshuo and Li, Zijian and Zhang, Yizhen and Fu, Jingjing and Song, Lei and Bian, Jiang and Zhang, Jun and Yang, Yujiu and Wang, Rui},
  journal={arXiv preprint arXiv:2509.25185},
  year={2025}
}