Abstract
Qianfan-OCR is a 4B-parameter vision-language model that unifies document parsing, layout analysis, and understanding while maintaining strong performance across multiple OCR benchmarks through its Layout-as-Thought mechanism.
We present Qianfan-OCR, a 4B-parameter end-to-end vision-language model that unifies document parsing, layout analysis, and document understanding within a single architecture. It performs direct image-to-Markdown conversion and supports diverse prompt-driven tasks including table extraction, chart understanding, document QA, and key information extraction. To address the loss of explicit layout analysis in end-to-end OCR, we propose Layout-as-Thought, an optional thinking phase triggered by special think tokens that generates structured layout representations -- bounding boxes, element types, and reading order -- before producing final outputs, recovering layout grounding capabilities while improving accuracy on complex layouts. Qianfan-OCR ranks first among end-to-end models on OmniDocBench v1.5 (93.12) and OlmOCR Bench (79.8), achieves competitive results on OCRBench, CCOCR, DocVQA, and ChartQA against general VLMs of comparable scale, and attains the highest average score on public key information extraction benchmarks, surpassing Gemini-3.1-Pro, Seed-2.0, and Qwen3-VL-235B. The model is publicly accessible via the Baidu AI Cloud Qianfan platform.
Community
What is the advantage of using this vs. something like PaddleOCR-VL-1.5, a 0.9B model that scores 94.5 on OmniDocBench v1.5 and is likely faster at processing due to its smaller size?
The one bit that actually changes how I think about end-to-end OCR here is the Layout-as-Thought module, where the model first emits a compact layout dump (bounding boxes, element types, and reading order) via think tokens before producing the answer. That separation acts like a small, explicit grounding step that can salvage complex layouts without adding latency when unused. I wonder how much of the gain comes from the structured prior vs. the language-model decoding; an ablation that generates the layout but skips the final answer, or generates the answer without the prior, would be revealing. The arXivLens breakdown did a solid job unpacking this, e.g. the think-token mechanism and the four-stage recipe: https://arxivlens.com/PaperView/Details/qianfan-ocr-a-unified-end-to-end-model-for-document-intelligence-5890-412528e9. Extending the think phase to a differentiable layout graph could improve cross-page consistency and reduce brittle reading-order assumptions.
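For concreteness, here is a minimal sketch of what a Layout-as-Thought dump might look like before the final Markdown is decoded. Only the three fields (bounding box, element type, reading order) come from the abstract; the `<think>`/`</think>` token names, the `LayoutElement` record, and the pipe-delimited serialization are assumptions for illustration, not the paper's actual format.

```python
from dataclasses import dataclass

@dataclass
class LayoutElement:
    # Hypothetical layout-dump record; the abstract names bounding boxes,
    # element types, and reading order as the think-phase contents.
    bbox: tuple      # (x0, y0, x1, y1) in image coordinates
    elem_type: str   # e.g. "title", "paragraph", "table", "figure"
    order: int       # position in the reading sequence

def render_layout_thought(elements):
    """Serialize a layout dump between assumed think tokens, as the model
    might emit it before producing the final Markdown output."""
    lines = [f"{e.order}|{e.elem_type}|{','.join(map(str, e.bbox))}"
             for e in sorted(elements, key=lambda e: e.order)]
    return "<think>\n" + "\n".join(lines) + "\n</think>"

elements = [
    LayoutElement((40, 60, 560, 110), "title", 0),
    LayoutElement((40, 130, 560, 420), "paragraph", 1),
    LayoutElement((40, 440, 560, 700), "table", 2),
]
print(render_layout_thought(elements))
```

Because the dump is plain text inside think tokens, it can simply be omitted at inference time when the grounding step isn't needed, which is presumably why the mechanism is optional.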
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- GLM-OCR Technical Report (2026)
- Dolphin-v2: Universal Document Parsing via Scalable Anchor Prompting (2026)
- OCRVerse: Towards Holistic OCR in End-to-End Vision-Language Models (2026)
- GutenOCR: A Grounded Vision-Language Front-End for Documents (2026)
- Youtu-Parsing: Perception, Structuring and Recognition via High-Parallelism Decoding (2026)
- FireRed-OCR Technical Report (2026)
- Multimodal OCR: Parse Anything from Documents (2026)