Qianfan-OCR - Baidu Qianfan's end-to-end intelligent document model

Qianfan-OCR is an end-to-end document intelligence model launched by Baidu Qianfan. Built on a 4B-parameter vision-language architecture, it integrates document parsing, layout analysis, text recognition, and semantic understanding. The model ranked first among end-to-end models on the OmniDocBench v1.5 benchmark with a score of 93.12. It implements explicit modeling of layout structure through the Layout-as-Thought mechanism and supports the understanding of complex tables and charts. The model is open source and can be deployed efficiently on a single A100 GPU.

The main functions of Qianfan-OCR

  • Document image parsing: extracts structured text content directly from scans or photos, with no preprocessing required.
  • Layout analysis and understanding: automatically identifies titles, paragraphs, tables, charts, and other elements in a document, along with their spatial relationships.
  • Text recognition: accurately converts printed or handwritten text in images into editable text.
  • Key information extraction: locates and extracts specific fields from complex documents, such as dates, amounts, and person names.
  • Chart reasoning: understands the numerical meaning and trends of visual content such as bar charts and line charts.
  • Multi-format output: generates structured data in formats such as Markdown, JSON, and HTML.
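
To make the functions above concrete, here is a hedged illustration of the kind of structured result such a model could return. The exact schema is not published; all field names below (`layout`, `fields`, `markdown`, etc.) are hypothetical.

```python
import json

# Hypothetical structured output for a parsed one-page invoice.
# The schema is illustrative only, not the actual Qianfan-OCR format.
parsed_page = {
    "layout": [  # layout analysis: element type, bounding box, reading order
        {"type": "title", "bbox": [40, 30, 560, 70], "reading_order": 0},
        {"type": "table", "bbox": [40, 120, 560, 400], "reading_order": 1},
    ],
    "fields": {  # key information extraction
        "date": "2025-01-15",
        "amount": "1,280.00",
        "payee": "ACME Ltd.",
    },
    # multi-format output: the same content rendered as Markdown
    "markdown": "# Invoice\n\n| Item | Qty | Price |\n|---|---|---|\n| Widget | 2 | 640.00 |",
}

print(json.dumps(parsed_page, ensure_ascii=False, indent=2))
```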

Technical principles of Qianfan-OCR

  • End-to-end unified architecture: Qianfan-OCR adopts a unified vision-language architecture, replacing the traditional “detection-recognition-understanding” multi-stage pipeline with an end-to-end approach. The model maps document images directly to structured output, avoiding the error accumulation and loss of visual information caused by staged processing.
  • Layout-as-Thought mechanism: to address the lack of explicit layout modeling in end-to-end models, the team introduced the Layout-as-Thought mechanism. Before producing the final result, a special token triggers a structural “thinking” stage: the model first generates layout information such as element positions, types, and reading order, then completes content parsing on top of this prior, combining structural perception and semantic understanding within a unified framework.
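
The two-stage idea behind Layout-as-Thought can be sketched as follows. This is toy code, not the actual model: the trigger token, function names, and returned values are all stand-ins chosen for illustration.

```python
# Toy sketch of Layout-as-Thought-style two-stage decoding.
# Stage 1 emits a layout plan (element type, box, reading order);
# stage 2 generates content element by element, conditioned on that plan.

LAYOUT_TRIGGER = "<layout>"  # assumed special token that starts the "thinking" stage

def plan_layout(image):
    """Stand-in for stage 1: predict elements and their reading order."""
    return [
        {"type": "paragraph", "bbox": (40, 90, 560, 300), "order": 1},
        {"type": "title", "bbox": (40, 30, 560, 70), "order": 0},
    ]

def generate_content(image, element):
    """Stand-in for stage 2: recognize the text inside one planned element."""
    return {
        "title": "# Quarterly Report",
        "paragraph": "Revenue grew 12% year over year.",
    }[element["type"]]

def parse_document(image):
    plan = plan_layout(image)  # explicit structural "thought" before any output
    ordered = sorted(plan, key=lambda e: e["order"])  # follow predicted reading order
    return "\n\n".join(generate_content(image, el) for el in ordered)

print(parse_document(image=None))
```

The point of the sketch is the ordering: content generation never starts until the structural plan exists, which is what distinguishes this from a plain image-to-text decoder.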

Key information and usage requirements of Qianfan-OCR

  • Model size: 4B-parameter vision-language architecture
  • Evaluation results: first among end-to-end models on OmniDocBench v1.5 (93.12 points), surpassing Gemini 3-Pro on the KIE (key information extraction) leaderboard
  • Core innovation: the Layout-as-Thought mechanism, enabling explicit modeling of layout structure
  • Deployment performance: 1.024 pages/second throughput on a single A100 GPU with W8A8 quantization
  • Open source status: model weights released on HuggingFace, with support for the Skills toolchain
  • Hardware environment: an NVIDIA A100 or equivalent GPU is recommended for inference deployment
  • Software dependencies: the vLLM inference framework, which supports W8A8 quantization to reduce GPU memory usage
  • Access method: called online through the Baidu Qianfan platform, or deployed privately from the open-source weights
  • Input format: common document image formats (PDF, PNG, JPG, etc.)
  • Output format: structured output such as Markdown, JSON, or HTML, configurable as needed
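
Since the input formats above are ordinary image files, preparing them for a JSON-based API typically means base64-encoding the raw bytes. This is a generic sketch of that step, not the published Qianfan request schema:

```python
import base64

def encode_image(data: bytes) -> str:
    """Base64-encode raw image bytes (PNG/JPG) for embedding in a JSON request."""
    return base64.b64encode(data).decode("ascii")

def decode_image(b64: str) -> bytes:
    """Reverse the encoding, e.g. on the server side."""
    return base64.b64decode(b64)

# Stand-in bytes; in practice you would read them with open("page.png", "rb").read()
fake_png = b"\x89PNG\r\n\x1a\n...truncated..."
b64 = encode_image(fake_png)
assert decode_image(b64) == fake_png  # round-trip is lossless
```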

Qianfan-OCR’s core advantages

  • Leading architecture: an end-to-end unified vision-language architecture replaces the traditional multi-stage pipeline, eliminating error accumulation between modules and greatly simplifying deployment and maintenance.
  • Layout understanding: the original Layout-as-Thought mechanism uses a special token to explicitly model the position, type, and reading order of document elements, significantly improving parsing accuracy in complex layouts.
  • Top performance: first among end-to-end models on OmniDocBench v1.5 with a score of 93.12, achieving 5 best results across 6 tasks including chart understanding.
  • Outstanding efficiency: a single A100 GPU with W8A8 quantization achieves a throughput of 1.024 pages/second, saving the cost of CPU-based detection and multi-model heterogeneous orchestration compared with traditional solutions.
  • Ready out of the box: supports online calls via the Baidu Qianfan platform and private deployment from the HuggingFace open-source weights, and provides a complete Skills toolchain and multi-format output capabilities.

How to use Qianfan-OCR

  • Online call: visit the Baidu Qianfan platform console, select the built-in Qianfan-OCR model in the model center, create an application to obtain an API key, then upload document images through the standard HTTP interface to obtain structured parsing results in real time.
  • Private deployment: download the open-source model weights from HuggingFace, install the vLLM inference framework and configure W8A8 quantization, start the model service on a server with an A100 GPU, and call it offline through a local API.
  • Toolchain integration: clone the official Skills repository on GitHub, build on the provided document-intelligence toolkit to embed the OCR capability into existing business systems, and customize output formats and batch-processing pipelines.
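
As an illustrative sketch of the online-call path, a request can be assembled as below and sent with any HTTP client. The endpoint URL, header names, and payload fields are assumptions for illustration, not the published Qianfan API.

```python
import base64
import json

API_URL = "https://example.qianfan.invalid/v1/ocr"  # placeholder, not a real endpoint

def build_request(image_bytes: bytes, api_key: str, output_format: str = "markdown") -> dict:
    """Assemble a hypothetical OCR request: base64-encoded image plus desired format."""
    return {
        "url": API_URL,
        "headers": {
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        "body": json.dumps({
            "image": base64.b64encode(image_bytes).decode("ascii"),
            "output_format": output_format,  # e.g. "markdown", "json", "html"
        }),
    }

req = build_request(b"\x89PNG...", api_key="YOUR_API_KEY", output_format="json")
# To actually send it: requests.post(req["url"], headers=req["headers"], data=req["body"])
print(req["headers"]["Content-Type"])
```

Separating request construction from sending, as here, also makes the integration easy to unit-test without network access.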

Comparison of Qianfan-OCR with competing products

| Dimension | Qianfan-OCR | GPT-4o | Gemini 3-Pro |
|---|---|---|---|
| Architecture | End-to-end unified vision-language architecture (4B parameters) | General multimodal large model | General multimodal large model |
| OmniDocBench v1.5 | 93.12 points (first among end-to-end models) | No published result | No published result |
| Layout analysis | Layout-as-Thought explicit modeling | Implicit understanding, no structured output | Implicit understanding, no structured output |
| Chart understanding | 5 best results out of 6 tasks | Strong general reasoning | Strong general reasoning |
| Deployment cost | Runs on a single A100 | Cloud API calls required | Cloud API calls required |
| Open source | Model weights, paper, and Skills all open source | Closed-source commercial API | Closed-source commercial API |
| Output format | Structured Markdown/JSON/HTML | Natural-language description | Natural-language description |

Application scenarios of Qianfan-OCR

  • Corporate document digitization: batch-processes scans of contracts, invoices, reports, etc., automatically extracting key fields and generating a structured database.
  • Financial document review: identifies amounts, dates, and account information in bank statements, insurance policies, and settlement statements, assisting risk control and compliance review.
  • Medical record management: parses symptoms, diagnoses, and medication records in handwritten or printed medical records for rapid electronic archiving and retrieval.
  • Academic paper processing: converts PDF documents to Markdown while preserving formulas, charts, and reference structure, facilitating knowledge-base construction.
  • Historical document restoration: recognizes text in low-quality images such as ancient books and old newspapers, assisting the digital preservation of cultural heritage.
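
For the document-digitization scenario above, here is a minimal sketch of loading extracted key fields into a structured database. The JSON shape and field names are hypothetical; only the ingestion pattern is the point.

```python
import sqlite3

# Hypothetical key-information-extraction results for two invoices.
results = [
    {"doc": "invoice_001.png", "date": "2025-01-15", "amount": 1280.00, "payee": "ACME Ltd."},
    {"doc": "invoice_002.png", "date": "2025-02-03", "amount": 99.50, "payee": "Globex Inc."},
]

conn = sqlite3.connect(":memory:")  # use a file path for a persistent database
conn.execute("CREATE TABLE invoices (doc TEXT, date TEXT, amount REAL, payee TEXT)")
# Named placeholders map directly onto the dict keys of each result.
conn.executemany("INSERT INTO invoices VALUES (:doc, :date, :amount, :payee)", results)

total = conn.execute("SELECT SUM(amount) FROM invoices").fetchone()[0]
print(f"{len(results)} documents ingested, total amount {total:.2f}")
```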