Qianfan-OCR - Baidu Qianfan's end-to-end intelligent document model

Qianfan-OCR is an end-to-end document intelligence model launched by Baidu Qianfan. Built on a 4B-parameter vision-language architecture, it integrates document parsing, layout analysis, text recognition, and semantic understanding. The model ranked first among end-to-end models on the OmniDocBench v1.5 benchmark with a score of 93.12. It implements explicit modeling of layout structure through the Layout-as-Thought mechanism and supports the understanding of complex tables and charts. The model is open source and can be deployed efficiently on a single A100 GPU.

The main functions of Qianfan-OCR

  • Document image parsing: extracts structured text content directly from scans or photos, with no preprocessing required.
  • Layout analysis and understanding: automatically identifies titles, paragraphs, tables, charts, and other elements in a document, along with their spatial relationships.
  • Text recognition: accurately converts printed or handwritten text in images into editable text.
  • Key information extraction: locates and extracts specific fields from complex documents, such as dates, amounts, and person names.
  • Chart reasoning: understands the numerical meaning and trends of visual content such as bar charts and line charts.
  • Multi-format output: generates structured data in formats such as Markdown, JSON, and HTML.
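
To make the functions above concrete, here is a hedged illustration of the kind of structured result such a model could return. The exact schema is not published; all field names below (`layout`, `fields`, `markdown`, etc.) are hypothetical.

```python
import json

# Hypothetical structured output for a parsed one-page invoice.
# The schema is illustrative only, not the actual Qianfan-OCR format.
parsed_page = {
    "layout": [  # layout analysis: element type, bounding box, reading order
        {"type": "title", "bbox": [40, 30, 560, 70], "reading_order": 0},
        {"type": "table", "bbox": [40, 120, 560, 400], "reading_order": 1},
    ],
    "fields": {  # key information extraction
        "date": "2025-01-15",
        "amount": "1,280.00",
        "payee": "ACME Ltd.",
    },
    # multi-format output: the same content rendered as Markdown
    "markdown": "# Invoice\n\n| Item | Qty | Price |\n|---|---|---|\n| Widget | 2 | 640.00 |",
}

print(json.dumps(parsed_page, ensure_ascii=False, indent=2))
```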

Technical principles of Qianfan-OCR

  • End-to-end unified architecture: Qianfan-OCR adopts a unified vision-language architecture, replacing the traditional “detection-recognition-understanding” multi-stage pipeline with an end-to-end approach. The model maps document images directly to structured output, avoiding the error accumulation and loss of visual information caused by staged processing.
  • Layout-as-Thought mechanism: to address the lack of explicit layout modeling in end-to-end models, the team introduced the Layout-as-Thought mechanism. Before producing the final result, a special token triggers a structural “thinking” stage: the model first generates layout information such as element positions, types, and reading order, then completes content parsing on top of this prior, combining structural perception and semantic understanding within a unified framework.
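
The two-stage idea behind Layout-as-Thought can be sketched as follows. This is toy code, not the actual model: the trigger token, function names, and returned values are all stand-ins chosen for illustration.

```python
# Toy sketch of Layout-as-Thought-style two-stage decoding.
# Stage 1 emits a layout plan (element type, box, reading order);
# stage 2 generates content element by element, conditioned on that plan.

LAYOUT_TRIGGER = "<layout>"  # assumed special token that starts the "thinking" stage

def plan_layout(image):
    """Stand-in for stage 1: predict elements and their reading order."""
    return [
        {"type": "paragraph", "bbox": (40, 90, 560, 300), "order": 1},
        {"type": "title", "bbox": (40, 30, 560, 70), "order": 0},
    ]

def generate_content(image, element):
    """Stand-in for stage 2: recognize the text inside one planned element."""
    return {
        "title": "# Quarterly Report",
        "paragraph": "Revenue grew 12% year over year.",
    }[element["type"]]

def parse_document(image):
    plan = plan_layout(image)  # explicit structural "thought" before any output
    ordered = sorted(plan, key=lambda e: e["order"])  # follow predicted reading order
    return "\n\n".join(generate_content(image, el) for el in ordered)

print(parse_document(image=None))
```

The point of the sketch is the ordering: content generation never starts until the structural plan exists, which is what distinguishes this from a plain image-to-text decoder.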

Key information and usage requirements of Qianfan-OCR

  • Model size: 4B-parameter vision-language architecture
  • Evaluation results: first among end-to-end models on OmniDocBench v1.5 (93.12 points), surpassing Gemini 3-Pro on the KIE (key information extraction) leaderboard
  • Core innovation: the Layout-as-Thought mechanism, enabling explicit modeling of layout structure
  • Deployment performance: 1.024 pages/second throughput on a single A100 GPU with W8A8 quantization
  • Open source status: model weights released on HuggingFace, with support for the Skills toolchain
  • Hardware environment: an NVIDIA A100 or equivalent GPU is recommended for inference deployment
  • Software dependencies: the vLLM inference framework, which supports W8A8 quantization to reduce GPU memory usage
  • Access method: called online through the Baidu Qianfan platform, or deployed privately from the open-source weights
  • Input format: common document image formats (PDF, PNG, JPG, etc.)
  • Output format: structured output such as Markdown, JSON, or HTML, configurable as needed
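
Since the input formats above are ordinary image files, preparing them for a JSON-based API typically means base64-encoding the raw bytes. This is a generic sketch of that step, not the published Qianfan request schema:

```python
import base64

def encode_image(data: bytes) -> str:
    """Base64-encode raw image bytes (PNG/JPG) for embedding in a JSON request."""
    return base64.b64encode(data).decode("ascii")

def decode_image(b64: str) -> bytes:
    """Reverse the encoding, e.g. on the server side."""
    return base64.b64decode(b64)

# Stand-in bytes; in practice you would read them with open("page.png", "rb").read()
fake_png = b"\x89PNG\r\n\x1a\n...truncated..."
b64 = encode_image(fake_png)
assert decode_image(b64) == fake_png  # round-trip is lossless
```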

Qianfan-OCR’s core advantages

  • Leading architecture: an end-to-end unified vision-language architecture replaces the traditional multi-stage pipeline, eliminating error accumulation between modules and greatly simplifying deployment and maintenance.
  • Layout understanding: the original Layout-as-Thought mechanism uses a special token to explicitly model the position, type, and reading order of document elements, significantly improving parsing accuracy in complex layouts.
  • Top performance: first among end-to-end models on OmniDocBench v1.5 with a score of 93.12, achieving 5 best results across 6 tasks including chart understanding.
  • Outstanding efficiency: a single A100 GPU with W8A8 quantization achieves a throughput of 1.024 pages/second, saving the cost of CPU-based detection and multi-model heterogeneous orchestration compared with traditional solutions.
  • Ready out of the box: supports online calls via the Baidu Qianfan platform and private deployment from the HuggingFace open-source weights, and provides a complete Skills toolchain and multi-format output capabilities.

How to use Qianfan-OCR

  • Online call: visit the Baidu Qianfan platform console, select the built-in Qianfan-OCR model in the model center, create an application to obtain an API key, then upload document images through the standard HTTP interface to obtain structured parsing results in real time.
  • Private deployment: download the open-source model weights from HuggingFace, install the vLLM inference framework and configure W8A8 quantization, start the model service on a server with an A100 GPU, and call it offline through a local API.
  • Toolchain integration: clone the official Skills repository on GitHub, build on the provided document-intelligence toolkit to embed the OCR capability into existing business systems, and customize output formats and batch-processing pipelines.
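
As an illustrative sketch of the online-call path, a request can be assembled as below and sent with any HTTP client. The endpoint URL, header names, and payload fields are assumptions for illustration, not the published Qianfan API.

```python
import base64
import json

API_URL = "https://example.qianfan.invalid/v1/ocr"  # placeholder, not a real endpoint

def build_request(image_bytes: bytes, api_key: str, output_format: str = "markdown") -> dict:
    """Assemble a hypothetical OCR request: base64-encoded image plus desired format."""
    return {
        "url": API_URL,
        "headers": {
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        "body": json.dumps({
            "image": base64.b64encode(image_bytes).decode("ascii"),
            "output_format": output_format,  # e.g. "markdown", "json", "html"
        }),
    }

req = build_request(b"\x89PNG...", api_key="YOUR_API_KEY", output_format="json")
# To actually send it: requests.post(req["url"], headers=req["headers"], data=req["body"])
print(req["headers"]["Content-Type"])
```

Separating request construction from sending, as here, also makes the integration easy to unit-test without network access.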

Comparison of Qianfan-OCR with competing products

| Dimension | Qianfan-OCR | GPT-4o | Gemini 3-Pro |
|---|---|---|---|
| Architecture | End-to-end unified vision-language architecture (4B parameters) | General multimodal large model | General multimodal large model |
| OmniDocBench v1.5 | 93.12 points (first among end-to-end models) | No published result | No published result |
| Layout analysis | Layout-as-Thought explicit modeling | Implicit understanding, no structured output | Implicit understanding, no structured output |
| Chart understanding | 5 best results out of 6 tasks | Strong general reasoning | Strong general reasoning |
| Deployment cost | Runs on a single A100 | Cloud API calls required | Cloud API calls required |
| Open source | Model weights, paper, and Skills all open source | Closed-source commercial API | Closed-source commercial API |
| Output format | Structured Markdown/JSON/HTML | Natural-language description | Natural-language description |

Application scenarios of Qianfan-OCR

  • Corporate document digitization: batch-processes scans of contracts, invoices, reports, etc., automatically extracting key fields and generating a structured database.
  • Financial document review: identifies amounts, dates, and account information in bank statements, insurance policies, and settlement statements, assisting risk control and compliance review.
  • Medical record management: parses symptoms, diagnoses, and medication records in handwritten or printed medical records for rapid electronic archiving and retrieval.
  • Academic paper processing: converts PDF documents to Markdown while preserving formulas, charts, and reference structure, facilitating knowledge-base construction.
  • Historical document restoration: recognizes text in low-quality images such as ancient books and old newspapers, assisting the digital preservation of cultural heritage.
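
For the document-digitization scenario above, here is a minimal sketch of loading extracted key fields into a structured database. The JSON shape and field names are hypothetical; only the ingestion pattern is the point.

```python
import sqlite3

# Hypothetical key-information-extraction results for two invoices.
results = [
    {"doc": "invoice_001.png", "date": "2025-01-15", "amount": 1280.00, "payee": "ACME Ltd."},
    {"doc": "invoice_002.png", "date": "2025-02-03", "amount": 99.50, "payee": "Globex Inc."},
]

conn = sqlite3.connect(":memory:")  # use a file path for a persistent database
conn.execute("CREATE TABLE invoices (doc TEXT, date TEXT, amount REAL, payee TEXT)")
# Named placeholders map directly onto the dict keys of each result.
conn.executemany("INSERT INTO invoices VALUES (:doc, :date, :amount, :payee)", results)

total = conn.execute("SELECT SUM(amount) FROM invoices").fetchone()[0]
print(f"{len(results)} documents ingested, total amount {total:.2f}")
```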