GLM-OCR - a lightweight multimodal OCR model open-sourced by Zhipu AI
GLM-OCR is a lightweight multimodal OCR model open-sourced by Zhipu AI. With only 0.9B parameters, it achieved a state-of-the-art score of 94.6 on the OmniDocBench V1.5 benchmark. Built on the GLM-V architecture, the model combines the in-house CogViT visual encoder with a lightweight cross-modal connection layer, and introduces a multi-token prediction (MTP) loss and reinforcement learning during training. It performs well in difficult scenarios such as handwriting, complex tables, code documents, seals, and mixed-language text. The model supports HTML table and structured JSON output, reaches an inference speed of 1.86 pages per second, is compatible with vLLM/SGLang/Ollama deployment, and suits business scenarios such as document parsing, bill extraction, and RAG.
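Because the model can be served through vLLM's OpenAI-compatible endpoint, a recognition request could be assembled roughly as in this sketch. The prompt wording and the served model name are assumptions for illustration, not confirmed API details:

```python
import base64
import json

def build_ocr_request(image_bytes: bytes,
                      prompt: str = "Recognize all text in this image.",
                      model: str = "zai-org/GLM-OCR") -> dict:
    """Assemble an OpenAI-style chat payload that embeds the image as a
    base64 data URI, the format vLLM's /v1/chat/completions endpoint
    accepts for vision models. Prompt and model name are illustrative."""
    data_uri = "data:image/png;base64," + base64.b64encode(image_bytes).decode("ascii")
    return {
        "model": model,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": data_uri}},
                {"type": "text", "text": prompt},
            ],
        }],
    }

# The serialized body would then be POSTed to the serving endpoint,
# e.g. http://localhost:8000/v1/chat/completions on a local vLLM server.
payload = build_ocr_request(b"\x89PNG...")
body = json.dumps(payload)
```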
Main functions of GLM-OCR
- Universal text recognition: supports photos, screenshots, scans, and PDFs, and recognizes printed text, handwriting, seals, code, and other special text.
- Complex table parsing: accurately understands merged cells, multi-level headers, and other structures, and outputs HTML directly with no need for secondary tabulation.
- Structured information extraction: intelligently extracts key fields from cards, bills, and forms, outputs standard JSON, and connects to business systems.
- Formula and code recognition: accurately identifies professional technical content such as mathematical formulas and program code.
- Multi-language and mixed-language support: handles vertical text, mixed multi-language layouts, and other complex layouts.
- Batch document processing: recognizes documents at scale, outputs regular, consistent formats, and provides a high-quality data foundation for RAG.
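As a sketch of how the structured JSON output described above might be consumed downstream, assuming a hypothetical bill-extraction response whose field names are made up for illustration:

```python
import json

# Hypothetical raw model output for a bill-extraction prompt; the field
# names here are illustrative, not GLM-OCR's actual output schema.
raw_output = '{"invoice_no": "INV-2024-001", "date": "2024-05-01", "total": "1280.00"}'

def parse_fields(raw: str, required: tuple) -> dict:
    """Parse the model's JSON string and verify that the fields a business
    system depends on are present before handing the record onward."""
    record = json.loads(raw)
    missing = [k for k in required if k not in record]
    if missing:
        raise ValueError(f"model output missing fields: {missing}")
    return record

record = parse_fields(raw_output, ("invoice_no", "date", "total"))
print(record["invoice_no"])  # INV-2024-001
```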
Technical principles of GLM-OCR
- Overall architecture: GLM-OCR adopts the classic encoder-decoder design inherited from the GLM-V series. The architecture consists of three core modules: a CogViT visual encoder (about 400M parameters) on the visual side, a lightweight connection layer responsible for cross-modal information fusion, and a GLM-0.5B language decoder at the back end.
- Visual encoding: The visual encoder uses Zhipu's in-house CogViT architecture, pre-trained at scale with a CLIP-style contrastive learning strategy on billions of image-text pairs. It has strong text detection and layout semantic understanding capabilities, and can handle challenges such as multi-column layouts, mixed text and graphics, and rotated text in complex documents.
- Cross-modal fusion: To fuse visual and language information efficiently, GLM-OCR uses a lightweight connection layer that integrates the SwiGLU activation and a 4x downsampling strategy. It filters and retains the key visual tokens, compressing high-density visual semantics before passing them to the back-end language decoder, which supports high-precision OCR output.
- Training optimization: GLM-OCR is the first to introduce a multi-token prediction (MTP) loss into OCR model training. By predicting several future tokens at once, MTP densifies the loss signal and significantly improves learning efficiency. Continuous, stable full-task reinforcement learning then further improves recognition accuracy and cross-domain generalization in complex document scenarios.
- Inference process: At the system level, GLM-OCR adopts the two-stage "layout analysis → parallel recognition" paradigm. It first performs document layout analysis with PP-DocLayout-V3 to accurately locate text, tables, figures, and other regions, then runs OCR on those regions in parallel, achieving stable, high-quality, and efficient parsing for documents with diverse layouts and complex structures.
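The two-stage "layout analysis → parallel recognition" flow can be sketched as below. The layout detector and region recognizer are stand-in stubs, since the actual PP-DocLayout-V3 and GLM-OCR calls are model-specific; only the pipeline shape is being illustrated:

```python
from concurrent.futures import ThreadPoolExecutor

def detect_layout(page):
    """Stage 1 stub: stand-in for PP-DocLayout-V3, which would return
    typed regions (text, table, figure, ...) with bounding boxes."""
    return [
        {"type": "text", "bbox": (0, 0, 100, 40)},
        {"type": "table", "bbox": (0, 50, 100, 120)},
    ]

def recognize_region(page, region):
    """Stage 2 stub: stand-in for running the OCR decoder on one cropped
    region; tables would come back as HTML, plain regions as text."""
    return f"<{region['type']} content from {region['bbox']}>"

def parse_page(page):
    """Run layout analysis once, then recognize all regions in parallel -
    the property behind the reported throughput of ~1.86 pages/s."""
    regions = detect_layout(page)
    with ThreadPoolExecutor() as pool:
        results = list(pool.map(lambda r: recognize_region(page, r), regions))
    # Reading order here is simply the detector's region order.
    return results

print(parse_page("page-1.png"))
```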
GLM-OCR project address
- GitHub repository: https://github.com/zai-org/GLM-OCR
- Hugging Face model: https://huggingface.co/zai-org/GLM-OCR
- Online demo: https://ocr.z.ai/
Application scenarios of GLM-OCR
- Education and research: accurately recognizes handwritten notes, mathematical formulas, academic papers, and scanned textbooks, supports complex typesetting and multi-language documents, and assists knowledge organization and academic research.
- Corporate office: automatically parses contracts, invoices, reimbursement forms, meeting minutes, and other documents, enabling digital archiving of paper documents and greatly improving data-entry efficiency.
- Finance and insurance: intelligently extracts key fields from bank cards, ID cards, insurance policies, and bills, outputs structured JSON, connects seamlessly to core business systems, and reduces manual review costs.
- Logistics and customs: quickly recognizes customs declarations, waybills, packing lists, and other professional documents, accurately extracting product information, consignor and consignee details, amounts, and other data to speed up customs clearance and settlement.
- Software development: accurately recognizes code screenshots, technical documents, and API manuals, supports multiple programming languages, and helps developers organize code snippets and build technical knowledge bases.