MiniCPM-o 4.5 - ModelBest's open-source full-duplex, full-modal model



MiniCPM-o 4.5 is ModelBest's open-source 9B-parameter full-modal flagship model. It adopts an end-to-end architecture that integrates SigLip2, Whisper, CosyVoice2, and Qwen3-8B. As the industry's first model to support "instant free conversation", it realizes full-duplex interaction: it can watch, listen, and speak at the same time, moving beyond the traditional turn-based "walkie-talkie" mode. The model offers leading visual understanding, highly human-like speech generation, and voice cloning. It supports proactive interaction and real-time streaming processing, can run on end-side devices, has been adapted to domestic chips such as Ascend and Hygon, and can be deployed efficiently through frameworks such as llama.cpp and vLLM.
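The vLLM deployment path mentioned above can be sketched with vLLM's generic serving CLI. This is a configuration sketch only: the Hugging Face repo id and flags below are assumptions, and the actual checkpoint id for the 4.5 release may differ.

```shell
# Sketch only: repo id "openbmb/MiniCPM-o-4_5" is an assumption, not verified.
pip install vllm

# Serve the model behind an OpenAI-compatible endpoint via vLLM's CLI:
vllm serve openbmb/MiniCPM-o-4_5 --trust-remote-code --port 8000
```

Once serving, any OpenAI-compatible client can send requests to `http://localhost:8000/v1`.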

Main features of MiniCPM-o 4.5

  • Full-duplex real-time interaction : The model processes visual and audio input while generating speech output, achieving parallel perception and expression - watching, listening, and speaking at the same time.
  • Proactive intelligent interaction : The model autonomously monitors environmental changes once per second, decides on its own when to speak, and supports human-like interactive behaviors such as unprompted reminders and real-time commentary.
  • Highly human-like speech synthesis : Supports end-to-end speech generation with rich emotion and natural timbre. A custom voice can be cloned from a few seconds of reference audio, and long-form speech synthesis stays stable and consistent.
  • Leading visual understanding : On the OpenCompass evaluation it surpasses GPT-4o and Gemini 2.0 Pro with only 9B parameters, and supports high-resolution image analysis and real-time understanding of high-frame-rate video.
  • End-to-end document parsing : Reaches the industry's best level on the OmniDocBench benchmark, efficiently handling understanding and structured extraction of complex-format documents.
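The once-per-second monitoring described above can be illustrated with a minimal polling loop. This is a sketch under assumed interfaces - `observe`, `decide`, and `speak` are hypothetical callbacks, not the model's actual API:

```python
import time
from typing import Callable, Optional

def proactive_loop(
    observe: Callable[[], dict],
    decide: Callable[[dict], Optional[str]],
    speak: Callable[[str], None],
    ticks: int = 5,
    interval: float = 1.0,
) -> int:
    """Observe the environment once per `interval` seconds and speak only
    when the decision function returns an utterance (1 Hz by default)."""
    spoken = 0
    for _ in range(ticks):
        event = observe()             # latest video/audio observation
        utterance = decide(event)     # None means stay silent this tick
        if utterance is not None:
            speak(utterance)
            spoken += 1
        time.sleep(interval)
    return spoken

# Simulated run: a person appears on the second tick only.
said = []
events = iter([{"person": False}, {"person": True}, {"person": False}])
n = proactive_loop(
    observe=lambda: next(events),
    decide=lambda e: "Welcome back!" if e["person"] else None,
    speak=said.append,
    ticks=3,
    interval=0.0,                     # no real waiting in the simulation
)
```

The key point is that the decision function runs every tick, so staying silent is itself an explicit per-second decision rather than the absence of a user prompt.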

Technical principles of MiniCPM-o 4.5

  • End-to-end full-modal architecture design : MiniCPM-o 4.5 connects the SigLip2 vision encoder, Whisper-medium audio encoder, CosyVoice2 speech decoder, and Qwen3-8B language model through dense feature connections and trains them jointly end to end. This tightly coupled design lets information from each modality flow freely inside the model, avoiding the information loss and error accumulation of traditional pipeline architectures and enabling more accurate multimodal understanding and generation control.
  • Full-duplex multimodal real-time streaming mechanism : The model converts offline modal codecs into online versions that support streaming input and output, and the speech decoder interleaves text and speech tokens to enable full-duplex generation. During inference, a time-division multiplexing mechanism splits the parallel multimodal data streams into sequential information groups within millisecond-scale time slices, so the language-model backbone can schedule and process them uniformly, completing synchronized perception of and response to real-time audio and video streams within a single architecture.
  • Proactive interaction decision-making mechanism : The language model continuously monitors the incoming video and audio streams and triggers speak-or-stay-silent decisions at 1 Hz. This high-frequency decision-making, combined with full-duplex operation, lets the model choose the most appropriate moment and content to reply based on dynamic changes in the environment, breaking the limitation of traditional models that passively wait for user instructions.
  • Configurable speech modeling design : The model extends the multimodal system-prompt paradigm, supporting both a text system prompt and an audio system prompt. The audio system prompt specifies the target voice characteristics, so at inference time a short reference audio sample is enough to achieve voice cloning and role-play.
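The time-division multiplexing idea in the bullets above - parallel modal streams sliced into sequential groups the backbone can consume in order - can be sketched as a toy interleaver. This is illustrative only and not the model's actual scheduler; token values and slice sizes are placeholders:

```python
def time_multiplex(streams: dict, slice_len: int) -> list:
    """Split parallel modal streams into sequential per-slice groups:
    within each time slice, emit one (modality, chunk) group per stream,
    in a fixed modality order, so a single backbone can process them."""
    order = sorted(streams)                        # fixed modality order
    horizon = max(len(s) for s in streams.values())
    groups = []
    for start in range(0, horizon, slice_len):
        for name in order:
            chunk = streams[name][start:start + slice_len]
            if chunk:                              # skip exhausted streams
                groups.append((name, chunk))
    return groups

# Toy streams: 4 audio tokens arriving alongside 2 video frames.
groups = time_multiplex(
    {"audio": [1, 2, 3, 4], "video": ["f0", "f1"]},
    slice_len=2,
)
# groups: [("audio", [1, 2]), ("video", ["f0", "f1"]), ("audio", [3, 4])]
```

Within each slice the groups stay time-aligned, which is what lets one sequential language model keep audio and video perception synchronized.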

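The dual text-plus-audio system prompt for voice cloning can be sketched as a message-construction helper. The message schema below is an assumption modeled on common multimodal chat formats, not the model's documented API:

```python
from typing import Optional

def build_system_messages(text_prompt: str, ref_audio: Optional[str] = None) -> list:
    """Compose a text system prompt plus an optional audio system prompt
    carrying the reference voice sample (schema is illustrative)."""
    content = [{"type": "text", "text": text_prompt}]
    if ref_audio is not None:
        # The audio part specifies the target timbre for voice cloning.
        content.append({"type": "audio", "audio": ref_audio})
    return [{"role": "system", "content": content}]

# Hypothetical usage: a few-second reference clip selects the voice.
messages = build_system_messages("Answer in the cloned voice.", "speaker_ref.wav")
```

The design choice worth noting is that the voice is configured declaratively at prompt time, so no fine-tuning is needed to switch speakers between requests.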
MiniCPM-o 4.5 project address

Application scenarios of MiniCPM-o 4.5

  • Intelligent assistant and companionship : As an all-round AI assistant, the model can perceive the user's environment and emotions in real time, proactively offer reminders, suggestions, or emotional companionship, and support personalized voice cloning for a tailored interactive experience.
  • Real-time video interaction : Suited to scenarios such as video surveillance analysis, live-stream commentary, and remote teaching and tutoring; the model can understand on-screen content and voice commands simultaneously and give instant spoken feedback.
  • Intelligent customer service and shopping guidance : Provides natural, fluent voice services in e-commerce, finance, government affairs, and other fields, supporting multi-turn dialogue and proactive recommendations to improve service experience and business conversion.
  • Education and training : Used for language learning, virtual teachers, skills training, and more, delivering immersive interactive teaching by combining visual demonstration with spoken explanation.
  • Content creation and entertainment : Supports audiobook generation, virtual-character dubbing, game NPC interaction, and more; the voice cloning feature can quickly reproduce a specific character's voice for role-play.