MiniCPM-o 4.5 - ModelBest's open-source full-duplex, full-modal model



MiniCPM-o 4.5 is ModelBest's open-source 9B-parameter full-modal flagship model. It adopts an end-to-end architecture that integrates SigLip2, Whisper, CosyVoice2, and Qwen3-8B. As the industry's first model to support "instant free conversation", it realizes full-duplex interaction: it can watch, listen, and speak at the same time, moving beyond the traditional turn-based "walkie-talkie" mode. The model offers leading visual understanding, highly human-like speech generation, and voice cloning. It supports proactive interaction and real-time streaming processing, can run on end-side devices, has been adapted to domestic chips such as Ascend and Hygon, and can be deployed efficiently through frameworks such as llama.cpp and vLLM.
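The vLLM deployment path mentioned above can be sketched with vLLM's generic serving CLI. This is a configuration sketch only: the Hugging Face repo id and flags below are assumptions, and the actual checkpoint id for the 4.5 release may differ.

```shell
# Sketch only: repo id "openbmb/MiniCPM-o-4_5" is an assumption, not verified.
pip install vllm

# Serve the model behind an OpenAI-compatible endpoint via vLLM's CLI:
vllm serve openbmb/MiniCPM-o-4_5 --trust-remote-code --port 8000
```

Once serving, any OpenAI-compatible client can send requests to `http://localhost:8000/v1`.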

Main features of MiniCPM-o 4.5

  • Full-duplex real-time interaction : The model processes visual and audio input while generating speech output, achieving parallel perception and expression - watching, listening, and speaking at the same time.
  • Proactive intelligent interaction : The model autonomously monitors environmental changes once per second, decides on its own when to speak, and supports human-like interactive behaviors such as unprompted reminders and real-time commentary.
  • Highly human-like speech synthesis : Supports end-to-end speech generation with rich emotion and natural timbre. A custom voice can be cloned from a few seconds of reference audio, and long-form speech synthesis stays stable and consistent.
  • Leading visual understanding : On the OpenCompass evaluation it surpasses GPT-4o and Gemini 2.0 Pro with only 9B parameters, and supports high-resolution image analysis and real-time understanding of high-frame-rate video.
  • End-to-end document parsing : Reaches the industry's best level on the OmniDocBench benchmark, efficiently handling understanding and structured extraction of complex-format documents.
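The once-per-second monitoring described above can be illustrated with a minimal polling loop. This is a sketch under assumed interfaces - `observe`, `decide`, and `speak` are hypothetical callbacks, not the model's actual API:

```python
import time
from typing import Callable, Optional

def proactive_loop(
    observe: Callable[[], dict],
    decide: Callable[[dict], Optional[str]],
    speak: Callable[[str], None],
    ticks: int = 5,
    interval: float = 1.0,
) -> int:
    """Observe the environment once per `interval` seconds and speak only
    when the decision function returns an utterance (1 Hz by default)."""
    spoken = 0
    for _ in range(ticks):
        event = observe()             # latest video/audio observation
        utterance = decide(event)     # None means stay silent this tick
        if utterance is not None:
            speak(utterance)
            spoken += 1
        time.sleep(interval)
    return spoken

# Simulated run: a person appears on the second tick only.
said = []
events = iter([{"person": False}, {"person": True}, {"person": False}])
n = proactive_loop(
    observe=lambda: next(events),
    decide=lambda e: "Welcome back!" if e["person"] else None,
    speak=said.append,
    ticks=3,
    interval=0.0,                     # no real waiting in the simulation
)
```

The key point is that the decision function runs every tick, so staying silent is itself an explicit per-second decision rather than the absence of a user prompt.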

Technical principles of MiniCPM-o 4.5

  • End-to-end full-modal architecture design : MiniCPM-o 4.5 connects the SigLip2 vision encoder, Whisper-medium audio encoder, CosyVoice2 speech decoder, and Qwen3-8B language model through dense feature connections and trains them jointly end to end. This tightly coupled design lets information from each modality flow freely inside the model, avoiding the information loss and error accumulation of traditional pipeline architectures and enabling more accurate multimodal understanding and generation control.
  • Full-duplex multimodal real-time streaming mechanism : The model converts offline modal codecs into online versions that support streaming input and output, and the speech decoder interleaves text and speech tokens to enable full-duplex generation. During inference, a time-division multiplexing mechanism splits the parallel multimodal data streams into sequential information groups within millisecond-scale time slices, so the language-model backbone can schedule and process them uniformly, completing synchronized perception of and response to real-time audio and video streams within a single architecture.
  • Proactive interaction decision-making mechanism : The language model continuously monitors the incoming video and audio streams and triggers speak-or-stay-silent decisions at 1 Hz. This high-frequency decision-making, combined with full-duplex operation, lets the model choose the most appropriate moment and content to reply based on dynamic changes in the environment, breaking the limitation of traditional models that passively wait for user instructions.
  • Configurable speech modeling design : The model extends the multimodal system-prompt paradigm, supporting both a text system prompt and an audio system prompt. The audio system prompt specifies the target voice characteristics, so at inference time a short reference audio sample is enough to achieve voice cloning and role-play.
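The time-division multiplexing idea in the bullets above - parallel modal streams sliced into sequential groups the backbone can consume in order - can be sketched as a toy interleaver. This is illustrative only and not the model's actual scheduler; token values and slice sizes are placeholders:

```python
def time_multiplex(streams: dict, slice_len: int) -> list:
    """Split parallel modal streams into sequential per-slice groups:
    within each time slice, emit one (modality, chunk) group per stream,
    in a fixed modality order, so a single backbone can process them."""
    order = sorted(streams)                        # fixed modality order
    horizon = max(len(s) for s in streams.values())
    groups = []
    for start in range(0, horizon, slice_len):
        for name in order:
            chunk = streams[name][start:start + slice_len]
            if chunk:                              # skip exhausted streams
                groups.append((name, chunk))
    return groups

# Toy streams: 4 audio tokens arriving alongside 2 video frames.
groups = time_multiplex(
    {"audio": [1, 2, 3, 4], "video": ["f0", "f1"]},
    slice_len=2,
)
# groups: [("audio", [1, 2]), ("video", ["f0", "f1"]), ("audio", [3, 4])]
```

Within each slice the groups stay time-aligned, which is what lets one sequential language model keep audio and video perception synchronized.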

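The dual text-plus-audio system prompt for voice cloning can be sketched as a message-construction helper. The message schema below is an assumption modeled on common multimodal chat formats, not the model's documented API:

```python
from typing import Optional

def build_system_messages(text_prompt: str, ref_audio: Optional[str] = None) -> list:
    """Compose a text system prompt plus an optional audio system prompt
    carrying the reference voice sample (schema is illustrative)."""
    content = [{"type": "text", "text": text_prompt}]
    if ref_audio is not None:
        # The audio part specifies the target timbre for voice cloning.
        content.append({"type": "audio", "audio": ref_audio})
    return [{"role": "system", "content": content}]

# Hypothetical usage: a few-second reference clip selects the voice.
messages = build_system_messages("Answer in the cloned voice.", "speaker_ref.wav")
```

The design choice worth noting is that the voice is configured declaratively at prompt time, so no fine-tuning is needed to switch speakers between requests.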
MiniCPM-o 4.5 project address

Application scenarios of MiniCPM-o 4.5

  • Intelligent assistant and companionship : As an all-round AI assistant, the model can perceive the user's environment and emotions in real time, proactively offer reminders, suggestions, or emotional companionship, and support personalized voice cloning for a tailored interactive experience.
  • Real-time video interaction : Suited to scenarios such as video surveillance analysis, live-stream commentary, and remote teaching and tutoring; the model can understand on-screen content and voice commands simultaneously and give instant spoken feedback.
  • Intelligent customer service and shopping guidance : Provides natural, fluent voice services in e-commerce, finance, government affairs, and other fields, supporting multi-turn dialogue and proactive recommendations to improve service experience and business conversion.
  • Education and training : Used for language learning, virtual teachers, skills training, and more, delivering immersive interactive teaching by combining visual demonstration with spoken explanation.
  • Content creation and entertainment : Supports audiobook generation, virtual-character dubbing, game NPC interaction, and more; the voice cloning feature can quickly reproduce a specific character's voice for role-play.