Xiaomi MiMo-V2-Omni - Xiaomi's All-Modal Agent Base Model

Xiaomi MiMo-V2-Omni is a multi-modal agent foundation model launched by Xiaomi, integrating text, vision, and speech modalities, and natively possessing perception, reasoning, and execution capabilities. The model supports tool invocation, GUI operation, and autonomous planning for complex tasks, and in evaluations of audio understanding and image reasoning it rivals Gemini 3 Pro and Claude Opus 4.6. The model, previously tested anonymously under the codename "Healer Alpha," topped the OpenRouter invocation leaderboard and has now become a core component of Xiaomi's approach to the agent era.

Main features of Xiaomi MiMo-V2-Omni

  • Full-modal perception: The model integrates text, vision, and audio, enabling image understanding, video analysis, processing of 10+ hours of audio, and cross-modal joint reasoning.
  • Agent execution capability: Natively supports tool invocation, GUI operation, and autonomous task planning; it can formulate strategies, self-correct in real time, and deliver complete results end to end.
  • Complex scene applications: Covers real digital-environment interaction tasks such as web browsing, code engineering, and front-end development.
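To make the tool-invocation feature above concrete, the sketch below shows the OpenAI-style "function calling" pattern that agent models of this kind typically expose: the developer declares a tool schema, the model emits a call, and the host dispatches it. The tool name `browse_web`, its parameters, and the dispatch logic are all hypothetical illustrations, not taken from Xiaomi's documentation.

```python
import json

# Hypothetical tool definition in the OpenAI-compatible function-calling
# style commonly used by agent frameworks; the name and parameters are
# illustrative, not from Xiaomi's API docs.
browse_tool = {
    "type": "function",
    "function": {
        "name": "browse_web",
        "description": "Open a URL and return the page text for the agent to read.",
        "parameters": {
            "type": "object",
            "properties": {
                "url": {"type": "string", "description": "Absolute URL to open."},
            },
            "required": ["url"],
        },
    },
}

def dispatch(tool_call: dict) -> str:
    """Route a model-emitted tool call to a local handler (stubbed here)."""
    args = json.loads(tool_call["arguments"])
    if tool_call["name"] == "browse_web":
        return f"fetched {args['url']}"  # a real handler would fetch the page
    raise ValueError(f"unknown tool: {tool_call['name']}")

# A tool call shaped the way the model might emit it during planning.
result = dispatch({"name": "browse_web", "arguments": '{"url": "https://example.com"}'})
```

In a real loop, the tool result would be appended to the conversation and sent back so the model can plan its next step.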

Technical principles of Xiaomi MiMo-V2-Omni

  • Unified full-modal architecture: The base model is built from the ground up to integrate text, vision, and speech, achieving native multi-modal representation through a unified encoder and fusion layer rather than post-hoc modality stitching.
  • Deep perception-action binding: Breaking the traditional pattern of "strong understanding, weak execution," end-to-end training couples perception with action capabilities such as tool invocation and GUI operation, achieving the leap from understanding to control.
  • Video pre-training and long context: Innovative video pre-training methods enable joint audio-video understanding, and ultra-long context modeling provides a structural advantage for complex agent tasks.

Key information and usage requirements of Xiaomi MiMo-V2-Omni

  • Publisher: Xiaomi technical team
  • Release date: March 19, 2026
  • Internal beta codename: Healer Alpha (listed anonymously on OpenRouter)
  • Model type: Full-modal fusion architecture (text + vision + audio)
  • Context window: Supports long-sequence modeling (the Pro version in the same series reaches up to 1M)
  • Benchmark ranking: First on PinchBench average; top of the OpenRouter call leaderboard
  • Access method: Via platform APIs such as OpenRouter; connects seamlessly to mainstream agent frameworks such as OpenClaw
  • Hardware/environment: Cloud deployment, no local setup required; supports multi-modal input (images, videos, audio files, or streams)
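Since access is through OpenRouter-style gateways, a multi-modal request would likely follow the OpenAI-compatible chat format those gateways accept. The sketch below builds such a request body with text, image, and audio parts; the model identifier `xiaomi/mimo-v2-omni` and the exact field layout are assumptions, not confirmed by Xiaomi's documentation.

```python
import json

# Assumed OpenAI-compatible multi-modal chat payload; the model id and
# content-part shapes are illustrative, not from Xiaomi's docs.
payload = {
    "model": "xiaomi/mimo-v2-omni",
    "messages": [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Summarize what is said and shown."},
                {"type": "image_url",
                 "image_url": {"url": "https://example.com/frame.jpg"}},
                {"type": "input_audio",
                 "input_audio": {"data": "<base64-encoded-wav>", "format": "wav"}},
            ],
        }
    ],
}

# This JSON body would be POSTed to the gateway's chat-completions endpoint.
body = json.dumps(payload)
```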

Core advantages of Xiaomi MiMo-V2-Omni

  • Full-modal native fusion: A unified text-vision-audio architecture built from the ground up delivers true cross-modal understanding and joint reasoning, not simple stitching.
  • Perception-action integration: Tool calling, GUI operation, and related capabilities are natively internalized, breaking the "strong understanding, weak execution" limitation and compounding into "perceive more accurately, act more effectively."
  • Ultra-long context support: Million-token context windows give the model a structural advantage on long videos, long audio, and complex multi-round agent tasks.
  • Real-world validation: After anonymous internal testing as Healer Alpha, it topped OpenRouter call volume and ranked first on PinchBench, validated by both the market and benchmarks.
  • Seamless ecosystem access: Quickly integrates with OpenClaw and other mainstream agent frameworks, significantly lowering the barrier to deploying full-modal agents.

How to use Xiaomi MiMo-V2-Omni

Developers can visit https://platform.xiaomimimo.com to register and obtain an API key, then call the API at the listed pricing (input $0.4 per million tokens, output $2 per million tokens).
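The published per-million-token prices make request costs easy to estimate. The helper below is a minimal sketch using only the two rates quoted above; the example token counts are hypothetical.

```python
# Published pricing: $0.4 per million input tokens, $2 per million output tokens.
PRICE_IN_PER_M = 0.4
PRICE_OUT_PER_M = 2.0

def estimate_cost_usd(input_tokens: int, output_tokens: int) -> float:
    """Return the USD cost of one request at the listed per-million-token rates."""
    return (input_tokens / 1_000_000) * PRICE_IN_PER_M + \
           (output_tokens / 1_000_000) * PRICE_OUT_PER_M

# Hypothetical example: a 200K-token transcript in, a 4K-token summary out.
cost = estimate_cost_usd(200_000, 4_000)  # 0.08 + 0.008 = 0.088 USD
```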

Comparison of similar competing products of Xiaomi MiMo-V2-Omni

| Evaluation dimension | MiMo-V2-Omni | Gemini 3 Pro | Claude Opus 4.6 |
|---|---|---|---|
| MMAU-Pro (audio understanding) | 69.4 | 67.0 | N/A |
| MMMU-Pro (image understanding) | 76.8 | 81.0 | 73.9 |
| Video-MME (video understanding) | 85.3 | 88.4 | N/A |
| CharXiv RQ (chart understanding) | 80.1 | 81.4 | 77.4 |
| FutureOmni (future prediction) | 66.7 | 62.9 | 60.3 |
| MM-BrowserComp (web browsing) | 52.0 | 37.2 | 59.3 |
| OmniGAIA (multimodal awareness) | 49.8 | 62.5 | 59.7 |
| Claw Eval (complex interaction) | 54.8 | 51.9 | 66.3 |
| PinchBench (agent comprehensive) | 85.6 | 75.0 | 86.3 |

Application scenarios of Xiaomi MiMo-V2-Omni

  • Multimodal content understanding: The model supports analysis of 10+ hours of video, complex chart analysis, and cross-modal information-association reasoning, achieving joint in-depth understanding of audio and video.
  • Agent task execution: The model can independently complete tasks such as web browsing, code engineering, and front-end development, and can generate well-designed, fully functional web pages zero-shot.
  • GUI automation: Directly controls the graphical interface, supporting strategic planning, real-time correction, and autonomous tool-chain invocation across multiple dialogue rounds.
  • Enterprise-level long-document processing: Relying on a 256K context window, the model handles long-document analysis, report generation, and decision support for automated office workflows.