MOVA - An end-to-end audio and video generation model open-sourced by Shanghai Innovation Institute and MOSI Intelligence

MOVA (MOSS Video and Audio) is an end-to-end audio and video generation model released jointly by the OpenMOSS team of Shanghai Innovation Institute and MOSI. It is described as China's first high-performance open-source model of this kind. Instead of producing silent video that needs separate dubbing, the model uses a heterogeneous dual-tower architecture with a bidirectional bridge module, so that the video and audio branches interact natively during generation. The model has 32 billion parameters in total (MoE architecture, 18 billion activated at inference), generates up to 8 seconds of 720p video together with matching audio in a single pass, and performs strongly on film-grade lip synchronization and on matching ambient sound effects to the scene.

Main functions of MOVA

End-to-end audio and video generation : The model outputs video and its matching audio in a single pass, so otherwise silent video needs no separate dubbing step.
Dual-mode input : Generation can be driven by image + text or by text alone, giving flexible control over the generated content.
Cinematic lip sync : The model matches a speaking character's mouth shapes to the voice and supports multi-character dialogue in Chinese and English.
Intelligent ambient sound effects : The model automatically synthesizes background music, action sounds, and environmental sounds that fit the scene.
Video text rendering : The model can render clear, readable dynamic text at specified positions in the frame.
High-resolution output : The model generates audio-visual clips of up to 720p resolution and up to 8 seconds in length.

Technical principles of MOVA

Heterogeneous dual-tower architecture : A 14B video diffusion model and a 1.3B audio diffusion model process visual and auditory information respectively. A bidirectional bridge module fuses the hidden states of the two towers through cross-attention, so that video generation stays aware of the audio rhythm throughout the process.
Cross-modal time alignment : Video and audio differ greatly in sampling density. The Aligned RoPE mechanism rescales the positions of both modalities' tokens onto the same physical time axis, which addresses audio-video desynchronization at its source.
Progressive training strategy : The model is trained in three stages, from coarse to fine. It first trains at a low resolution of 360p so that the randomly initialized bridge module quickly learns audio-video alignment, then gradually improves alignment stability, and finally scales up to 720p to refine image quality.
Dual CFG inference : Joint audio and video generation has two control sources, the text prompt and the cross-modal bridge. The guidance weight of each can be adjusted independently, which preserves picture quality in general scenes and improves lip-sync accuracy in dialogue scenes.
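
The time-alignment idea can be illustrated with a minimal sketch. It assumes that "aligned" positions simply mean dividing each token index by that modality's token rate to get a timestamp in seconds, and then feeding that timestamp into a standard rotary position embedding. The token rates, function names, and vector sizes below are hypothetical and are not taken from the MOVA code base.

```python
import math

# Hypothetical latent token rates (tokens per second of content); not MOVA's real values.
VIDEO_TOKENS_PER_SECOND = 6.0
AUDIO_TOKENS_PER_SECOND = 24.0


def aligned_positions(num_tokens, tokens_per_second):
    """Map token indices to timestamps in seconds on a shared time axis."""
    return [i / tokens_per_second for i in range(num_tokens)]


def rope_angles(position, dim, base=10000.0):
    """Standard RoPE angles: position * base^(-2k/dim) for each pair k."""
    return [position * base ** (-2 * k / dim) for k in range(dim // 2)]


def apply_rope(vec, position, base=10000.0):
    """Rotate consecutive pairs (x0, x1), (x2, x3), ... by their RoPE angle."""
    out = []
    for k, theta in enumerate(rope_angles(position, len(vec), base)):
        x, y = vec[2 * k], vec[2 * k + 1]
        c, s = math.cos(theta), math.sin(theta)
        out += [x * c - y * s, x * s + y * c]
    return out


# An 8-second clip yields different token counts per modality ...
video_pos = aligned_positions(48, VIDEO_TOKENS_PER_SECOND)
audio_pos = aligned_positions(192, AUDIO_TOKENS_PER_SECOND)

# ... but tokens covering the same instant share the same position:
# video token 6 and audio token 24 both sit at t = 1.0 s.
query = [1.0, 0.0, 0.5, -0.5]
same_rotation = apply_rope(query, video_pos[6]) == apply_rope(query, audio_pos[24])
```

Because both modalities are indexed in seconds rather than in raw token counts, cross-attention between the towers sees matching positional phases for content that occurs at the same moment, regardless of how densely each modality is sampled.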
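
Independent guidance weights for two control sources can be sketched as follows. The formula is an assumption: it uses a nested two-term classifier-free guidance, similar in form to the two-condition guidance used by InstructPix2Pix, and is not confirmed to be MOVA's exact formulation. The function name, argument names, and the three model predictions are hypothetical placeholders.

```python
def dual_cfg(pred_uncond, pred_text, pred_full, text_scale, bridge_scale):
    """Combine three denoiser predictions with two independent guidance weights.

    pred_uncond: prediction with neither text nor bridge conditioning
    pred_text:   prediction with text conditioning only (bridge disabled)
    pred_full:   prediction with both text and bridge conditioning
    """
    return [
        u + text_scale * (t - u) + bridge_scale * (f - t)
        for u, t, f in zip(pred_uncond, pred_text, pred_full)
    ]


# Toy predictions standing in for real model outputs.
uncond = [0.0, 1.0]
text_only = [1.0, 3.0]
full = [2.0, 2.0]

# With both weights at 1 the result is exactly the fully conditioned prediction.
neutral = dual_cfg(uncond, text_only, full, text_scale=1.0, bridge_scale=1.0)

# Raising only the bridge weight pushes further along the cross-modal direction,
# which is the knob one would turn up for dialogue scenes in this sketch.
dialogue = dual_cfg(uncond, text_only, full, text_scale=2.0, bridge_scale=3.0)
```

Keeping the two terms separate is what allows one weight to be tuned for prompt adherence and the other for audio-video coupling without one setting affecting the other.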

MOVA’s project address

Project official website : https://mosi.cn/models/mova
GitHub repository : https://github.com/OpenMOSS/MOVA
HuggingFace model library : https://huggingface.co/collections/OpenMOSS-Team/mova

MOVA application scenarios

Film and television production : Quickly generate storyboard previews and voice-over samples, reducing pre-production costs and speeding up creative validation.
Short video creation : Give creators high-quality narrative footage with sound effects, improving production efficiency and broadening content formats.
Game development : Automatically generate cutscenes and character dialogue with synchronized audio and video, for a more immersive experience and a shorter development cycle.
Education and training : Produce multilingual instructional videos with accurate lip sync, supporting content localization for global audiences and improving learning outcomes.
E-commerce marketing : Produce product showcase videos with narration and background music, speeding up iteration on marketing content and improving conversion.
