Fun-CineForge - Alibaba Tongyi's open-source film-grade multimodal dubbing model
Fun-CineForge is the first film-grade multimodal dubbing model open-sourced by Tongyi Lab. Built on CosyVoice3, it innovatively introduces a "temporal modality" to achieve precise audio-visual synchronization. The model supports monologues, narration, dialogues, and multi-person scenes, solving four major challenges: lip-syncing, emotional expression, consistent timbre, and time alignment. Fun-CineForge comes with an open-source CineDub dataset construction workflow, covering over 350 films and TV series, with a Chinese character error rate as low as 1.49%. It maintains high-quality dubbing even in complex scenes such as facial occlusion and camera transitions.
Main functions of Fun-CineForge
- Lip sync: synthesized speech stays tightly synchronized with the on-screen characters' lip movements, achieving precise audio-visual alignment.
- Emotional expression: based on a character's facial image and an instruction-style description, the model delivers lifelike, freely controllable emotional tone.
- Timbre cloning: the model references the timbre characteristics of an input audio clip to synthesize highly similar personalized speech.
- Time alignment: speech onset and offset are controlled by timestamps, so voices are generated at the correct moment even when the speaker is occluded.
- Multi-scene adaptation: supports complex film and television dubbing scenarios such as monologue, narration, two-person dialogue, and multi-person dialogue.
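The capabilities above all amount to describing each utterance with timbre, emotion, and timing metadata. A minimal sketch of what such a dubbing request could look like in code (the `DubbingSegment` class and every field name here are illustrative assumptions, not Fun-CineForge's actual API):

```python
from dataclasses import dataclass

# Hypothetical request structure -- field names are illustrative
# assumptions, not Fun-CineForge's real interface.
@dataclass
class DubbingSegment:
    speaker: str    # which character speaks (multi-person scenes)
    text: str       # line content
    start_ms: int   # when the voice should begin (time alignment)
    end_ms: int     # when it should end
    emotion: str    # instruction-style emotion description
    ref_audio: str  # reference clip for timbre cloning

scene = [
    DubbingSegment("A", "Where were you last night?", 0, 1800, "suspicious", "a_ref.wav"),
    DubbingSegment("B", "Working late. Why?", 2100, 3500, "defensive", "b_ref.wav"),
    DubbingSegment("A", "No reason.", 3900, 4600, "calm", "a_ref.wav"),
]

def validate(segments):
    """Segments must be ordered and non-overlapping so each voice
    appears only inside its assigned time window."""
    for prev, cur in zip(segments, segments[1:]):
        assert prev.end_ms <= cur.start_ms, "overlapping speech windows"
    return True

validate(scene)
```

Keeping the windows non-overlapping mirrors the time-alignment constraint: even if a speaker is off-screen or occluded, the model only emits their voice inside the stated interval.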
Technical principles of Fun-CineForge
- Multimodal fusion architecture: the model processes four types of information simultaneously. The visual modality captures lip movements and facial expressions; the text modality supplies line content and emotional cues about the character; the audio modality serves as the prediction target; and the temporal modality controls when speech occurs and identifies the speaker. Together, the four enable accurate dubbing.
- Temporal-modality innovation: time information is introduced into a dubbing model as an independent modality for the first time. Through strong supervision signals such as start time, duration, and speaker identity, the model learns "who speaks when" and can accurately locate speech windows even when faces are occluded or the camera cuts.
- Data-driven training: the model is trained on the automatically constructed CineDub dataset, extracted from film and television footage through vocal separation, text transcription, and speaker diarization. It contains frame-level lip data, millisecond-level timestamps, and emotion annotations, providing multimodal supervision signals for the model.
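Pairing millisecond-level timestamps with frame-level lip data, as the CineDub pipeline is described to do, reduces to mapping each speech window onto video frame indices. A sketch of that conversion (the frame rate, function name, and record fields are assumptions for illustration, not the released code):

```python
# Illustrative sketch of aligning millisecond timestamps with
# frame-level visual features. Names here are assumptions, not
# the actual CineDub pipeline code.

FPS = 25  # assumed video frame rate

def ms_to_frame_range(start_ms, end_ms, fps=FPS):
    """Map a speech window in milliseconds to half-open video frame
    indices [first, last), so lip features and audio targets align."""
    first = start_ms * fps // 1000          # floor of start frame
    last = -(-end_ms * fps // 1000)         # ceil of end frame
    return first, last

# Example annotation in a hypothetical CineDub-style record:
segment = {"speaker": "spk_03", "start_ms": 2100, "end_ms": 3500,
           "emotion": "defensive", "text": "Working late. Why?"}

f0, f1 = ms_to_frame_range(segment["start_ms"], segment["end_ms"])
lip_frames = list(range(f0, f1))  # frame indices carrying lip features
```

With this mapping, each utterance's strong supervision signals (start time, duration, speaker identity) can be attached both to the audio target and to the exact video frames whose lip movements should match it.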
Fun-CineForge project address
- Official website: https://funcineforge.github.io/
- GitHub repository: https://github.com/FunAudioLLM/FunCineForge
- HuggingFace model hub: https://huggingface.co/FunAudioLLM/Fun-CineForge
Application scenarios of Fun-CineForge
- Film and television post-production: multi-language dubbing for movies and TV series with accurately matched lip shapes and emotions, handling complex scenes such as camera cuts and facial occlusion.
- Animation and game development: generate audio-visually synchronized voices for animated characters, support distinct timbres across multiple characters, and reduce the cost of dubbing game storylines.
- Content localization: translate and dub foreign film and television works into other languages while preserving the original's emotional rhythm, including long-form content such as narration and monologues.
- Advertising and short video: quickly generate voiceovers for talking-head videos, adjust tone to match the mood of the scene, and clone specific timbres to maintain brand consistency.
- Accessibility: generate synchronized narration for silent videos, help visually impaired users understand the picture, and provide precisely matched subtitle audio.