SoulX-FlashTalk - Soul App's open-source real-time digital human generation model
SoulX-FlashTalk is the first 14B-parameter real-time digital human generation model open-sourced by Soul App's AI team, achieving sub-second latency of 0.87 seconds and a high frame rate of 32 fps. The model uses bidirectional streaming distillation and a multi-step self-correction mechanism to deliver stable generation of unlimited duration, full-body motion interaction, and multi-language driving. It is suitable for 24/7 live streaming, virtual customer service, game NPCs, and similar scenarios. The model has already entered the top 5 of the HuggingFace I2V trending list, providing an open-source solution for commercial-grade real-time digital human applications.
Main features of SoulX-FlashTalk
- Real-time audio and video generation: Built on a 14B model, it achieves sub-second latency of 0.87 seconds and 32 fps output, meeting the demands of broadcast-grade real-time interaction.
- Audio-driven digital humans: Accepts voice or audio input and accurately drives the avatar's lip shapes, facial expressions, and body movements in sync.
- Full-body motion synthesis: Dynamically generates whole-body limb movements, including high-precision hand motions.
- Ultra-long stable generation: A self-correction mechanism keeps identity consistent, frames stable, and image quality lossless over long generation runs.
- Multi-language support: Uses a Chinese-optimized speech encoder and a Chinese-English bilingual subtitle encoder to support cross-language digital human driving.
- Unlimited streaming generation: Supports 24/7 continuous, uninterrupted live streaming, with the system running stably without crashes or stutters.
- Multi-style avatars: Compatible with diverse visual styles such as cartoon and photorealistic characters, meeting avatar customization needs across application scenarios.
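The latency and frame-rate figures above imply a tight per-step compute budget. A rough back-of-envelope sketch (the chunk size here is a hypothetical assumption for illustration, not a published detail of the model):

```python
# Back-of-envelope real-time budget from the reported figures.
# FPS and FIRST_FRAME_LATENCY come from the article; CHUNK_FRAMES is
# a hypothetical assumption.
FPS = 32                     # reported output frame rate
FIRST_FRAME_LATENCY = 0.87   # reported sub-second latency (seconds)

frame_interval_ms = 1000.0 / FPS     # time budget per frame
CHUNK_FRAMES = 4                     # assumed frames generated per streaming step
chunk_budget_ms = CHUNK_FRAMES * frame_interval_ms

print(f"per-frame budget: {frame_interval_ms:.2f} ms")   # 31.25 ms
print(f"per-chunk budget: {chunk_budget_ms:.2f} ms")     # 125.00 ms
```

In other words, once the 0.87-second first-frame latency is paid, each streaming step must produce its frames faster than real time (under roughly 31 ms per frame) for the 32 fps stream to never stall.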
Technical principles of SoulX-FlashTalk
- Bidirectional streaming distillation: Retains the intra-block bidirectional attention mechanism during streaming generation, effectively preserving spatio-temporal correlations and significantly simplifying training. The model converges with only 1,000 steps of supervised fine-tuning plus 200 steps of distillation, a 23× improvement in training efficiency over traditional methods, laying the groundwork for real-time deployment of large models.
- Latency-aware spatio-temporal adaptation: The first-stage training strategy, specially optimized for low-resolution inputs, short frame sequences, and dynamic aspect-ratio bucketing, lets the 14B model first adapt to fast-inference requirements, reducing the computational burden while preserving generation quality and resolving the tension between model size and inference speed.
- Multi-step review self-correction: Ensures the stability of unlimited-duration generation by detecting and correcting accumulated errors in real time, preventing them from snowballing over time. This keeps identity features consistent, frames stable and smooth, and visual quality lossless in long videos, achieving truly "infinite streaming" output.
- 3D VAE latent-space compression: Built on the WAN2.1 architecture, it performs efficient latent-space encoding and decoding of high-resolution video, greatly reducing the computational cost of real-time generation. Combined with the 14B DiT generator's full 3D attention and multi-modal cross-attention, and the conditional encoder layers' multi-dimensional encoding of speech, image, and text, it forms a complete end-to-end real-time digital human generation system.
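The chunked streaming loop with self-correction described above can be sketched as a toy simulation. Everything here (`generate_chunk`, `correct_chunk`, the scalar "identity state", and the drift/tolerance numbers) is an illustrative assumption, not the actual SoulX-FlashTalk implementation:

```python
# Toy simulation of streaming generation with a self-correction step.
# All names and numbers are hypothetical stand-ins for illustration.
import random

TOLERANCE = 0.1  # max allowed deviation from the reference identity (assumed)

def generate_chunk(prev_state):
    """Stand-in for the generator: each chunk inherits the previous
    latent state plus a small random perturbation (accumulated error)."""
    return prev_state + random.uniform(-0.05, 0.05)

def correct_chunk(state, reference):
    """'Multi-step review': when the state drifts past the tolerance,
    pull it halfway back toward the reference identity."""
    if abs(state - reference) > TOLERANCE:
        state = reference + (state - reference) * 0.5
    return state

def stream(num_chunks, reference=0.0):
    """Unlimited-duration streaming: errors are corrected every step,
    so drift stays bounded no matter how many chunks are generated."""
    state, states = reference, []
    for _ in range(num_chunks):
        state = generate_chunk(state)
        state = correct_chunk(state, reference)
        states.append(state)
    return states

random.seed(0)
states = stream(10_000)
print(f"max drift over {len(states)} chunks: {max(abs(s) for s in states):.3f}")
```

Without `correct_chunk`, the state performs an unbounded random walk; with it, deviation from the reference stays within the tolerance no matter how long the stream runs, which is exactly the "snowballing error" the mechanism is described as preventing.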
SoulX-FlashTalk project address
- Project official website: https://soul-ailab.github.io/soulx-flashtalk/
- GitHub repository: https://github.com/Soul-AILab/SoulX-FlashTalk
- HuggingFace model library: https://huggingface.co/Soul-AILab/SoulX-FlashTalk-14B
Application scenarios of SoulX-FlashTalk
- 24/7 AI live-streaming rooms: E-commerce digital human hosts can stream around the clock without interruption, reading and replying to live-chat comments in real time, significantly cutting labor costs while keeping the stream natural and smooth.
- AI virtual tutors and smart customer service: In scenarios such as bank teller services and online education, it delivers a face-to-face experience similar to a video call, with real-time voice Q&A and emotional feedback.
- Mass production of high-quality short videos and mini-dramas: A single audio clip is enough to generate a complete digital human video, with no motion-capture equipment or post-production required; long-video output stays stable and consistent, greatly improving content production efficiency.
- Real-time in-game NPCs: The model supports voice-driven, unscripted dialogue with real-time linkage of emotion and action, giving players a more immersive and dynamic interactive experience.