SoulX-FlashTalk - Soul App's open-source real-time digital human generation model
SoulX-FlashTalk is the first 14B-parameter real-time digital human generation model open-sourced by Soul App's AI team, achieving sub-second latency of 0.87 seconds and a high frame rate of 32 fps. The model uses bidirectional streaming distillation and a multi-step self-correction mechanism to deliver stable generation of unlimited duration, full-body motion interaction, and multi-language driving. It is suitable for 24/7 live streaming, virtual customer service, game NPCs, and similar scenarios. The model has already entered the top 5 of the HuggingFace I2V trending list, providing an open-source solution for commercial-grade real-time digital human applications.
Main features of SoulX-FlashTalk
- Real-time audio and video generation: Built on a 14B model, it achieves sub-second latency of 0.87 seconds and 32 fps output, meeting the demands of broadcast-grade real-time interaction.
- Audio-driven digital humans: Accepts voice or audio input and accurately drives the avatar's lip shapes, facial expressions, and body movements in sync.
- Full-body motion synthesis: Dynamically generates whole-body limb movements, including high-precision hand motions.
- Ultra-long stable generation: A self-correction mechanism keeps identity consistent, frames stable, and image quality lossless over long generation runs.
- Multi-language support: Uses a Chinese-optimized speech encoder and a Chinese-English bilingual subtitle encoder to support cross-language digital human driving.
- Unlimited streaming generation: Supports 24/7 continuous, uninterrupted live streaming, with the system running stably without crashes or stutters.
- Multi-style avatars: Compatible with diverse visual styles such as cartoon and photorealistic characters, meeting avatar customization needs across application scenarios.
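The latency and frame-rate figures above imply a tight per-step compute budget. A rough back-of-envelope sketch (the chunk size here is a hypothetical assumption for illustration, not a published detail of the model):

```python
# Back-of-envelope real-time budget from the reported figures.
# FPS and FIRST_FRAME_LATENCY come from the article; CHUNK_FRAMES is
# a hypothetical assumption.
FPS = 32                     # reported output frame rate
FIRST_FRAME_LATENCY = 0.87   # reported sub-second latency (seconds)

frame_interval_ms = 1000.0 / FPS     # time budget per frame
CHUNK_FRAMES = 4                     # assumed frames generated per streaming step
chunk_budget_ms = CHUNK_FRAMES * frame_interval_ms

print(f"per-frame budget: {frame_interval_ms:.2f} ms")   # 31.25 ms
print(f"per-chunk budget: {chunk_budget_ms:.2f} ms")     # 125.00 ms
```

In other words, once the 0.87-second first-frame latency is paid, each streaming step must produce its frames faster than real time (under roughly 31 ms per frame) for the 32 fps stream to never stall.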
Technical principles of SoulX-FlashTalk
- Bidirectional streaming distillation: Retains the intra-block bidirectional attention mechanism during streaming generation, effectively preserving spatio-temporal correlations and significantly simplifying training. The model converges with only 1,000 steps of supervised fine-tuning plus 200 steps of distillation, a 23× improvement in training efficiency over traditional methods, laying the groundwork for real-time deployment of large models.
- Latency-aware spatio-temporal adaptation: The first-stage training strategy, specially optimized for low-resolution inputs, short frame sequences, and dynamic aspect-ratio bucketing, lets the 14B model first adapt to fast-inference requirements, reducing the computational burden while preserving generation quality and resolving the tension between model size and inference speed.
- Multi-step review self-correction: Ensures the stability of unlimited-duration generation by detecting and correcting accumulated errors in real time, preventing them from snowballing over time. This keeps identity features consistent, frames stable and smooth, and visual quality lossless in long videos, achieving truly "infinite streaming" output.
- 3D VAE latent-space compression: Built on the WAN2.1 architecture, it performs efficient latent-space encoding and decoding of high-resolution video, greatly reducing the computational cost of real-time generation. Combined with the 14B DiT generator's full 3D attention and multi-modal cross-attention, and the conditional encoder layers' multi-dimensional encoding of speech, image, and text, it forms a complete end-to-end real-time digital human generation system.
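The chunked streaming loop with self-correction described above can be sketched as a toy simulation. Everything here (`generate_chunk`, `correct_chunk`, the scalar "identity state", and the drift/tolerance numbers) is an illustrative assumption, not the actual SoulX-FlashTalk implementation:

```python
# Toy simulation of streaming generation with a self-correction step.
# All names and numbers are hypothetical stand-ins for illustration.
import random

TOLERANCE = 0.1  # max allowed deviation from the reference identity (assumed)

def generate_chunk(prev_state):
    """Stand-in for the generator: each chunk inherits the previous
    latent state plus a small random perturbation (accumulated error)."""
    return prev_state + random.uniform(-0.05, 0.05)

def correct_chunk(state, reference):
    """'Multi-step review': when the state drifts past the tolerance,
    pull it halfway back toward the reference identity."""
    if abs(state - reference) > TOLERANCE:
        state = reference + (state - reference) * 0.5
    return state

def stream(num_chunks, reference=0.0):
    """Unlimited-duration streaming: errors are corrected every step,
    so drift stays bounded no matter how many chunks are generated."""
    state, states = reference, []
    for _ in range(num_chunks):
        state = generate_chunk(state)
        state = correct_chunk(state, reference)
        states.append(state)
    return states

random.seed(0)
states = stream(10_000)
print(f"max drift over {len(states)} chunks: {max(abs(s) for s in states):.3f}")
```

Without `correct_chunk`, the state performs an unbounded random walk; with it, deviation from the reference stays within the tolerance no matter how long the stream runs, which is exactly the "snowballing error" the mechanism is described as preventing.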
SoulX-FlashTalk project address
- Project official website: https://soul-ailab.github.io/soulx-flashtalk/
- GitHub repository: https://github.com/Soul-AILab/SoulX-FlashTalk
- HuggingFace model library: https://huggingface.co/Soul-AILab/SoulX-FlashTalk-14B
Application scenarios of SoulX-FlashTalk
- 24/7 AI live-streaming rooms: E-commerce digital human hosts can stream around the clock without interruption, reading and replying to live-chat comments in real time, significantly cutting labor costs while keeping the stream natural and smooth.
- AI virtual tutors and smart customer service: In scenarios such as bank teller services and online education, it delivers a face-to-face experience similar to a video call, with real-time voice Q&A and emotional feedback.
- Mass production of high-quality short videos and mini-dramas: A single audio clip is enough to generate a complete digital human video, with no motion-capture equipment or post-production required; long-video output stays stable and consistent, greatly improving content production efficiency.
- Real-time in-game NPCs: The model supports voice-driven, unscripted dialogue with real-time linkage of emotion and action, giving players a more immersive and dynamic interactive experience.