SoulX-LiveAct - Soul App's open-source real-time digital human generation framework
SoulX-LiveAct is an open-source real-time digital human generation framework from the Soul App AI team, solving the stability challenge of streaming AR diffusion models. Core innovations include: Neighbor Forcing technology aligns diffusion steps between adjacent frames to ensure image consistency; and the ConvKV Memory mechanism achieves constant GPU memory usage, supporting generation for hours or even unlimited durations.
SoulX-LiveAct is an open-source real-time digital human generation framework developed by the Soul App AI team that solves the stability problem of streaming generation with autoregressive (AR) diffusion models. Its core innovations are Neighbor Forcing, which aligns the diffusion steps of adjacent frames to keep the picture consistent, and the ConvKV Memory mechanism, which keeps GPU memory usage constant and supports hour-level or even unlimited-duration generation. With only two H100/H200 GPUs it achieves 20 FPS real-time inference at an end-to-end latency of just 0.94 seconds. SoulX-LiveAct is suited to live streaming, virtual customer service, podcasting, and similar scenarios, marking a new stage in which open-source digital human technology can be deployed in production environments.
Main functions of SoulX-LiveAct
- Real-time portrait animation: Generates high-fidelity digital human video in real time from audio and text conditions, with precise lip synchronization, natural facial expressions, and coordinated body movements.
- Hour-level / unlimited-duration video: A constant-memory mechanism removes the duration limits of traditional models, allowing stable generation of continuous video streams lasting hours or even indefinitely.
- Controllable emotion and action editing: Head pose, gestures, and facial expressions can be flexibly controlled through text instructions (e.g., a heart-shaped gesture, covering the face, laughing) while identity consistency and accurate lip sync are maintained.
- Low-latency streaming inference: Only two H100/H200 GPUs are needed for 20 FPS real-time output with an end-to-end latency of just 0.94 seconds, meeting real-time interaction needs in scenarios such as live streaming and virtual customer service.
Technical principles of SoulX-LiveAct
- Neighbor Forcing: Traditional AR diffusion models apply different diffusion steps to adjacent frames, so their distributions are inconsistent and the picture jitters. Neighbor Forcing forces adjacent frames to be generated at the same diffusion step and feeds the previous frame's latent as the conditional input of the current frame, keeping the whole generation process in a consistent noise space, eliminating step-misalignment problems, and achieving stable temporal coherence (see the sketch after this list).
- ConvKV Memory: The GPU memory bottleneck of long-video generation comes from the KV cache growing linearly with the number of frames. ConvKV Memory adopts a "short-term precision + long-term compression" strategy: it keeps the high-precision KV cache of recent frames to preserve coherence, compresses historical frames into a fixed-length memory with a 1D convolution (5:1 compression ratio), and resets the RoPE position encoding.
- End-to-end performance optimization: The system uses adaptive FP8 precision to reduce computation, sequence parallelism to exploit multi-GPU compute, and operator fusion to cut memory-access overhead. Together these yield 20 FPS real-time inference at only 27.2 TFLOPs per frame, 30%-45% lower compute cost than comparable methods.
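To make the two core mechanisms above more concrete, here is a minimal, illustrative Python sketch of how Neighbor Forcing and ConvKV Memory could fit together. It is not the project's actual implementation: names such as `ConvKVMemory`, `generate_stream`, the step schedule, and the window sizes are assumptions chosen only for illustration.

```python
import torch
import torch.nn as nn


class ConvKVMemory(nn.Module):
    """Constant-size memory sketch: exact KV for recent frames, 5:1 1D-conv compression for older ones."""

    def __init__(self, dim: int, recent_frames: int = 8, compress_ratio: int = 5, max_slots: int = 64):
        super().__init__()
        self.dim = dim
        self.recent_frames = recent_frames
        self.compress_ratio = compress_ratio
        self.max_slots = max_slots
        # 1D convolution along the time axis: `compress_ratio` old frames -> 1 memory slot.
        self.compress = nn.Conv1d(dim, dim, kernel_size=compress_ratio, stride=compress_ratio)
        self.recent: list[torch.Tensor] = []     # full-precision per-frame KV summaries
        self.long_term: list[torch.Tensor] = []  # compressed history slots (fixed length)

    @torch.no_grad()
    def append(self, kv: torch.Tensor) -> None:
        """kv: (dim, tokens) KV summary for one newly generated frame."""
        self.recent.append(kv)
        if len(self.recent) > self.recent_frames + self.compress_ratio:
            old = torch.stack(self.recent[: self.compress_ratio], dim=-1)  # (dim, tokens, 5)
            old = old.mean(dim=1)                        # pool tokens -> (dim, 5)
            slot = self.compress(old.unsqueeze(0))       # (1, dim, 1)
            self.long_term.append(slot.squeeze())        # keep one fixed-size slot (dim,)
            self.recent = self.recent[self.compress_ratio:]
            if len(self.long_term) > self.max_slots:     # drop the oldest slot so memory stays constant
                self.long_term.pop(0)

    def context(self) -> torch.Tensor:
        """Constant-size conditioning context: compressed history + exact recent frames."""
        parts = self.long_term + [kv.mean(dim=-1) for kv in self.recent]
        if not parts:
            return torch.zeros(1, self.dim)
        return torch.stack(parts, dim=0)


def generate_stream(model, audio_feats, steps=(999, 749, 499, 249), dim=64):
    """Neighbor Forcing sketch: every frame runs the SAME diffusion-step schedule
    as its neighbor and is conditioned on the previous frame's latent."""
    memory = ConvKVMemory(dim)
    prev_latent = torch.zeros(dim, 16)               # stand-in latent for the "previous frame"
    for frame_audio in audio_feats:
        latent = torch.randn(dim, 16)                # fresh noise for the new frame
        for t in steps:                              # identical diffusion steps for all frames
            latent = model(latent, t, prev_latent, memory.context(), frame_audio)
        memory.append(latent)
        prev_latent = latent
        yield latent


# Toy usage with a stand-in denoiser that just mixes the new latent with the previous one:
# dummy = lambda z, t, prev, mem, a: 0.5 * z + 0.5 * prev
# frames = list(generate_stream(dummy, [torch.randn(32) for _ in range(30)]))
```

The two points the sketch tries to capture are that every frame shares the same diffusion-step schedule as its neighbor, and that the conditioning context handed to the denoiser stays a fixed size no matter how long the stream runs.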
Key information and usage requirements for SoulX-LiveAct
- Project positioning: An open-source real-time interactive digital human generation framework from Soul App's AI Lab that addresses the stability and duration limitations of AR diffusion models in streaming generation and supports hour-level or even unlimited-duration video synthesis.
- Core breakthrough – Neighbor Forcing: Aligns adjacent frames to the same diffusion step, eliminating picture jitter caused by inconsistent distributions.
- Core breakthrough – ConvKV Memory: Constant GPU memory usage, breaking the duration bottleneck.
- Core breakthrough – real-time performance: 20 FPS streaming inference at 0.94-second latency.
- Measured performance – resolution: 512×512 or 720×416.
- Measured performance – frame rate: 20 FPS.
- Measured performance – latency: 0.94 seconds.
- Measured performance – computational cost: 27.2 TFLOPs per frame.
- Recommended configuration – GPU: 2× NVIDIA H100 or H200.
- Recommended configuration – environment: Python 3.10 with CUDA support.
- Recommended configuration – key dependencies: SageAttention (FP8 attention), vLLM (FP8 GEMM), LightVAE.
- Consumer graphics cards – applicable models: single RTX 4090/5090.
Core Advantages of SoulX-LiveAct
- Neighbor Forcing technology: Aligns adjacent frames to the same diffusion step, eliminating the picture jitter caused by inconsistent distributions in traditional AR diffusion models and keeping the generation process stable and coherent.
- ConvKV Memory mechanism: A "short-term precision + long-term compression" strategy compresses the historical KV cache to a fixed length, keeping GPU memory usage constant, breaking the duration bottleneck, and supporting hour-level or even unlimited-duration generation.
- Real-time streaming inference: Only two H100/H200 GPUs are needed for 20 FPS real-time output with an end-to-end latency of just 0.94 seconds, satisfying real-time interactive scenarios such as live streaming.
- Low computational cost: Only 27.2 TFLOPs per frame, 30%-45% less compute than comparable methods, balancing high quality with high efficiency.
- Long-term consistency: In hour-level videos the character's identity stays stable, key details are not lost, and lip movements remain accurately synchronized, avoiding identity drift and flickering accessories.
How to use SoulX-LiveAct
- Prepare the environment: Use conda to create and activate a Python 3.10 environment named liveact.
- Install basic dependencies: Install the packages in requirements.txt with pip, and install the sox audio processing tool with conda.
- Install SageAttention: Clone the SageAttention repository, check out version v2.2.0, and run its setup.py installation to enable FP8 attention acceleration.
- Install the fused-QKV operator version (optional): Clone and install the SageAttentionFusion repository to gain further operator-fusion performance.
- Install vLLM: Install vLLM 0.11.0 with pip to provide FP8 GEMM matrix-operation support.
- Install LightVAE: Clone the LightX2V repository and run setup_vae.py to install the video codec component.
- Download model weights: Download the SoulX-LiveAct model files from Hugging Face or ModelScope to a local directory.
- Download the audio encoder: Obtain the chinese-wav2vec2-base audio feature extraction model.
- Dual-GPU H100/H200 real-time inference: Set the environment variables and launch dual-GPU distributed inference with torchrun, specifying the model path, audio encoder path, and input JSON file, and enabling 20 FPS audio-driven streaming generation.
- Action/expression editing inference: At 512×512 resolution and 24 FPS, load the example_edit.json file containing editing instructions to generate controllable expressions.
- Running on RTX 4090/5090 consumer GPUs: In single-GPU mode, enable the FP8 KV cache, block-wise GPU memory offload, and CPU offload of the T5 text encoder to reduce GPU memory usage on consumer cards.
- Prepare input data: Edit the JSON configuration file to specify the reference image path, driving audio path, emotion/action text prompts, and other generation parameters (see the sketch after this list).
- Start streaming generation: After the inference command runs, the system outputs a digital human video stream in real time from the audio input, with synchronized lips and coordinated facial expressions and movements.
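As a rough illustration of the input JSON mentioned in the steps above, the snippet below builds a configuration file from Python. The field names (`ref_image`, `audio`, `prompt`, and so on) are assumptions for illustration only; the actual schema is defined by the example files shipped with the repository, such as example_edit.json.

```python
import json

# Hypothetical input configuration: the real field names are defined by the example
# JSON files in the SoulX-LiveAct repository (e.g. example_edit.json).
config = {
    "ref_image": "assets/reference_portrait.png",  # reference portrait of the digital human
    "audio": "assets/driving_speech.wav",          # driving audio for lip synchronization
    "prompt": "smiling warmly, making a heart-shaped gesture",  # emotion/action text prompt
    "resolution": [512, 512],                      # output resolution
    "fps": 20,                                     # target streaming frame rate
}

with open("my_input.json", "w", encoding="utf-8") as f:
    json.dump([config], f, ensure_ascii=False, indent=2)
```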
SoulX-LiveAct project address
- Project website: https://soul-ailab.github.io/soulx-liveact/
- GitHub repository: https://github.com/Soul-AILab/SoulX-LiveAct
- Hugging Face model hub: https://huggingface.co/Soul-AILab/LiveAct
- arXiv technical paper: https://arxiv.org/pdf/2603.11746
Comparison of SoulX-LiveAct with similar products
| Comparison dimension | InfiniteTalk | Live-Avatar | OmniAvatar | SoulX-LiveAct |
|---|---|---|---|---|
| **Inference efficiency** | | | | |
| Throughput | 25 FPS | 20 FPS | – | 20 FPS |
| Latency | 3.20 seconds | 2.89 seconds | – | 0.94 seconds |
| Number of GPUs | 8 | 5 | – | 2 |
| TFLOPs per frame | 50.2 | 39.1 | – | 27.2 |
| **Long-term generation ability** | | | | |
| GPU memory usage | Linear growth | Linear growth | Linear growth | Constant |
| Maximum duration | Limited by GPU memory | Limited by GPU memory | Limited by GPU memory | Unlimited |
| Identity consistency | Drifts in later stages | Gradual drift | Severe drift | Remains stable |
| Lip sync | Mismatched in later stages | Gradual mismatch | Severe mismatch | Consistently accurate |
| Accessory/texture consistency | Intermittent flicker | Details lost | Severe loss | Remains stable |
Application scenarios of SoulX-LiveAct
- Live streaming: Generates digital human hosts in real time and supports uninterrupted 24/7 broadcasts, with lip movements accurately synchronized to the voice and natural, expressive faces; suited to e-commerce livestream selling, entertainment streaming, knowledge sharing, and similar scenarios.
- Virtual customer service: Provides around-the-clock online service with a stable, consistent digital human image, supports long conversations and interactions, reduces labor costs for enterprises, and improves the service experience.
- Podcasts/talk shows: Produces two-person dialogue and interview programs, generating natural facial expressions and body language in real time; guest appearances are controllable and editable, so high-quality content can be produced quickly.
- Video calls: Applicable to business-facing scenarios such as virtual social networking, online education, and remote meetings, with latency as low as 0.94 seconds and smooth, natural interaction.