PrismAudio - A video-to-audio generation framework launched by Alibaba Tongyi

PrismAudio, developed by Alibaba's Tongyi Lab, is a video-to-audio (V2A) framework that automatically adds ambient sound effects to silent videos. The model pioneers a decompositional chain-of-thought technique: it first reasons about sound content, timing, texture, and spatial location, and only then generates the audio. Four "teacher" models (semantic, temporal, aesthetic, and spatial) score the output along each dimension for multi-dimensional optimization. With only 518 million parameters, the model generates 9 seconds of audio in just 0.63 seconds, significantly outperforming existing methods. The work has been accepted at ICLR 2026.

Main functions of PrismAudio

  • Video-to-audio generation: automatically generates environmental sound effects that match the picture (hoofbeats, wind, rain, etc.) for silent videos.
  • Semantic alignment: ensures the generated sounds accurately correspond to the objects and actions in the video, avoiding audio-visual mismatches.
  • Temporal synchronization: precisely aligns sounds with visual events for tight synchronization.
  • Aesthetic optimization: produces natural, layered, high-quality audio without an artificial "electronic" character, enhancing the listening experience.
  • Spatial positioning: outputs stereo and automatically balances the left and right channels according to the sound source's position in the frame, so the sound's location is identifiable.
  • Chain-of-thought reasoning: uses a decomposed "think first, then generate" chain of thought, making the generation process explainable and controllable.
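The decomposed chain of thought can be pictured as a structured "sound plan" produced before any audio is synthesized. The field names and example values below are invented for this sketch; PrismAudio's actual reasoning format may differ.

```python
from dataclasses import dataclass, asdict

# Hypothetical illustration of a decomposed chain-of-thought record.
# One field per reasoning dimension described in the article.
@dataclass
class SoundPlan:
    content: str   # what sounds occur (semantic dimension)
    timing: str    # when they occur relative to visual events (temporal)
    texture: str   # timbre and quality of the sound (aesthetic)
    spatial: str   # where the source sits in the frame (spatial)

plan = SoundPlan(
    content="galloping hoofbeats on dirt, light wind",
    timing="hoofbeats start at 0.4 s, aligned with each stride",
    texture="dry, percussive impacts with a soft low-frequency body",
    spatial="horse enters frame-left and crosses to frame-right",
)

# The plan conditions the audio generator ("think first, then
# generate"); each field corresponds to one of the four "teachers".
print(asdict(plan))
```

Because the plan is explicit text rather than a hidden latent, it can be inspected or edited before generation, which is what makes the process controllable.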

Key information and usage requirements for PrismAudio

  • Developer: Alibaba Tongyi Lab
  • Technology type: video-to-audio (V2A) framework
  • Core innovation: decomposed chain of thought + multi-dimensional reinforcement learning
  • Model size: 518 million parameters
  • Output specification: 44 kHz stereo
  • Inference speed: 0.63 seconds to generate 9 seconds of audio
  • Input format: silent video (common video formats supported)
  • Content restrictions: generates environmental sounds/sound effects only; character dubbing (speech) is not supported
  • Optional input: an accompanying text description can guide generation (not required)
  • Hardware requirements: GPU acceleration supported; can also run on CPU
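The stated inference speed implies the model runs far faster than real time. A quick check of the figures given above:

```python
# Real-time factor implied by the article's numbers:
# 9 seconds of audio generated in 0.63 seconds of compute.
audio_seconds = 9.0
compute_seconds = 0.63
rtf = audio_seconds / compute_seconds
print(f"real-time factor: {rtf:.1f}x")  # about 14.3x faster than real time
```

A real-time factor above 1 means a clip can be scored faster than it plays, which is why the article highlights real-time application scenarios.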

PrismAudio’s core advantages

  • Four-dimensional collaborative optimization: the first decomposed chain of thought that models semantics, timing, aesthetics, and space independently and optimizes them jointly, avoiding the traditional trade-off where improving one dimension degrades another and achieving tight audio-visual unity.
  • Think before generating: instead of end-to-end black-box generation, the model first outputs structured reasoning text (sound content, timing, texture, spatial orientation) and then generates the audio, making the process explainable and controllable.
  • Efficient and lightweight: with only 518 million parameters and 0.63 seconds to generate 9 seconds of audio, it is roughly twice as fast as comparable models and better suited to real-time applications.
  • Robust in complex scenes: on the authors' AudioCanvas benchmark of complex scenes, it far exceeds existing methods and maintains stable output in multi-event, multi-source scenarios.
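One way the four "teachers" could feed a single training signal is a weighted combination of their per-dimension scores. The scores and equal weights below are illustrative assumptions, not PrismAudio's published formula.

```python
# Sketch of multi-dimensional reward aggregation for RL fine-tuning.
# Score values and uniform weights are invented for this example.
def aggregate_reward(scores: dict, weights: dict) -> float:
    """Weighted sum of per-dimension teacher scores."""
    assert scores.keys() == weights.keys()
    return sum(weights[k] * scores[k] for k in scores)

teacher_scores = {       # e.g., each normalized to [0, 1]
    "semantic": 0.82,    # does the sound match on-screen objects/actions?
    "temporal": 0.75,    # are sound events synchronized with the visuals?
    "aesthetic": 0.68,   # is the audio natural and pleasant?
    "spatial": 0.71,     # does the panning match the source position?
}
uniform = {k: 0.25 for k in teacher_scores}

print(aggregate_reward(teacher_scores, uniform))  # about 0.74
```

Keeping the dimensions separate until this final aggregation is what lets the framework detect and penalize a model that trades one dimension (say, timing) for another (say, aesthetics).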

How to use PrismAudio

  • Online demo (recommended for newcomers): open the Demo on Hugging Face, upload a silent video, optionally enter a text description to guide generation, and the model automatically produces the audio file.
  • Local deployment: download the open-source code and model weights from GitHub or Hugging Face, install the dependencies, load the pre-trained model, then pass a video path to the inference interface to generate audio; chain-of-thought parameters and reward weights can be customized.
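The local inference flow reduces to "video path in, stereo waveform out." Since the repo's actual entry point is not documented here, the wrapper below is a stand-in stub that only demonstrates the expected I/O contract (silent video in; 9 seconds of 44 kHz stereo out, per the specs above).

```python
import numpy as np

SAMPLE_RATE = 44_000   # the article states 44 kHz stereo output
CLIP_SECONDS = 9       # the article's benchmark clip length

def generate_audio(video_path, prompt=None):
    """Stand-in for PrismAudio's inference interface; the real entry
    point lives in the open-source repo and may differ.

    Returns a (channels, samples) stereo waveform."""
    # Real code would load the video, run the decomposed
    # chain-of-thought reasoning, then decode audio. Here we just
    # return silence with the documented output shape.
    return np.zeros((2, SAMPLE_RATE * CLIP_SECONDS), dtype=np.float32)

wave = generate_audio("clip.mp4", prompt="horse galloping in the rain")
print(wave.shape)  # (2, 396000)
```

Checking the returned array's shape and dtype like this is a cheap sanity test to run after wiring up the real model weights.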

PrismAudio project address

Comparison of similar competing products of PrismAudio

| Dimension | PrismAudio | MMAudio | ThinkSound |
| --- | --- | --- | --- |
| Developer | Alibaba Tongyi Lab | Nanyang Technological University (Singapore) et al. | Alibaba Tongyi Lab |
| Core method | Decomposed chain of thought + multi-dimensional RL | Multimodal Transformer | Single chain of thought |
| Parameter count | 518 million | ~1 billion | Billions |
| Inference speed | 0.63 s / 9 s audio | 1.30 s / 9 s audio | 1.07 s / 9 s audio |
| Output quality | 44 kHz stereo | 44 kHz mono | 44 kHz stereo |
| Semantic consistency (CLAP, higher is better) | 0.47 | 0.40 | 0.43 |
| Temporal sync (DeSync, lower is better) | 0.41 | 0.46 | 0.55 |
| Spatial accuracy (CRW) | 7.72 | — (mono output) | 13.47 |
| Aesthetic quality (MOS-Q, higher is better) | 4.21 | 3.95 | 4.05 |

Application scenarios of PrismAudio

  • Film and television post-production: automatically generates environmental sound effects for films, documentaries, and trailers, replacing traditional foley work and reducing post-production cost and time.
  • Short-video creation: quickly adds ambient sound to silent Vlogs, food, and travel videos, enhancing the immersion and reach of ASMR and relaxation content.
  • Game development: generates dynamic sound effects for cutscenes and CG trailers, matching environmental sounds in real time to scenes such as forests, cities, and battlefields, reducing repetitive work for sound designers.
  • Advertising and marketing: automatically adds operational sound effects to product-demo videos and supports rapid iteration over multiple audio-track versions, improving ad-testing efficiency and creative flexibility.
  • Education and training: supplements teaching videos and operation demos with cue and background sounds, enriching the auditory experience of multimedia courseware and improving focus and retention.