PrismAudio - A video-to-audio generation framework launched by Alibaba Tongyi

PrismAudio, developed by Alibaba's Tongyi Lab, is a video-to-audio (V2A) framework that automatically adds ambient sound effects to silent videos. The model pioneers a decompositional chain-of-thought technique: it first reasons about sound content, timing, texture, and spatial location, and only then generates the audio. Four "teacher" models (semantic, temporal, aesthetic, and spatial) score the output along each dimension for multi-dimensional optimization. With only 518 million parameters, the model generates 9 seconds of audio in just 0.63 seconds, significantly outperforming existing methods. The work has been accepted at ICLR 2026.

Main functions of PrismAudio

  • Video-to-audio generation: automatically generates environmental sound effects that match the picture (hoofbeats, wind, rain, etc.) for silent videos.
  • Semantic alignment: ensures the generated sounds accurately correspond to the objects and actions in the video, avoiding audio-visual mismatches.
  • Temporal synchronization: precisely aligns sounds with visual events for tight synchronization.
  • Aesthetic optimization: produces natural, layered, high-quality audio without an artificial "electronic" character, enhancing the listening experience.
  • Spatial positioning: outputs stereo and automatically balances the left and right channels according to the sound source's position in the frame, so the sound's location is identifiable.
  • Chain-of-thought reasoning: uses a decomposed "think first, then generate" chain of thought, making the generation process explainable and controllable.
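The decomposed chain of thought can be pictured as a structured "sound plan" produced before any audio is synthesized. The field names and example values below are invented for this sketch; PrismAudio's actual reasoning format may differ.

```python
from dataclasses import dataclass, asdict

# Hypothetical illustration of a decomposed chain-of-thought record.
# One field per reasoning dimension described in the article.
@dataclass
class SoundPlan:
    content: str   # what sounds occur (semantic dimension)
    timing: str    # when they occur relative to visual events (temporal)
    texture: str   # timbre and quality of the sound (aesthetic)
    spatial: str   # where the source sits in the frame (spatial)

plan = SoundPlan(
    content="galloping hoofbeats on dirt, light wind",
    timing="hoofbeats start at 0.4 s, aligned with each stride",
    texture="dry, percussive impacts with a soft low-frequency body",
    spatial="horse enters frame-left and crosses to frame-right",
)

# The plan conditions the audio generator ("think first, then
# generate"); each field corresponds to one of the four "teachers".
print(asdict(plan))
```

Because the plan is explicit text rather than a hidden latent, it can be inspected or edited before generation, which is what makes the process controllable.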

Key information and usage requirements for PrismAudio

  • Developer: Alibaba Tongyi Lab
  • Technology type: video-to-audio (V2A) framework
  • Core innovation: decomposed chain of thought + multi-dimensional reinforcement learning
  • Model size: 518 million parameters
  • Output specification: 44 kHz stereo
  • Inference speed: 0.63 seconds to generate 9 seconds of audio
  • Input format: silent video (common video formats supported)
  • Content restrictions: generates environmental sounds/sound effects only; character dubbing (speech) is not supported
  • Optional input: an accompanying text description can guide generation (not required)
  • Hardware requirements: GPU acceleration supported; can also run on CPU
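The stated inference speed implies the model runs far faster than real time. A quick check of the figures given above:

```python
# Real-time factor implied by the article's numbers:
# 9 seconds of audio generated in 0.63 seconds of compute.
audio_seconds = 9.0
compute_seconds = 0.63
rtf = audio_seconds / compute_seconds
print(f"real-time factor: {rtf:.1f}x")  # about 14.3x faster than real time
```

A real-time factor above 1 means a clip can be scored faster than it plays, which is why the article highlights real-time application scenarios.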

PrismAudio’s core advantages

  • Four-dimensional collaborative optimization: the first decomposed chain of thought that models semantics, timing, aesthetics, and space independently and optimizes them jointly, avoiding the traditional trade-off where improving one dimension degrades another and achieving tight audio-visual unity.
  • Think before generating: instead of end-to-end black-box generation, the model first outputs structured reasoning text (sound content, timing, texture, spatial orientation) and then generates the audio, making the process explainable and controllable.
  • Efficient and lightweight: with only 518 million parameters and 0.63 seconds to generate 9 seconds of audio, it is roughly twice as fast as comparable models and better suited to real-time applications.
  • Robust in complex scenes: on the authors' AudioCanvas benchmark of complex scenes, it far exceeds existing methods and maintains stable output in multi-event, multi-source scenarios.
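One way the four "teachers" could feed a single training signal is a weighted combination of their per-dimension scores. The scores and equal weights below are illustrative assumptions, not PrismAudio's published formula.

```python
# Sketch of multi-dimensional reward aggregation for RL fine-tuning.
# Score values and uniform weights are invented for this example.
def aggregate_reward(scores: dict, weights: dict) -> float:
    """Weighted sum of per-dimension teacher scores."""
    assert scores.keys() == weights.keys()
    return sum(weights[k] * scores[k] for k in scores)

teacher_scores = {       # e.g., each normalized to [0, 1]
    "semantic": 0.82,    # does the sound match on-screen objects/actions?
    "temporal": 0.75,    # are sound events synchronized with the visuals?
    "aesthetic": 0.68,   # is the audio natural and pleasant?
    "spatial": 0.71,     # does the panning match the source position?
}
uniform = {k: 0.25 for k in teacher_scores}

print(aggregate_reward(teacher_scores, uniform))  # about 0.74
```

Keeping the dimensions separate until this final aggregation is what lets the framework detect and penalize a model that trades one dimension (say, timing) for another (say, aesthetics).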

How to use PrismAudio

  • Online demo (recommended for newcomers): open the Demo on Hugging Face, upload a silent video, optionally enter a text description to guide generation, and the model automatically produces the audio file.
  • Local deployment: download the open-source code and model weights from GitHub or Hugging Face, install the dependencies, load the pre-trained model, then pass a video path to the inference interface to generate audio; chain-of-thought parameters and reward weights can be customized.
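The local inference flow reduces to "video path in, stereo waveform out." Since the repo's actual entry point is not documented here, the wrapper below is a stand-in stub that only demonstrates the expected I/O contract (silent video in; 9 seconds of 44 kHz stereo out, per the specs above).

```python
import numpy as np

SAMPLE_RATE = 44_000   # the article states 44 kHz stereo output
CLIP_SECONDS = 9       # the article's benchmark clip length

def generate_audio(video_path, prompt=None):
    """Stand-in for PrismAudio's inference interface; the real entry
    point lives in the open-source repo and may differ.

    Returns a (channels, samples) stereo waveform."""
    # Real code would load the video, run the decomposed
    # chain-of-thought reasoning, then decode audio. Here we just
    # return silence with the documented output shape.
    return np.zeros((2, SAMPLE_RATE * CLIP_SECONDS), dtype=np.float32)

wave = generate_audio("clip.mp4", prompt="horse galloping in the rain")
print(wave.shape)  # (2, 396000)
```

Checking the returned array's shape and dtype like this is a cheap sanity test to run after wiring up the real model weights.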

PrismAudio project address

Comparison of similar competing products of PrismAudio

| Dimension | PrismAudio | MMAudio | ThinkSound |
| --- | --- | --- | --- |
| Developer | Alibaba Tongyi Lab | Nanyang Technological University (Singapore) et al. | Alibaba Tongyi Lab |
| Core method | Decomposed chain of thought + multi-dimensional RL | Multimodal Transformer | Single chain of thought |
| Parameter count | 518 million | ~1 billion | Billions |
| Inference speed | 0.63 s / 9 s audio | 1.30 s / 9 s audio | 1.07 s / 9 s audio |
| Output quality | 44 kHz stereo | 44 kHz mono | 44 kHz stereo |
| Semantic consistency (CLAP, higher is better) | 0.47 | 0.40 | 0.43 |
| Temporal sync (DeSync, lower is better) | 0.41 | 0.46 | 0.55 |
| Spatial accuracy (CRW) | 7.72 | — (mono output) | 13.47 |
| Aesthetic quality (MOS-Q, higher is better) | 4.21 | 3.95 | 4.05 |

Application scenarios of PrismAudio

  • Film and television post-production: automatically generates environmental sound effects for films, documentaries, and trailers, replacing traditional foley work and reducing post-production cost and time.
  • Short-video creation: quickly adds ambient sound to silent Vlogs, food, and travel videos, enhancing the immersion and reach of ASMR and relaxation content.
  • Game development: generates dynamic sound effects for cutscenes and CG trailers, matching environmental sounds in real time to scenes such as forests, cities, and battlefields, reducing repetitive work for sound designers.
  • Advertising and marketing: automatically adds operational sound effects to product-demo videos and supports rapid iteration over multiple audio-track versions, improving ad-testing efficiency and creative flexibility.
  • Education and training: supplements teaching videos and operation demos with cue and background sounds, enriching the auditory experience of multimedia courseware and improving focus and retention.