Covo-Audio - Tencent's open-source end-to-end speech model

Covo-Audio is an open-source, 7 billion-parameter end-to-end speech model from Tencent, capable of directly processing continuous audio input and generating audio output. Its core innovations include a hierarchical trimodal speech-text interleaved architecture, technology that decouples intelligence from speaker identity, and native full-duplex interaction capabilities. Built on Qwen2.5-7B and Whisper, the model achieves state-of-the-art (SOTA) performance in tasks such as spoken dialogue, speech understanding, and audio understanding. As a unified architecture for speech AI, the model avoids the latency and error accumulation of traditional cascaded systems, making it a powerful open-source alternative to GPT-4o's speech capabilities.

Covo-Audio’s main features

  • Spoken dialogue: supports natural, multi-turn end-to-end interaction with voice input and voice output.
  • Speech understanding: deeply integrates acoustic features and semantic content for comprehensive analysis of high-fidelity speech signals.
  • Audio understanding: extends to non-speech scenarios, with broad perception of general audio such as environmental sounds and music.
  • Full-duplex interaction: natively supports low-latency, real-time two-way voice communication with natural interruptions and instant responses.
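The full-duplex behavior above can be pictured as a minimal turn-taking state machine. This is a conceptual sketch only; the class and method names are illustrative assumptions, not Covo-Audio's actual API:

```python
# Minimal illustration of full-duplex turn-taking with barge-in.
# Conceptual sketch only -- not Covo-Audio's real interface.

class DuplexSession:
    def __init__(self):
        self.state = "listening"   # "listening" or "speaking"
        self.log = []

    def model_starts_reply(self):
        self.state = "speaking"
        self.log.append("model: speaking")

    def user_audio_detected(self):
        # Full duplex: incoming user audio interrupts ongoing model output.
        if self.state == "speaking":
            self.log.append("model: interrupted, back to listening")
        self.state = "listening"

session = DuplexSession()
session.model_starts_reply()
session.user_audio_detected()   # natural interruption (barge-in)
```

The key property shown is that listening never waits for the model to finish speaking: user audio immediately flips the session back to the listening state.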

Key information and usage requirements for Covo-Audio

  • Developer: Tencent
  • Model size: 7 billion parameters (7B)
  • Architecture type: end-to-end unified audio language model
  • Open-source version: Covo-Audio-Chat
  • Base model: Qwen2.5-7B (LLM backbone) + Whisper (audio encoder)
  • Model format: Safetensors, BF16 precision
  • Paper: arXiv:2602.09823
  • License: custom license (check the repository)
  • Applicable scenarios: research and experimental use
  • Python version: ≥ 3.11 (recommended)
  • Dependency installation: one-command install via requirements.txt
  • Core dependencies: Transformers, BigVGAN, huggingface-hub
  • Hardware: a GPU with BF16 inference support (ample VRAM recommended); can be deployed locally or in the cloud.

Covo-Audio’s core advantages

  • End-to-end unified architecture: breaks the traditional ASR→LLM→TTS cascade, mapping audio directly to audio, which eliminates error accumulation and significantly reduces inference latency.
  • Trimodal deep fusion: hierarchically interleaves continuous acoustic features, discrete speech tokens, and natural-language text, aligning high-fidelity prosody with robust semantics.
  • Decoupled intelligence and timbre: multi-speaker training separates dialogue intelligence from speaker characteristics, supporting flexible voice transfer and personalized, high-quality speech customization.
  • Native full-duplex capability: low-latency streaming enables real-time two-way interaction with natural interruptions and instant responses, approaching the feel of human conversation.
  • Open-source ecosystem value: at 7 billion parameters the model balances performance and cost, and the openness of the complete technology stack lowers the barrier to application, providing an independent, controllable foundation for Chinese voice AI.
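The trimodal interleaving described above can be pictured as merging three token streams (continuous acoustic features, discrete speech tokens, text tokens) into one sequence. The chunk size, tags, and stream contents below are illustrative assumptions, not the model's actual tokenization:

```python
# Illustrative sketch of interleaving three modality streams into one sequence.
# Tags ("ac"/"sp"/"tx"), chunk size, and contents are assumptions for illustration.

def interleave(acoustic_frames, speech_tokens, text_tokens, chunk=2):
    """Round-robin the three streams into one sequence in fixed-size chunks."""
    seq = []
    streams = [("ac", acoustic_frames), ("sp", speech_tokens), ("tx", text_tokens)]
    i = 0
    while any(i < len(s) for _, s in streams):
        for tag, s in streams:
            seq.extend((tag, tok) for tok in s[i:i + chunk])
        i += chunk
    return seq

mixed = interleave(["f0", "f1", "f2"], [101, 102, 103], ["hi", "there"])
```

A single backbone consuming such a mixed sequence is what lets acoustic detail and textual semantics be modeled jointly rather than in separate cascade stages.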

How to use Covo-Audio

  • Environment preparation: create a Python 3.11 environment and install dependencies: run conda create -n covoaudio python=3.11 and conda activate covoaudio, then pip install -r requirements.txt.
  • Get the code: clone the official GitHub repository and enter the project directory: git clone https://github.com/Tencent/Covo-Audio.git, then cd Covo-Audio.
  • Download the model: install the Hugging Face tool and fetch the pre-trained weights: pip install huggingface-hub, then hf download tencent/Covo-Audio-Chat --local-dir ./covoaudio; the weights are saved to the specified directory.
  • Configure paths: to customize the model storage location, edit the model_dir and decode_load_path parameters in example.sh to match the actual paths.
  • Run inference: execute the one-click inference script with bash example.sh.
  • Custom use: replace the input audio path in example.py with your own file for end-to-end voice dialogue with the model.
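The download step above can also be done programmatically with huggingface_hub's snapshot_download. The wrapper below is a convenience sketch, not part of the Covo-Audio repo; only the repo id and target directory come from the documented steps:

```python
# Programmatic equivalent of: hf download tencent/Covo-Audio-Chat --local-dir ./covoaudio
# The wrapper is a convenience sketch, not part of the Covo-Audio repository.

REPO_ID = "tencent/Covo-Audio-Chat"
DEFAULT_DIR = "./covoaudio"

def download_weights(repo_id: str = REPO_ID, local_dir: str = DEFAULT_DIR) -> str:
    # Imported lazily so the helper can be defined without huggingface_hub installed.
    from huggingface_hub import snapshot_download
    # Returns the local path of the downloaded snapshot.
    return snapshot_download(repo_id=repo_id, local_dir=local_dir)

# download_weights()  # fetches the pre-trained weights into ./covoaudio
```

After downloading this way, point model_dir and decode_load_path in example.sh at the chosen directory, as in the configuration step above.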

Covo-Audio project address

  • GitHub repository: https://github.com/Tencent/Covo-Audio

Comparison of Covo-Audio with similar products

Dimension | Covo-Audio | GPT-4o (Voice) | Mini-Omni
Developer | Tencent | OpenAI | Open-source community
Model size | 7B parameters | Undisclosed (estimated hundreds of billions) | 2B parameters
Architecture | End-to-end unified | End-to-end native | End-to-end unified
Open-source status | Fully open source | Closed-source API | Open source
Full-duplex support | Native, low latency | Native support | Limited support
Chinese optimization | Deep optimization | General multilingual | Basic support
Deployment cost | Medium (single GPU feasible) | High (API calls) | Low (lightweight)

Covo-Audio application scenarios

  • Intelligent customer service: end-to-end low-latency interaction and full-duplex barge-in enable natural, fluid real-time voice Q&A and personalized multi-voice services.
  • Smart hardware: provides offline or device-cloud voice assistant capabilities for smart speakers, in-car systems, and home control hubs.
  • Content creation: supports efficient generation of multi-character dialogue dubbing, podcast content, and real-time speech translation.
  • Education and training: deep understanding of speech emotion and prosodic detail supports immersive, personalized teaching systems such as oral-practice tutors and virtual lecturers.
  • Accessibility services: natural voice interaction replaces visual interfaces, giving visually impaired users and the elderly a convenient way to obtain information and control devices without typing or touching a screen.