Covo-Audio - Tencent's open-source end-to-end speech model

Covo-Audio is an open-source, 7 billion-parameter end-to-end speech model from Tencent, capable of directly processing continuous audio input and generating audio output. Its core innovations include a hierarchical trimodal speech-text interleaved architecture, technology that decouples intelligence from speaker identity, and native full-duplex interaction capabilities. Built on Qwen2.5-7B and Whisper, the model achieves state-of-the-art (SOTA) performance in tasks such as spoken dialogue, speech understanding, and audio understanding. As a unified architecture for speech AI, the model avoids the latency and error accumulation of traditional cascaded systems, making it a powerful open-source alternative to GPT-4o's speech capabilities.

Covo-Audio’s main features

  • Spoken dialogue: supports natural, multi-turn end-to-end interaction with voice input and voice output.
  • Speech understanding: deeply integrates acoustic features and semantic content for comprehensive analysis of high-fidelity speech signals.
  • Audio understanding: extends to non-speech scenarios, with broad perception of general audio such as environmental sounds and music.
  • Full-duplex interaction: natively supports low-latency, real-time two-way voice communication with natural interruptions and instant responses.
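The full-duplex behavior above can be pictured as a minimal turn-taking state machine. This is a conceptual sketch only; the class and method names are illustrative assumptions, not Covo-Audio's actual API:

```python
# Minimal illustration of full-duplex turn-taking with barge-in.
# Conceptual sketch only -- not Covo-Audio's real interface.

class DuplexSession:
    def __init__(self):
        self.state = "listening"   # "listening" or "speaking"
        self.log = []

    def model_starts_reply(self):
        self.state = "speaking"
        self.log.append("model: speaking")

    def user_audio_detected(self):
        # Full duplex: incoming user audio interrupts ongoing model output.
        if self.state == "speaking":
            self.log.append("model: interrupted, back to listening")
        self.state = "listening"

session = DuplexSession()
session.model_starts_reply()
session.user_audio_detected()   # natural interruption (barge-in)
```

The key property shown is that listening never waits for the model to finish speaking: user audio immediately flips the session back to the listening state.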

Key information and usage requirements for Covo-Audio

  • Developer: Tencent
  • Model size: 7 billion parameters (7B)
  • Architecture type: end-to-end unified audio language model
  • Open-source version: Covo-Audio-Chat
  • Base model: Qwen2.5-7B (LLM backbone) + Whisper (audio encoder)
  • Model format: Safetensors, BF16 precision
  • Paper: arXiv:2602.09823
  • License: custom license (check the repository)
  • Applicable scenarios: research and experimental use
  • Python version: ≥ 3.11 (recommended)
  • Dependency installation: one-command install via requirements.txt
  • Core dependencies: Transformers, BigVGAN, huggingface-hub
  • Hardware: a GPU with BF16 inference support (ample VRAM recommended); can be deployed locally or in the cloud.

Covo-Audio’s core advantages

  • End-to-end unified architecture: breaks the traditional ASR→LLM→TTS cascade, mapping audio directly to audio, which eliminates error accumulation and significantly reduces inference latency.
  • Trimodal deep fusion: hierarchically interleaves continuous acoustic features, discrete speech tokens, and natural-language text, aligning high-fidelity prosody with robust semantics.
  • Decoupled intelligence and timbre: multi-speaker training separates dialogue intelligence from speaker characteristics, supporting flexible voice transfer and personalized, high-quality speech customization.
  • Native full-duplex capability: low-latency streaming enables real-time two-way interaction with natural interruptions and instant responses, approaching the feel of human conversation.
  • Open-source ecosystem value: at 7 billion parameters the model balances performance and cost, and the openness of the complete technology stack lowers the barrier to application, providing an independent, controllable foundation for Chinese voice AI.
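The trimodal interleaving described above can be pictured as merging three token streams (continuous acoustic features, discrete speech tokens, text tokens) into one sequence. The chunk size, tags, and stream contents below are illustrative assumptions, not the model's actual tokenization:

```python
# Illustrative sketch of interleaving three modality streams into one sequence.
# Tags ("ac"/"sp"/"tx"), chunk size, and contents are assumptions for illustration.

def interleave(acoustic_frames, speech_tokens, text_tokens, chunk=2):
    """Round-robin the three streams into one sequence in fixed-size chunks."""
    seq = []
    streams = [("ac", acoustic_frames), ("sp", speech_tokens), ("tx", text_tokens)]
    i = 0
    while any(i < len(s) for _, s in streams):
        for tag, s in streams:
            seq.extend((tag, tok) for tok in s[i:i + chunk])
        i += chunk
    return seq

mixed = interleave(["f0", "f1", "f2"], [101, 102, 103], ["hi", "there"])
```

A single backbone consuming such a mixed sequence is what lets acoustic detail and textual semantics be modeled jointly rather than in separate cascade stages.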

How to use Covo-Audio

  • Environment preparation: create a Python 3.11 environment and install dependencies: run conda create -n covoaudio python=3.11 and conda activate covoaudio, then pip install -r requirements.txt.
  • Get the code: clone the official GitHub repository and enter the project directory: git clone https://github.com/Tencent/Covo-Audio.git, then cd Covo-Audio.
  • Download the model: install the Hugging Face tool and fetch the pre-trained weights: pip install huggingface-hub, then hf download tencent/Covo-Audio-Chat --local-dir ./covoaudio; the weights are saved to the specified directory.
  • Configure paths: to customize the model storage location, edit the model_dir and decode_load_path parameters in example.sh to match the actual paths.
  • Run inference: execute the one-click inference script with bash example.sh.
  • Custom use: replace the input audio path in example.py with your own file for end-to-end voice dialogue with the model.
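The download step above can also be done programmatically with huggingface_hub's snapshot_download. The wrapper below is a convenience sketch, not part of the Covo-Audio repo; only the repo id and target directory come from the documented steps:

```python
# Programmatic equivalent of: hf download tencent/Covo-Audio-Chat --local-dir ./covoaudio
# The wrapper is a convenience sketch, not part of the Covo-Audio repository.

REPO_ID = "tencent/Covo-Audio-Chat"
DEFAULT_DIR = "./covoaudio"

def download_weights(repo_id: str = REPO_ID, local_dir: str = DEFAULT_DIR) -> str:
    # Imported lazily so the helper can be defined without huggingface_hub installed.
    from huggingface_hub import snapshot_download
    # Returns the local path of the downloaded snapshot.
    return snapshot_download(repo_id=repo_id, local_dir=local_dir)

# download_weights()  # fetches the pre-trained weights into ./covoaudio
```

After downloading this way, point model_dir and decode_load_path in example.sh at the chosen directory, as in the configuration step above.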

Covo-Audio project address

  • GitHub repository: https://github.com/Tencent/Covo-Audio

Comparison of Covo-Audio with similar products

Dimension | Covo-Audio | GPT-4o (Voice) | Mini-Omni
Developer | Tencent | OpenAI | Open-source community
Model size | 7B parameters | Undisclosed (estimated hundreds of billions) | 2B parameters
Architecture | End-to-end unified | End-to-end native | End-to-end unified
Open-source status | Fully open source | Closed-source API | Open source
Full-duplex support | Native, low latency | Native support | Limited support
Chinese optimization | Deep optimization | General multilingual | Basic support
Deployment cost | Medium (single GPU feasible) | High (API calls) | Low (lightweight)

Covo-Audio application scenarios

  • Intelligent customer service: end-to-end low-latency interaction and full-duplex barge-in enable natural, fluid real-time voice Q&A and personalized multi-voice services.
  • Smart hardware: provides offline or device-cloud voice assistant capabilities for smart speakers, in-car systems, and home control hubs.
  • Content creation: supports efficient generation of multi-character dialogue dubbing, podcast content, and real-time speech translation.
  • Education and training: deep understanding of speech emotion and prosodic detail supports immersive, personalized teaching systems such as oral-practice tutors and virtual lecturers.
  • Accessibility services: natural voice interaction replaces visual interfaces, giving visually impaired users and the elderly a convenient way to obtain information and control devices without typing or touching a screen.