Voxtral Transcribe 2 - A series of speech-to-text models launched by Mistral AI
Voxtral Transcribe 2 is Mistral AI's next-generation series of speech-to-text models. It includes two versions. Voxtral Mini Transcribe V2 focuses on batch transcription and supports 13 languages, speaker diarization, word-level timestamps, and context biasing. Voxtral Realtime is designed for real-time scenarios and uses a streaming architecture whose latency can be configured to below 200 milliseconds, making it suitable for interactive applications such as voice assistants. Both models lead in accuracy on benchmarks such as FLEURS, and their price/performance ratio is significantly better than that of competitors such as GPT-4o mini and Gemini.
Key features of Voxtral Transcribe 2
- Multilingual transcription: Supports high-accuracy speech-to-text in 13 languages: English, Chinese, Hindi, Spanish, Arabic, French, Portuguese, Russian, German, Japanese, Korean, Italian, and Dutch.
- Word-level timestamps: Generates precise start and end times for each transcribed word, which suits subtitle generation and content alignment.
- Speaker diarization: Automatically identifies different speakers and marks when each one speaks, so multi-party conversations are clearly separated.
- Context biasing: Accepts up to 100 custom words to improve recognition accuracy for proper nouns and industry terms.
- Ultra-low-latency real-time transcription: Voxtral Realtime uses a streaming architecture whose latency can be configured to below 200 milliseconds, so text is produced while the audio is still being spoken.
- Noise robustness: Maintains high transcription accuracy in noisy environments such as factories and call centers.
- Long audio processing: A single request can process an audio file of up to 3 hours.
- Multiple format support: Compatible with the .mp3, .wav, .m4a, .flac, and .ogg formats, with a maximum size of 1 GB per file.
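The limits listed above (file formats, 1 GB per file, 3 hours per request, 100 custom words) can be checked before a file is uploaded. The following is only a minimal sketch of such a pre-flight check: `check_request` and its parameters are hypothetical names, not part of any Mistral SDK, and reading "1 GB" as 10**9 bytes is an assumption.

```python
from pathlib import Path

# Limits as quoted in the feature list of this article.
ALLOWED_EXTENSIONS = {".mp3", ".wav", ".m4a", ".flac", ".ogg"}
MAX_FILE_BYTES = 1_000_000_000        # assumption: "1 GB" read as 10**9 bytes
MAX_DURATION_SECONDS = 3 * 60 * 60    # 3 hours per request
MAX_BIAS_TERMS = 100                  # custom words for context biasing


def check_request(filename, size_bytes, duration_seconds, bias_terms=()):
    """Return a list of problems; an empty list means the request fits the limits.

    Hypothetical helper for illustration only, not an official API.
    """
    problems = []
    ext = Path(filename).suffix.lower()
    if ext not in ALLOWED_EXTENSIONS:
        problems.append(f"unsupported format: {ext or '(none)'}")
    if size_bytes > MAX_FILE_BYTES:
        problems.append("file larger than 1 GB")
    if duration_seconds > MAX_DURATION_SECONDS:
        problems.append("audio longer than 3 hours")
    if len(bias_terms) > MAX_BIAS_TERMS:
        problems.append("more than 100 context-bias terms")
    return problems
```

Calling `check_request("meeting.mp3", 50_000_000, 1800, ["Voxtral"])` returns an empty list, while a file that breaks a limit yields one message per violated limit, so the caller can report all problems at once.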
Technical principles of Voxtral Transcribe 2
- Streaming architecture: Voxtral Realtime uses a native streaming architecture, so its latency can be configured to below 200 milliseconds to meet real-time interaction needs.
- Dynamic delay configuration: The Realtime model supports flexible delay settings. At a delay of 2.4 seconds, its accuracy matches the batch model. At a delay of 480 milliseconds, its word error rate is only 1-2% higher than that of the offline model. Users can balance speed and accuracy according to the scenario.
- Unified multilingual modeling: Both models use a single architecture to process all 13 languages and achieve cross-language transfer through shared representation learning, so non-English languages reach recognition performance similar to English.
- Context biasing mechanism: The system accepts up to 100 custom words as prior knowledge and raises the recognition probability of those terms during decoding, which improves the spelling accuracy of proper nouns and industry terms.
- Edge-optimized design: Voxtral Realtime achieves efficient inference at a scale of 4 billion parameters and can run on consumer-grade hardware. It balances model capability against deployment cost and supports privacy-first local processing.
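The speed/accuracy trade-off described under dynamic delay configuration can be illustrated with a small lookup over the operating points quoted in this article. This is only a sketch: `pick_delay` is a hypothetical helper, the "below 200 milliseconds" setting is approximated as 200 ms, and how the delay is actually passed to the model is not covered here.

```python
# (delay_ms, note) pairs taken from the figures quoted in this article,
# ordered from the slowest (most accurate) to the fastest setting.
OPERATING_POINTS = [
    (2400, "accuracy matches the batch model"),
    (480, "word error rate about 1-2% above the offline model"),
    # Assumption: "below 200 ms" is approximated as 200 ms here.
    (200, "lowest quoted latency; accuracy impact not stated in this article"),
]


def pick_delay(budget_ms):
    """Pick the largest quoted delay that fits within the latency budget.

    A larger delay gives the model more context, so the slowest setting
    that still fits the budget is preferred. Hypothetical helper.
    """
    for delay_ms, note in OPERATING_POINTS:
        if delay_ms <= budget_ms:
            return delay_ms, note
    raise ValueError(f"no quoted operating point fits a {budget_ms} ms budget")
```

For example, a live-captioning tool that tolerates one second of lag would get the 480 ms setting from `pick_delay(1000)`, while an offline-quality use case with a 3-second budget would get the 2.4-second setting.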
Voxtral Transcribe 2 project address
- Official project website: https://mistral.ai/news/voxtral-transcribe-2
- HuggingFace model library: https://huggingface.co/mistralai/Voxtral-Mini-4B-Realtime-2602
Application scenarios of Voxtral Transcribe 2
- Meeting intelligence: The model transcribes multilingual meeting recordings and uses speaker diarization to label who said what. It processes large volumes of meeting content at a very low unit cost, supporting efficient meeting records and knowledge management.
- Voice assistants and virtual assistants: Build conversational AI with latency below 200 milliseconds, connecting large language models and speech synthesis pipelines to create voice interfaces that respond naturally and interact smoothly.
- Call center automation: Transcribe calls in real time so that an AI system can analyze customer sentiment, suggest responses, and fill in CRM fields automatically during the call. Speaker diarization keeps agent and customer speech clearly separated.
- Media and broadcasting: Generate real-time multilingual subtitles with minimal latency, and use context biasing to handle names and technical terms that general-purpose transcription services struggle to recognize.
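For the subtitle and meeting scenarios, word-level timestamps and speaker labels can be grouped into subtitle cues. The sketch below converts a list of words into SubRip (SRT) text. The word structure, dictionaries with `text`, `start`, `end` (in seconds), and `speaker` keys, is an assumed shape for illustration and not the documented response format of the Voxtral API.

```python
def format_timestamp(seconds):
    """Format a time in seconds as an SRT timestamp: HH:MM:SS,mmm."""
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def words_to_srt(words, max_words=8):
    """Group timestamped words into SRT cues.

    `words` is an assumed structure: dicts with "text", "start", "end"
    and "speaker" keys. A new cue starts when the speaker changes or
    when the current cue already holds `max_words` words.
    """
    cues, current = [], []
    for word in words:
        if current and (word["speaker"] != current[0]["speaker"]
                        or len(current) >= max_words):
            cues.append(current)
            current = []
        current.append(word)
    if current:
        cues.append(current)

    blocks = []
    for index, cue in enumerate(cues, start=1):
        start = format_timestamp(cue[0]["start"])
        end = format_timestamp(cue[-1]["end"])
        text = " ".join(w["text"] for w in cue)
        blocks.append(f"{index}\n{start} --> {end}\n[{cue[0]['speaker']}] {text}\n")
    # Cues are separated by a blank line, as the SRT format expects.
    return "\n".join(blocks)
```

Splitting on speaker changes keeps each cue attributable to one person. The `max_words` cap is a simple stand-in for the line-length and reading-speed rules a production subtitle pipeline would apply.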