FireRedASR2S - Xiaohongshu's open-source speech recognition model


FireRedASR2S is an industrial-grade, end-to-end speech recognition system open-sourced by Xiaohongshu's Super Intelligence AudioLab. It integrates four SOTA modules: ASR, voice activity detection (VAD), language identification, and punctuation prediction. The system supports Mandarin Chinese and 20+ dialects, English, code-switching, and lyrics recognition. Its Mandarin character error rate is as low as 2.89%, and its average dialect error rate is 11.55%, comprehensively ahead of competing products such as Doubao-ASR and Qwen3-ASR. It supports one-click local deployment with no external APIs required, and has already been deployed at scale in high-frequency scenarios such as Xiaohongshu voice comments and voice search.

Main features of FireRedASR2S

  • Speech recognition (FireRedASR2): Supports Mandarin Chinese, 20+ dialects/accents, English, mixed Chinese-English speech, and lyrics recognition, in two architecture variants: LLM and AED. The AED variant additionally outputs word-level timestamps and confidence scores.
  • Voice activity detection (FireRedVAD): Detects speech, singing, and music, supports 100+ languages, offers both streaming and non-streaming modes, and achieves an F1 score of 97.57%.
  • Language identification (FireRedLID): Recognizes 100+ languages and 20+ Chinese dialects with 97.18% accuracy, significantly outperforming open-source solutions such as Whisper.
  • Punctuation prediction (FireRedPunc): Automatically adds Chinese and English punctuation with an average F1 score of 78.90%, greatly improving the readability of transcripts.
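Taken together, the four modules form a natural transcription pipeline: VAD segments the audio, language identification picks a language per segment, ASR transcribes it, and punctuation prediction formats the result. The sketch below shows that orchestration with hypothetical stand-in callables; FireRedASR2S's actual class and function names may differ.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Segment:
    start: float        # seconds
    end: float
    text: str = ""
    language: str = ""

def transcribe_pipeline(
    audio,              # raw samples (any indexable sequence)
    vad: Callable,      # audio -> [(start_sec, end_sec), ...]
    lid: Callable,      # audio slice -> language code
    asr: Callable,      # (audio slice, language) -> raw text
    punc: Callable,     # raw text -> punctuated text
) -> List[Segment]:
    """VAD -> LID -> ASR -> punctuation, applied per speech segment."""
    segments = []
    for start, end in vad(audio):
        clip = audio[int(start * 16000):int(end * 16000)]  # assume 16 kHz
        lang = lid(clip)
        raw = asr(clip, lang)
        segments.append(Segment(start, end, punc(raw), lang))
    return segments

# Toy stand-ins that demonstrate the control flow, not the real models:
audio = [0.0] * 32000  # 2 s of samples at 16 kHz
result = transcribe_pipeline(
    audio,
    vad=lambda a: [(0.0, 1.0)],
    lid=lambda clip: "zh",
    asr=lambda clip, lang: "你好 世界",
    punc=lambda t: t.replace(" ", "，") + "。",
)
print(result[0].text)  # 你好，世界。
```

The per-segment loop is also why the modules can be swapped independently, e.g. running the streaming VAD mode while keeping the non-streaming AED recognizer.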

Technical principles of FireRedASR2S

  • Speech recognition (FireRedASR2): Provides two architectures: Encoder-Adapter-LLM and an attention-based encoder-decoder (AED). The LLM variant leverages a large language model for end-to-end speech understanding, fusing speech and text representations through an adapter layer; the AED variant optimizes compute efficiency within the encoder-decoder framework and supports timestamp and confidence output.
  • Voice activity detection (FireRedVAD): Built on a DFSMN (deep feed-forward sequential memory network) that models the temporal structure of audio. Speech start and end points are determined via smoothing windows and thresholds; the model distinguishes speech, singing, music, and other audio events, and supports streaming processing for real-time use.
  • Language identification (FireRedLID): Reuses the FireRedASR2 encoder to extract speech representations and trains a classifier on top to predict language labels. Large-scale multilingual pre-training establishes a shared cross-lingual representation space, enabling high-accuracy identification of 100+ languages and dialects.
  • Punctuation prediction (FireRedPunc): A BERT-based model takes unpunctuated text as input and predicts the punctuation type at each position. Fine-tuned on multi-domain Chinese and English data, it learns semantics and syntactic structure to insert appropriate punctuation marks automatically.
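The VAD decision rule described above (per-frame speech scores smoothed over a window, then thresholded into start/end points) can be sketched as follows. The window size, threshold, and frame rate here are illustrative assumptions, not FireRedVAD's actual hyperparameters:

```python
from typing import List, Tuple

def smooth(scores: List[float], window: int) -> List[float]:
    """Moving-average smoothing of per-frame speech probabilities."""
    half = window // 2
    return [
        sum(scores[max(0, i - half):i + half + 1])
        / len(scores[max(0, i - half):i + half + 1])
        for i in range(len(scores))
    ]

def vad_segments(
    scores: List[float],      # per-frame speech probability
    threshold: float = 0.5,   # illustrative, not FireRedVAD's real value
    window: int = 5,
    frame_rate: float = 100.0,  # frames per second
) -> List[Tuple[float, float]]:
    """Return (start, end) times in seconds where the smoothed score exceeds the threshold."""
    smoothed = smooth(scores, window)
    segments, start = [], None
    for i, s in enumerate(smoothed):
        if s > threshold and start is None:
            start = i                       # segment opens
        elif s <= threshold and start is not None:
            segments.append((start / frame_rate, i / frame_rate))
            start = None                    # segment closes
    if start is not None:                   # speech ran to end of audio
        segments.append((start / frame_rate, len(smoothed) / frame_rate))
    return segments

# 0.5 s of silence, 1 s of speech, 0.5 s of silence at 100 frames/s:
scores = [0.1] * 50 + [0.9] * 100 + [0.1] * 50
print(vad_segments(scores))  # → [(0.5, 1.5)]
```

Smoothing is what suppresses single-frame score spikes so that brief noise does not open a spurious segment; a streaming variant would apply the same rule incrementally as frames arrive.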

FireRedASR2S project address

Application scenarios of FireRedASR2S

  • Content community interaction: Powers Xiaohongshu features such as voice comments and voice search, letting users join community interactions through dialects, singing, and other varied voice input, making the platform livelier and more fun.
  • Social and communication: Enables scenarios such as voice private messages and voice New Year greetings with natural, fluent voice input and real-time transcription, lowering the barrier to communication and conveying emotion more effectively.
  • Content creation and production: Supports creator tools such as posting notes by voice, live-stream subtitle generation, and automatic video subtitles, helping creators produce multimedia content efficiently.
  • Enterprise services: Suited to B-side scenarios such as meeting transcription, intelligent customer service, and call analytics; private deployment meets the data-security compliance requirements of industries such as finance and healthcare.