Mistral Small 4 - Mistral AI's open-source multimodal large model

Mistral Small 4 is an open-source multimodal large model from Mistral AI. It is the first model to unify reasoning (Magistral), multimodal (Pixtral), and agentic coding (Devstral) capabilities in a single architecture. It accepts text and image input and can switch flexibly between fast-response and deep-reasoning modes through the reasoning_effort parameter. The model is optimized for enterprise-grade efficiency, with latency reduced by 40% and throughput increased 3x, and it is available on the Mistral API, Hugging Face, and NVIDIA NIM.

Key features of Mistral Small 4

  • Unified multi-capability architecture: Integrates chat instructions (Instruct), deep reasoning (Reasoning), and multimodal understanding (Multimodal) into a single model for the first time, eliminating the need to switch between different models.
  • Adjustable reasoning effort: The reasoning_effort parameter controls how hard the model thinks: none gives a quick response suited to everyday conversation, while high triggers deep step-by-step reasoning for complex problems.
  • Native multimodal processing: Accepts both text and image input, enabling tasks such as document parsing, visual analysis, and joint image-text understanding.
  • Agentic coding: Supports development scenarios such as code generation, codebase exploration, and automated programming workflows.
  • Long-context handling: A 256K context window supports long-document analysis and extended conversations.
  • Enterprise-grade efficiency: Compared with the previous generation, latency is reduced by 40% and throughput is increased 3x, supporting efficient deployment.
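The reasoning_effort switch described above can be sketched as a request payload. This is a minimal illustration in the style of common OpenAI-compatible chat APIs; the model identifier and exact field names are assumptions for illustration, not confirmed details of the Mistral API.

```python
# Sketch: building chat-completion request bodies that toggle reasoning_effort.
# The model name and field layout are assumptions, not confirmed API details.

def build_request(prompt: str, reasoning_effort: str = "none") -> dict:
    """Return a request body for a hypothetical chat-completions endpoint."""
    if reasoning_effort not in {"none", "high"}:
        raise ValueError("reasoning_effort must be 'none' or 'high'")
    return {
        "model": "mistral-small-4",            # assumed model identifier
        "messages": [{"role": "user", "content": prompt}],
        "reasoning_effort": reasoning_effort,  # 'none' = fast, 'high' = deep
    }

# Fast response for a casual question.
quick = build_request("What's the capital of France?")

# Deep step-by-step reasoning for a hard problem.
deep = build_request("Prove that sqrt(2) is irrational.", reasoning_effort="high")
```

In practice the same prompt can be sent twice with different effort levels, paying for deep reasoning only when the task warrants it.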

Key information and usage requirements for Mistral Small 4

  • Architecture: Mixture of Experts (MoE)
  • Number of experts: 128 experts, 4 activated per token
  • Total parameters: 119 billion (119B)
  • Active parameters: 6 billion per token (8 billion including the embedding layer)
  • Context window: 256K tokens
  • Open-source license: Apache 2.0
  • Minimum hardware configuration: 4× NVIDIA HGX H100, 2× HGX H200, or 1× DGX B200
  • Recommended configuration: 4× NVIDIA HGX H100, 4× HGX H200, or 2× DGX B200
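The sparsity behind these MoE figures can be checked with simple arithmetic. The numbers below come from the spec list above; the comment about dense layers reflects how MoE models generally work, not a Mistral-specific claim.

```python
# Sketch: the sparsity arithmetic behind the MoE figures listed above.
# An MoE layer routes each token to a few experts, so only a fraction of
# the total weights run per token.

TOTAL_PARAMS_B = 119   # total parameters, billions (from the spec list)
ACTIVE_PARAMS_B = 6    # parameters active per token, billions
NUM_EXPERTS = 128      # experts per MoE layer
ACTIVE_EXPERTS = 4     # experts routed per token

# Fraction of experts consulted for each token: 4/128 = 3.125%.
expert_ratio = ACTIVE_EXPERTS / NUM_EXPERTS

# Fraction of total weights active per token: 6/119 ≈ 5%. This is higher
# than the expert ratio because non-expert (dense) components such as
# attention and embeddings run for every token.
active_ratio = ACTIVE_PARAMS_B / TOTAL_PARAMS_B

print(f"experts used per token: {expert_ratio:.2%}")
print(f"weights active per token: {active_ratio:.2%}")
```

This is why a 119B-parameter model can have inference cost closer to a 6B dense model.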

Mistral Small 4 Core Advantages

  • Integrated capabilities: Unifies reasoning, multimodality, and agentic coding in one model for the first time, eliminating the need to juggle multiple models.
  • Flexible reasoning: The reasoning_effort parameter switches freely between fast-response and deep-thinking modes, allocating compute on demand.
  • High efficiency: Output length is significantly shorter at equal quality, directly reducing inference cost and improving user experience.
  • Truly open source: The Apache 2.0 license permits commercial use and deep customization, and NVIDIA NeMo supports domain fine-tuning.
  • Ecosystem backing: As a founding member of the NVIDIA Nemotron alliance, the model receives full-stack optimization support from hardware to deployment tools.
  • Enterprise value: Lower token cost and more stable quality make large-scale AI deployment more economically viable.
  • Technical value: High "performance per token" simplifies model selection and reduces fine-tuning iterations and fallback-system dependencies.

How to use Mistral Small 4

  • Via the Mistral platform: Call the model directly through the Mistral API or AI Studio with no infrastructure of your own, suitable for quick starts and prototyping.
  • Via Hugging Face: Download the model weights from the Hugging Face repository and deploy locally with open-source frameworks such as Transformers, vLLM, llama.cpp, or SGLang.
  • Via NVIDIA: Test the model for free at build.nvidia.com, or deploy production-grade containers with NVIDIA NIM for optimized, out-of-the-box inference performance.
  • Customize with fine-tuning: Use the NVIDIA NeMo framework for domain-specific fine-tuning to create a version tailored to specific business needs.
  • Configure reasoning effort: Control behavior with the reasoning_effort parameter at call time: "none" for fast responses, "high" for deep reasoning mode.
  • Hardware requirements: Local deployment needs at least 4× HGX H100 or 1× DGX B200-class compute; doubling the configuration is recommended for optimal performance.
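For the local-deployment path above, a vLLM launch might be assembled as follows. The repository id is a placeholder assumption; check the official model card for the exact name and the tensor-parallel size your hardware supports.

```python
# Sketch: assembling a `vllm serve` invocation for local deployment, one of
# the open-source frameworks listed above. The model id is a hypothetical
# placeholder; flag values should match your GPU count and memory.

def vllm_serve_command(model_id: str, tp_size: int, max_len: int = 262144) -> list[str]:
    """Build an argv list for vLLM's OpenAI-compatible server."""
    return [
        "vllm", "serve", model_id,
        "--tensor-parallel-size", str(tp_size),  # shard weights across GPUs
        "--max-model-len", str(max_len),         # 256K-token context window
    ]

# Hypothetical invocation for a 4-GPU node.
cmd = vllm_serve_command("mistralai/Mistral-Small-4", tp_size=4)
print(" ".join(cmd))
```

Serving the full 256K window requires substantial KV-cache memory, which is one reason the hardware minimums above are so high; a smaller --max-model-len reduces the footprint.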

Mistral Small 4 project address

Mistral Small 4 compared with competing models

| Model | License | Parameters | Context | Core advantages | Weaknesses |
| --- | --- | --- | --- | --- | --- |
| Mistral Small 4 | Apache 2.0 | 119B / 6B active | 256K | Three-in-one unification, adjustable reasoning, high efficiency | High deployment hardware requirements |
| Llama 3.1/3.2 | Partially restricted | 8B-405B | 128K | Mature ecosystem, strong community support | Reasoning and multimodality need separate models |
| Qwen 2.5 | Apache 2.0 | 0.5B-72B | 128K | Good Chinese optimization, many size choices | Slightly less efficient on long text |
| DeepSeek-V3 | MIT | 671B / 37B active | 64K | Strong mathematical reasoning, low cost | Limited multimodal support |
| Gemma 3 | Apache 2.0 | 1B-27B | 128K | Google ecosystem, lightweight deployment | Overall capability below Small 4 |

Application scenarios of Mistral Small 4

  • Smart programming: Automatically generates code, fixes bugs, and understands the architecture of large codebases, improving development efficiency.
  • Enterprise customer service: Handles both routine inquiries and complex complaints through the adjustable reasoning mode, cutting the cost of manual intervention.
  • Document analysis: Parses long documents, contracts, and cross-file relationships, with 256K-context deep processing.
  • Visual understanding: Recognizes invoices, charts, and image content, extracting information intelligently from combined image and text input.
  • Research assistance: Completes mathematical derivations, paper interpretation, and experiment design, with step-by-step reasoning for academic support.