Nemotron 3 Super - NVIDIA's open-source large model for agent inference
Nemotron 3 Super is an open-source AI model from NVIDIA with 120 billion parameters. It uses a hybrid Mamba-MoE architecture and is optimized for agent applications. The model supports ultra-long contexts of up to 1 million tokens, delivers roughly 3x faster inference and 5x higher throughput, and achieves a strong task success rate on OpenClaw, with performance approaching Claude Opus 4.6. NVIDIA has also open-sourced more than 10 trillion tokens of training data, the complete training methodology, and 15 reinforcement learning environments, making the model a strong candidate for enterprise-grade multi-agent systems.
Key features of Nemotron 3 Super
- Ultra-long context memory: Supports a 1-million-token context window, letting agents maintain complete workflow state across complex multi-step tasks without drifting from the goal.
- Agent task execution: Achieves an 85.6% task success rate on agent benchmarks such as OpenClaw, approaching top closed-source models such as Claude Opus 4.6.
- Faster inference: Multi-token prediction enables native speculative decoding, increasing inference speed by roughly 3x to meet real-time interaction needs.
- High-throughput serving: Throughput is 5x higher than the previous generation, supporting large-scale concurrent agent deployment and lowering the cost of multi-agent applications.
- High-precision tool calling: Navigates large function libraries reliably, preventing execution errors in high-stakes environments such as network security.
- Code-agent development: Can load an entire codebase into context at once, enabling end-to-end code generation, vulnerability repair, and automated debugging.
- Financial analysis: Thousands of pages of reports can be loaded directly into context, avoiding repeated re-processing during long conversations and greatly improving efficiency.
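To make the tool-calling feature above concrete, here is a minimal sketch of how an agent runtime might dispatch a model-emitted tool call against a registered function library. The tool names, the JSON call format, and the registry are all illustrative assumptions; they do not reflect Nemotron 3 Super's actual serving API.

```python
import json

# Hypothetical tool registry. Real deployments may expose hundreds of
# functions; the two below are placeholders.
def get_weather(city: str) -> str:
    return f"Sunny in {city}"

def scan_ports(host: str) -> str:
    return f"Scanned {host}: 22, 443 open"

REGISTRY = {"get_weather": get_weather, "scan_ports": scan_ports}

def dispatch(tool_call_json: str) -> str:
    """Validate and execute one model-emitted tool call."""
    call = json.loads(tool_call_json)
    fn = REGISTRY.get(call["name"])
    if fn is None:
        # Refusing unknown tools is what keeps execution errors out of
        # high-stakes environments such as security automation.
        raise ValueError(f"unknown tool: {call['name']}")
    return fn(**call["arguments"])

print(dispatch('{"name": "get_weather", "arguments": {"city": "Paris"}}'))
```

The key property the article's "high-precision" claim depends on is that the model reliably emits a valid tool name and argument schema; the runtime's job is only to validate and execute.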
Technical principles of Nemotron 3 Super
- Mamba-MoE hybrid architecture: The model uses an 88-layer network in which Mamba-2 layers alternate periodically with Transformer attention layers. The Mamba-2 layers provide linear-time sequence modeling efficiency, while the small number of attention layers act as global anchors, handling long-range information routing across positions and high-precision reasoning. This significantly improves inference throughput while preserving strong modeling capability.
- LatentMoE (latent mixture-of-experts): NVIDIA's new MoE design projects tokens from the hidden dimension down to a smaller latent dimension before routing and expert computation. Because routing and the expert layers run in this compressed space, parameter loading and communication volume drop several-fold; the saved budget is spent on more total experts and more activated experts, achieving roughly the effect of activating four experts at the cost of one and improving accuracy at nearly the same inference cost.
- Multi-token prediction acceleration: The model predicts several future tokens at each position. This forces it to learn multi-step causality and long-range text structure, improving quality, and it also enables native speculative decoding: the auxiliary prediction heads serve as a built-in draft model that quickly proposes candidate sequences, which the main model verifies in a single forward pass, sharply reducing generation latency with minimal extra overhead.
- NVFP4 low-precision pretraining: The model is pretrained end to end in NVFP4 precision on the Blackwell platform. The 4-bit floating-point format greatly reduces memory requirements, and with no loss of accuracy, it runs 4x faster than FP8 on the Hopper architecture, demonstrating that large-scale low-precision training is both feasible and efficient.
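The hybrid layer arrangement described above can be sketched as a simple schedule: mostly linear-time Mamba-2 blocks, with periodic full-attention anchor layers. The 88-layer depth comes from the article; the interleave ratio here (one attention layer per eight) is an illustrative assumption, not the published configuration.

```python
def hybrid_schedule(num_layers: int = 88, attn_every: int = 8) -> list:
    """Return the per-layer block type for a Mamba/attention hybrid stack."""
    layers = []
    for i in range(num_layers):
        # Place a Transformer attention layer periodically; every other
        # position uses a Mamba-2 state-space block.
        if (i + 1) % attn_every == 0:
            layers.append("attention")
        else:
            layers.append("mamba2")
    return layers

schedule = hybrid_schedule()
print(schedule[:9])
print(schedule.count("attention"), "attention layers of", len(schedule))
```

With this assumed ratio, only 11 of the 88 layers pay the quadratic attention cost, which is why the hybrid's throughput scales so much better on million-token contexts.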
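The LatentMoE idea of routing and computing in a compressed space can be illustrated with a small NumPy sketch, based on the article's description. The dimensions, expert count, top-k value, and single-matrix "experts" are all made-up toy choices, not the model's real configuration.

```python
import numpy as np

rng = np.random.default_rng(0)
hidden, latent, n_experts, top_k = 64, 16, 8, 4

W_down = rng.normal(size=(hidden, latent)) / np.sqrt(hidden)  # compress to latent dim
W_up = rng.normal(size=(latent, hidden)) / np.sqrt(latent)    # restore hidden dim
router = rng.normal(size=(latent, n_experts))                 # router lives in latent space
experts = rng.normal(size=(n_experts, latent, latent))        # toy one-matrix "experts"

def latent_moe(x):
    """x: (tokens, hidden). Route and apply experts in the latent space."""
    z = x @ W_down                                  # (tokens, latent)
    scores = z @ router                             # routing logits in latent space
    top = np.argsort(scores, axis=-1)[:, -top_k:]   # pick top-k experts per token
    out = np.zeros_like(z)
    for t in range(z.shape[0]):
        weights = np.exp(scores[t, top[t]])
        for w, e in zip(weights, top[t]):
            out[t] += w * (z[t] @ experts[e])       # expert compute: latent x latent
        out[t] /= weights.sum()                     # softmax-weighted mixture
    return out @ W_up                               # project back up

y = latent_moe(rng.normal(size=(4, hidden)))
print(y.shape)
```

Because each expert matrix is latent x latent rather than hidden x hidden, expert parameters and communication shrink by (hidden/latent)^2, which is the budget the design reinvests in more and larger expert counts.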
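The draft-then-verify loop behind speculative decoding can be shown with toy stand-ins for the draft head and the main model. Both "models" below are deterministic toy functions chosen so the mechanics are visible; in the real system the verification of all drafted tokens happens in one batched forward pass.

```python
def draft_head(prefix, k=4):
    # Cheap draft: propose the next k tokens (toy rule: last+1, last+2, ...).
    last = prefix[-1]
    return [last + i for i in range(1, k + 1)]

def main_model(prefix):
    # Toy "verifier": the main model's greedy next token after `prefix`.
    return prefix[-1] + 1

def speculative_step(prefix, k=4):
    """Accept the longest drafted run the main model agrees with."""
    draft = draft_head(prefix, k)
    accepted = []
    ctx = list(prefix)
    for tok in draft:
        if main_model(ctx) == tok:          # verification (batched in practice)
            accepted.append(tok)
            ctx.append(tok)
        else:
            accepted.append(main_model(ctx))  # fall back to the verified token
            break
    return accepted

print(speculative_step([10, 11, 12]))  # toy models agree, so all 4 drafts accepted
```

When draft and verifier agree, one verification pass emits several tokens instead of one, which is where the article's ~3x speedup comes from; when they disagree, output is still exactly what the main model would have produced.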
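To see why a 4-bit floating-point format saves so much memory, here is a standalone rounding demo for the E2M1 layout commonly used for FP4: every value must snap to one of 16 representable numbers. NVFP4 itself adds per-block scaling on Blackwell hardware, which this sketch deliberately omits.

```python
# The 8 non-negative E2M1 magnitudes; the sign bit supplies the negatives.
FP4_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
FP4_VALUES = sorted({s * v for v in FP4_GRID for s in (-1.0, 1.0)})

def quantize_fp4(x: float) -> float:
    """Round x to the nearest representable E2M1 value."""
    return min(FP4_VALUES, key=lambda v: abs(v - x))

for x in [0.7, 2.4, -5.1, 10.0]:
    print(x, "->", quantize_fp4(x))
```

The coarse grid is why scaling matters in practice: per-block scale factors keep tensor values inside the representable range, so training can run at 4-bit width without the accuracy loss naive rounding would cause.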
Project address of Nemotron 3 Super
- Official blog: https://blogs.nvidia.com/blog/nemotron-3-super-agentic-ai/
- HuggingFace model collection: https://huggingface.co/collections/nvidia/nvidia-nemotron-v3
- Technical report: https://research.nvidia.com/labs/nemotron/files/NVIDIA-Nemotron-3-Super-Technical-Report.pdf
Application scenarios of Nemotron 3 Super
- Agent platform core engine: Billed as the "strongest open-source model" for agent platforms such as OpenClaw, it drives multi-agent collaboration on complex long-horizon tasks, addressing the twin bottlenecks of context explosion and reasoning overhead.
- Enterprise software development: Powers software-development agents from companies such as CodeRabbit, Factory, and Greptile, enabling codebase-level end-to-end generation, debugging, and vulnerability repair, with a 60.47% score on SWE-Bench.
- Deep research and analysis: Drives NVIDIA's AI-Q research agent, which tops the DeepResearch Bench leaderboard, and supports multi-step reasoning and information integration across massive document sets.
- Network security operations: In high-stakes settings such as autonomous security orchestration, high-precision tool calling reliably navigates large function libraries and prevents critical execution errors.
- Financial analysis: Loads thousands of pages of financial reports into context at once for direct in-depth analysis without repeated re-processing, greatly improving investment research efficiency.