Step 3.5 Flash - The latest open-source foundation model from StepStar

Step 3.5 Flash is StepStar's latest open-source foundation model, designed specifically for Agent scenarios. The model uses a sparse MoE architecture with 196 billion total parameters, of which only 11 billion are activated per token, balancing capability and efficiency. It reaches inference speeds of up to 350 TPS, supports a 256K context window, and rivals top closed-source models in mathematical reasoning, code generation (74.4% on SWE-bench Verified), and Agent tasks. Step 3.5 Flash is open source and supports vLLM, SGLang, llama.cpp, and other frameworks; it can be deployed locally on consumer-grade hardware such as the Mac Studio M4 Max and NVIDIA DGX Spark, combining data privacy with high performance.

Main functions of Step 3.5 Flash

  • High-speed inference: Reaches generation speeds of up to 350 TPS via MTP-3 multi-token prediction, supporting instant responses in complex multi-step reasoning.
  • Agent capabilities: Purpose-built for agent tasks, scoring 74.4% on SWE-bench Verified and able to handle long-chain, complex tasks.
  • Efficient long context: Supports a 256K context window and uses a hybrid attention mechanism to cut the computational overhead of long texts.
  • Local deployment: Optimized for consumer-grade hardware; runs smoothly on devices such as the Mac Studio M4 Max and NVIDIA DGX Spark.
  • Code generation: Strong programming capabilities, with support for automatic tool invocation and structured reasoning output.
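
The throughput gain from multi-token prediction can be illustrated with some back-of-the-envelope arithmetic (a sketch of the general speculative-decoding idea, not StepStar's actual scheduler; the acceptance rate below is an illustrative assumption):

```python
def effective_tps(base_tps: float, tokens_per_step: int, accept_rate: float) -> float:
    """Rough speedup model for multi-token prediction (MTP).

    base_tps: tokens/s with ordinary one-token-per-step decoding.
    tokens_per_step: tokens produced per forward pass (the article
        describes 4 tokens generated in parallel per pass).
    accept_rate: fraction of the extra drafted tokens that the model
        verifies and accepts (illustrative assumption, not a published figure).

    Each forward pass yields 1 guaranteed token plus the accepted drafts.
    """
    expected_tokens = 1 + (tokens_per_step - 1) * accept_rate
    return base_tps * expected_tokens

# With a 100 tok/s single-token baseline and high acceptance,
# 4-token MTP lands in the article's 100-350 tok/s range:
print(effective_tps(100, 4, 0.8))  # roughly 340 tok/s
```

This is why reported speed is a range (100-300 tok/s typical, 350 peak): the realized throughput depends on how many drafted tokens survive verification.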

Technical principles of Step 3.5 Flash

  • Sparse MoE architecture: A 45-layer Transformer backbone in which each layer has 288 fine-grained routed experts plus 1 shared expert. Only the top-8 experts are activated at inference, so each token computes through roughly 11 billion parameters, combining the capability of a 196-billion-parameter model with the inference cost of a small one.
  • MTP-3 multi-token prediction: A dedicated prediction head, built from sliding-window attention and a dense feed-forward network, generates 4 tokens in parallel per forward pass, raising typical generation speed to 100-300 tok/s with a peak of 350 tok/s and significantly reducing decoding latency.
  • Hybrid attention mechanism: Alternates sliding-window and global attention layers in a 3:1 ratio. Sliding-window layers focus on local context while global layers capture long-range dependencies, keeping computational cost under control in 256K long-context scenarios without sacrificing quality.
  • Inference optimization: Supports combined expert-parallel (EP8) and tensor-parallel (TP8) deployment, with FP8 quantization to relieve memory-bandwidth pressure. Speculative decoding working together with MTP enables efficient serving on Hopper GPUs.
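
The routing step behind the sparse-MoE numbers above can be sketched in a few lines of plain Python (a minimal illustration of top-k gating with renormalized softmax weights, not StepStar's actual kernel):

```python
import math

def route_topk(logits, k=8):
    """Top-k expert routing for a fine-grained sparse MoE layer.

    Picks the k highest-scoring experts and renormalizes their softmax
    weights to sum to 1; every other expert is skipped entirely, which
    is where the compute savings come from.
    """
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    exps = {i: math.exp(logits[i]) for i in top}
    z = sum(exps.values())
    return {i: exps[i] / z for i in top}  # expert index -> gate weight

# 288 routed experts per layer, but each token activates only the top 8,
# so the routed-expert compute is ~8/288 of the dense equivalent.
weights = route_topk([0.01 * i for i in range(288)], k=8)
assert len(weights) == 8
assert abs(sum(weights.values()) - 1.0) < 1e-9
```

The shared expert mentioned in the article runs for every token and is simply added alongside these gated outputs; the 11B active / 196B total split falls out of activating 8 of 288 routed experts per layer.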

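The 3:1 hybrid attention layout is easy to make concrete as a layer schedule (a sketch based on the ratio and layer count stated above; the 4096-token window size is an assumption for illustration, not a published figure):

```python
def attention_schedule(n_layers=45, swa_per_global=3, window=4096):
    """Alternating sliding-window / global attention layer plan.

    Repeats the pattern [SWA, SWA, SWA, GLOBAL], i.e. the 3:1 ratio the
    article describes, across the 45-layer backbone.
    """
    plan = []
    for layer in range(n_layers):
        if (layer + 1) % (swa_per_global + 1) == 0:
            plan.append(("global", None))   # full-context attention
        else:
            plan.append(("swa", window))    # local sliding-window attention
    return plan

plan = attention_schedule()
print(sum(1 for kind, _ in plan if kind == "global"))  # 11 global layers of 45
```

Because only every fourth layer attends over the full 256K context, most of the stack scales with the window size rather than the sequence length, which is what keeps long-context inference affordable.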
Application scenarios of Step 3.5 Flash

  • Intelligent programming: As a backend model for tools such as Claude Code and Codex, it provides code generation, automatic debugging, and software-engineering task handling, reaching a 74.4% pass rate on SWE-bench Verified.
  • Autonomous agent execution: Suited to Agent scenarios that require long-chain reasoning, such as deep research, web information retrieval, and cross-platform data comparison.
  • Real-time conversational interaction: With generation speeds of 100-350 TPS, it supports applications that demand instant responses, such as low-latency chatbots, online tutoring, and intelligent customer service.
  • Long-text analysis: Useful for reading academic papers, reviewing legal contracts, and understanding large codebases, efficiently extracting and integrating information at scale.
  • On-device privacy computing: Can be deployed on local devices such as the Mac Studio M4 Max and NVIDIA DGX Spark, meeting the private-data processing needs of finance, healthcare, and corporate-office settings.