Step 3.5 Flash - The latest open-source foundation model from StepStar

Step 3.5 Flash is StepStar's latest open-source foundation model, designed specifically for Agent scenarios. The model uses a sparse MoE architecture with 196 billion total parameters, of which only 11 billion are activated per token, balancing capability and efficiency. It reaches inference speeds of up to 350 TPS, supports a 256K context window, and rivals top closed-source models in mathematical reasoning, code generation (74.4% on SWE-bench Verified), and Agent tasks. Step 3.5 Flash is open source and supports vLLM, SGLang, llama.cpp, and other frameworks; it can be deployed locally on consumer-grade hardware such as the Mac Studio M4 Max and NVIDIA DGX Spark, combining data privacy with high performance.

Main functions of Step 3.5 Flash

  • High-speed inference: Reaches generation speeds of up to 350 TPS via MTP-3 multi-token prediction, supporting instant responses in complex multi-step reasoning.
  • Agent capabilities: Purpose-built for agent tasks, scoring 74.4% on SWE-bench Verified and able to handle long-chain, complex tasks.
  • Efficient long context: Supports a 256K context window and uses a hybrid attention mechanism to cut the computational overhead of long texts.
  • Local deployment: Optimized for consumer-grade hardware; runs smoothly on devices such as the Mac Studio M4 Max and NVIDIA DGX Spark.
  • Code generation: Strong programming capabilities, with support for automatic tool invocation and structured reasoning output.
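
The throughput gain from multi-token prediction can be illustrated with some back-of-the-envelope arithmetic (a sketch of the general speculative-decoding idea, not StepStar's actual scheduler; the acceptance rate below is an illustrative assumption):

```python
def effective_tps(base_tps: float, tokens_per_step: int, accept_rate: float) -> float:
    """Rough speedup model for multi-token prediction (MTP).

    base_tps: tokens/s with ordinary one-token-per-step decoding.
    tokens_per_step: tokens produced per forward pass (the article
        describes 4 tokens generated in parallel per pass).
    accept_rate: fraction of the extra drafted tokens that the model
        verifies and accepts (illustrative assumption, not a published figure).

    Each forward pass yields 1 guaranteed token plus the accepted drafts.
    """
    expected_tokens = 1 + (tokens_per_step - 1) * accept_rate
    return base_tps * expected_tokens

# With a 100 tok/s single-token baseline and high acceptance,
# 4-token MTP lands in the article's 100-350 tok/s range:
print(effective_tps(100, 4, 0.8))  # roughly 340 tok/s
```

This is why reported speed is a range (100-300 tok/s typical, 350 peak): the realized throughput depends on how many drafted tokens survive verification.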

Technical principles of Step 3.5 Flash

  • Sparse MoE architecture: A 45-layer Transformer backbone in which each layer has 288 fine-grained routed experts plus 1 shared expert. Only the top-8 experts are activated at inference, so each token computes through roughly 11 billion parameters, combining the capability of a 196-billion-parameter model with the inference cost of a small one.
  • MTP-3 multi-token prediction: A dedicated prediction head, built from sliding-window attention and a dense feed-forward network, generates 4 tokens in parallel per forward pass, raising typical generation speed to 100-300 tok/s with a peak of 350 tok/s and significantly reducing decoding latency.
  • Hybrid attention mechanism: Alternates sliding-window and global attention layers in a 3:1 ratio. Sliding-window layers focus on local context while global layers capture long-range dependencies, keeping computational cost under control in 256K long-context scenarios without sacrificing quality.
  • Inference optimization: Supports combined expert-parallel (EP8) and tensor-parallel (TP8) deployment, with FP8 quantization to relieve memory-bandwidth pressure. Speculative decoding working together with MTP enables efficient serving on Hopper GPUs.
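
The routing step behind the sparse-MoE numbers above can be sketched in a few lines of plain Python (a minimal illustration of top-k gating with renormalized softmax weights, not StepStar's actual kernel):

```python
import math

def route_topk(logits, k=8):
    """Top-k expert routing for a fine-grained sparse MoE layer.

    Picks the k highest-scoring experts and renormalizes their softmax
    weights to sum to 1; every other expert is skipped entirely, which
    is where the compute savings come from.
    """
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    exps = {i: math.exp(logits[i]) for i in top}
    z = sum(exps.values())
    return {i: exps[i] / z for i in top}  # expert index -> gate weight

# 288 routed experts per layer, but each token activates only the top 8,
# so the routed-expert compute is ~8/288 of the dense equivalent.
weights = route_topk([0.01 * i for i in range(288)], k=8)
assert len(weights) == 8
assert abs(sum(weights.values()) - 1.0) < 1e-9
```

The shared expert mentioned in the article runs for every token and is simply added alongside these gated outputs; the 11B active / 196B total split falls out of activating 8 of 288 routed experts per layer.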

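The 3:1 hybrid attention layout is easy to make concrete as a layer schedule (a sketch based on the ratio and layer count stated above; the 4096-token window size is an assumption for illustration, not a published figure):

```python
def attention_schedule(n_layers=45, swa_per_global=3, window=4096):
    """Alternating sliding-window / global attention layer plan.

    Repeats the pattern [SWA, SWA, SWA, GLOBAL], i.e. the 3:1 ratio the
    article describes, across the 45-layer backbone.
    """
    plan = []
    for layer in range(n_layers):
        if (layer + 1) % (swa_per_global + 1) == 0:
            plan.append(("global", None))   # full-context attention
        else:
            plan.append(("swa", window))    # local sliding-window attention
    return plan

plan = attention_schedule()
print(sum(1 for kind, _ in plan if kind == "global"))  # 11 global layers of 45
```

Because only every fourth layer attends over the full 256K context, most of the stack scales with the window size rather than the sequence length, which is what keeps long-context inference affordable.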
Application scenarios of Step 3.5 Flash

  • Intelligent programming: As a backend model for tools such as Claude Code and Codex, it provides code generation, automatic debugging, and software-engineering task handling, reaching a 74.4% pass rate on SWE-bench Verified.
  • Autonomous agent execution: Suited to Agent scenarios that require long-chain reasoning, such as deep research, web information retrieval, and cross-platform data comparison.
  • Real-time conversational interaction: With generation speeds of 100-350 TPS, it supports applications that demand instant responses, such as low-latency chatbots, online tutoring, and intelligent customer service.
  • Long-text analysis: Useful for reading academic papers, reviewing legal contracts, and understanding large codebases, efficiently extracting and integrating information at scale.
  • On-device privacy computing: Can be deployed on local devices such as the Mac Studio M4 Max and NVIDIA DGX Spark, meeting the private-data processing needs of finance, healthcare, and corporate-office settings.