Thinker - UBTECH's open-source embodied intelligent visual language model

Thinker is UBTECH's large open-source embodied-intelligence visual language model, purpose-built for robot scenarios. Its 4B-parameter model ranks first worldwide on 9 authoritative benchmarks. The model's core capabilities cover task planning, spatial understanding, temporal reasoning, and visual grounding, effectively resolving the dilemma in which a robot can see an object but cannot act on it. Thinker is trained on 10 million high-quality samples refined from 2 billion raw samples, using an automated annotation system that keeps the manual involvement rate below 1%. The model already powers Walker S2 to 99.99% operating accuracy in industrial scenarios, advancing the broad adoption of embodied-intelligence technology.

Main functions of Thinker

  • Task planning : Thinker understands complex human instructions, combines them with historical state memory, predicts the robot's future state changes, and decomposes long-horizon tasks into executable sub-task sequences.
  • Spatial understanding : Thinker establishes an egocentric coordinate system with the camera as the origin, defining all spatial relationships so the robot can accurately perceive the position and orientation of objects in three-dimensional space.
  • Temporal understanding : Thinker extracts key information from video history, combines past events with the current instruction, and accurately assesses the present state to make sound timing decisions.
  • Visual grounding : Thinker describes object locations as bounding boxes and precise point coordinates, providing exact spatial guidance for robot grasping and interaction.
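The bounding-box-plus-point output described in the visual grounding bullet can be sketched as follows. The JSON schema, field names, and `parse_grounding` helper here are illustrative assumptions, not Thinker's actual output format:

```python
# Hypothetical sketch of structured grounding output: a bounding box plus a
# precise grasp point for a named object. Schema is assumed for illustration.
import json
from dataclasses import dataclass

@dataclass
class Grounding:
    label: str
    box: tuple    # (x_min, y_min, x_max, y_max) in pixels
    point: tuple  # (x, y) grasp point in pixels

def parse_grounding(raw: str) -> list:
    """Parse a model's JSON response into structured grounding results."""
    items = json.loads(raw)
    return [Grounding(d["label"], tuple(d["box"]), tuple(d["point"]))
            for d in items]

response = '[{"label": "red cup", "box": [312, 180, 420, 310], "point": [366, 245]}]'
targets = parse_grounding(response)
print(targets[0].point)  # -> (366, 245), the point a grasp controller would consume
```

A downstream controller would typically convert such pixel coordinates into the robot's base frame using camera calibration before planning a grasp.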

Thinker’s technical principles

  • Data construction : Thinker builds a complete pipeline from raw data to high-quality training data. Starting from 2 billion noisy, hard-to-align raw samples, the team screens extensively with customized rules, uses large models for multi-dimensional quality scoring, and refines the result into 10 million high-quality samples. An automated labeling system combining large-model-assisted labeling with multi-model cross-validation keeps the manual involvement rate below 1%, reducing labeling costs by 99% and improving efficiency more than 100-fold.
  • Model architecture design : Thinker adopts a classic visual language model architecture with four core modules: a text tokenizer, a visual encoder, a multi-layer perceptron (MLP) alignment layer, and a language model backbone. The architecture achieves a unified representation of vision, language, and time, so the model can accurately capture visual details, understand task instructions, and perform cross-modal reasoning.
  • Training strategy : Thinker adopts a two-stage training method. In the first stage, the model is fine-tuned on general, spatial understanding, and large-scale planning datasets to establish basic perception and reasoning capabilities; the last frame of each video is introduced as an auxiliary input to strengthen video understanding. In the second stage, supervised fine-tuning on an industrial task dataset adapts the model to sequence dependencies, diverse object layouts, and feedback corrections, ultimately producing planning solutions executable in real industrial scenarios.
  • Key innovations : To address robot perspective confusion and missing video information, Thinker proposes a simple and effective method of jointly inputting key frames and the complete video during video-understanding training, significantly strengthening the model's temporal understanding. Through high-quality data screening and task-oriented sampling, the 4B-parameter model matches the performance of models exceeding 10B parameters.
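The rules-then-scoring funnel described under "Data construction" can be sketched as a two-stage filter. The function names, thresholds, and sample fields below are illustrative assumptions, not UBTECH's actual pipeline:

```python
# Minimal sketch of a two-stage data-refinement funnel: cheap rule filters
# first, then a model-based quality score. All names and thresholds are
# assumptions for illustration.
def rule_filter(sample: dict) -> bool:
    """Cheap deterministic checks applied to every raw sample."""
    text = sample.get("text", "")
    return 10 <= len(text) <= 2000 and sample.get("image") is not None

def quality_score(sample: dict) -> float:
    """Stand-in for the large-model multi-dimensional quality scorer."""
    # In the described pipeline, a large model scores each sample on several
    # dimensions; here a precomputed score stands in for that call.
    return sample.get("score", 0.0)

def refine(raw_samples, threshold=0.8):
    """Rules first (cheap), then model scoring (expensive) on survivors."""
    kept = (s for s in raw_samples if rule_filter(s))
    return [s for s in kept if quality_score(s) >= threshold]

raw = [
    {"text": "pick up the wrench on the left", "image": "img1", "score": 0.92},
    {"text": "??", "image": "img2", "score": 0.95},          # fails rule filter
    {"text": "a long but low-quality caption", "image": "img3", "score": 0.40},
]
print(len(refine(raw)))  # -> 1: only the first sample survives both stages
```

Ordering the stages this way is what makes a 2-billion-to-10-million funnel tractable: the expensive model-based scorer only ever sees samples that already passed the cheap rules.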
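The four-module architecture described above (text tokenizer, visual encoder, MLP alignment layer, language model backbone) can be shown as a shape-level data flow. Every module body and dimension here is a placeholder standing in for the real networks:

```python
# Shape-level sketch of a classic VLM forward pass: image patches are encoded,
# projected into the language model's embedding space by an MLP, and consumed
# by the backbone alongside text tokens. All internals are placeholders.
def vision_encoder(image_patches: list) -> list:
    """Map image patches to visual embeddings (stub: 8-dim per patch)."""
    return [[float(p)] * 8 for p in image_patches]

def mlp_projector(visual_embeds: list) -> list:
    """Align visual embeddings to the language embedding space (stub linear map)."""
    return [[2.0 * x for x in v] for v in visual_embeds]

def tokenize(text: str) -> list:
    """Toy whitespace tokenizer standing in for the text tokenizer."""
    return text.split()

def llm_backbone(tokens: list, projected: list) -> str:
    """The backbone consumes interleaved text tokens and projected visual tokens."""
    return f"plan over {len(tokens)} text tokens + {len(projected)} visual tokens"

out = llm_backbone(tokenize("move the box to shelf A"),
                   mlp_projector(vision_encoder([1, 2, 3, 4])))
print(out)  # -> plan over 6 text tokens + 4 visual tokens
```

The key design point this illustrates is that the MLP layer is the only piece translating between modalities: the vision encoder and language backbone can each be pretrained separately and joined through it.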

Thinker’s project address

Thinker application scenarios

  • Industrial intelligent manufacturing : Thinker can drive humanoid robots to complete tasks such as box handling and workpiece sorting on the factory production line. Walker S2 has achieved an operation accuracy of 99.99%, effectively solving the problem of insufficient flexibility of traditional automation equipment.
  • Warehousing and logistics operations : Thinker supports robots to carry out cargo identification, path planning and precise grabbing in a dynamic warehouse environment, adapting to the logistics needs of diversified SKUs and high-frequency changes.
  • Commercial service scenario : Thinker empowers robots to provide guidance, explanation and interactive services in public places such as shopping malls and exhibition halls, achieving natural human-computer interaction through visual language understanding.
  • Complex operational tasks : Thinker enables robots to perform operations that require long-range planning and fine spatial awareness, such as equipment inspections, parts assembly, and multi-step experimental processes.
  • Swarm intelligence collaboration : Thinker serves as a cognitive base supporting UBTECH's group-brain network and the collaborative-intelligence Co-Agent, enabling task distribution, collaborative decision-making, and autonomous evolution across multiple robots.