Thinker - UBTECH's open-source embodied intelligent visual language model

Thinker is UBTECH's large open-source embodied-intelligence visual language model, purpose-built for robot scenarios. Its 4B-parameter model ranks first worldwide on 9 authoritative benchmarks. The model's core capabilities cover task planning, spatial understanding, temporal reasoning, and visual grounding, effectively resolving the dilemma in which a robot can see an object but cannot act on it. Thinker is trained on 10 million high-quality samples refined from 2 billion raw samples, using an automated annotation system that keeps the manual involvement rate below 1%. The model already powers Walker S2 to 99.99% operating accuracy in industrial scenarios, advancing the broad adoption of embodied-intelligence technology.

Main functions of Thinker

  • Task planning : Thinker understands complex human instructions, combines them with historical state memory, predicts the robot's future state changes, and decomposes long-horizon tasks into executable sub-task sequences.
  • Spatial understanding : Thinker establishes an egocentric coordinate system with the camera as the origin, defining all spatial relationships so the robot can accurately perceive the position and orientation of objects in three-dimensional space.
  • Temporal understanding : Thinker extracts key information from video history, combines past events with the current instruction, and accurately assesses the present state to make sound timing decisions.
  • Visual grounding : Thinker describes object locations as bounding boxes and precise point coordinates, providing exact spatial guidance for robot grasping and interaction.
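The bounding-box-plus-point output described in the visual grounding bullet can be sketched as follows. The JSON schema, field names, and `parse_grounding` helper here are illustrative assumptions, not Thinker's actual output format:

```python
# Hypothetical sketch of structured grounding output: a bounding box plus a
# precise grasp point for a named object. Schema is assumed for illustration.
import json
from dataclasses import dataclass

@dataclass
class Grounding:
    label: str
    box: tuple    # (x_min, y_min, x_max, y_max) in pixels
    point: tuple  # (x, y) grasp point in pixels

def parse_grounding(raw: str) -> list:
    """Parse a model's JSON response into structured grounding results."""
    items = json.loads(raw)
    return [Grounding(d["label"], tuple(d["box"]), tuple(d["point"]))
            for d in items]

response = '[{"label": "red cup", "box": [312, 180, 420, 310], "point": [366, 245]}]'
targets = parse_grounding(response)
print(targets[0].point)  # -> (366, 245), the point a grasp controller would consume
```

A downstream controller would typically convert such pixel coordinates into the robot's base frame using camera calibration before planning a grasp.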

Thinker’s technical principles

  • Data construction : Thinker builds a complete pipeline from raw data to high-quality training data. Starting from 2 billion noisy, hard-to-align raw samples, the team screens extensively with customized rules, uses large models for multi-dimensional quality scoring, and refines the result into 10 million high-quality samples. An automated labeling system combining large-model-assisted labeling with multi-model cross-validation keeps the manual involvement rate below 1%, reducing labeling costs by 99% and improving efficiency more than 100-fold.
  • Model architecture design : Thinker adopts a classic visual language model architecture with four core modules: a text tokenizer, a visual encoder, a multi-layer perceptron (MLP) alignment layer, and a language model backbone. The architecture achieves a unified representation of vision, language, and time, so the model can accurately capture visual details, understand task instructions, and perform cross-modal reasoning.
  • Training strategy : Thinker adopts a two-stage training method. In the first stage, the model is fine-tuned on general, spatial understanding, and large-scale planning datasets to establish basic perception and reasoning capabilities; the last frame of each video is introduced as an auxiliary input to strengthen video understanding. In the second stage, supervised fine-tuning on an industrial task dataset adapts the model to sequence dependencies, diverse object layouts, and feedback corrections, ultimately producing planning solutions executable in real industrial scenarios.
  • Key innovations : To address robot perspective confusion and missing video information, Thinker proposes a simple and effective method of jointly inputting key frames and the complete video during video-understanding training, significantly strengthening the model's temporal understanding. Through high-quality data screening and task-oriented sampling, the 4B-parameter model matches the performance of models exceeding 10B parameters.
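The rules-then-scoring funnel described under "Data construction" can be sketched as a two-stage filter. The function names, thresholds, and sample fields below are illustrative assumptions, not UBTECH's actual pipeline:

```python
# Minimal sketch of a two-stage data-refinement funnel: cheap rule filters
# first, then a model-based quality score. All names and thresholds are
# assumptions for illustration.
def rule_filter(sample: dict) -> bool:
    """Cheap deterministic checks applied to every raw sample."""
    text = sample.get("text", "")
    return 10 <= len(text) <= 2000 and sample.get("image") is not None

def quality_score(sample: dict) -> float:
    """Stand-in for the large-model multi-dimensional quality scorer."""
    # In the described pipeline, a large model scores each sample on several
    # dimensions; here a precomputed score stands in for that call.
    return sample.get("score", 0.0)

def refine(raw_samples, threshold=0.8):
    """Rules first (cheap), then model scoring (expensive) on survivors."""
    kept = (s for s in raw_samples if rule_filter(s))
    return [s for s in kept if quality_score(s) >= threshold]

raw = [
    {"text": "pick up the wrench on the left", "image": "img1", "score": 0.92},
    {"text": "??", "image": "img2", "score": 0.95},          # fails rule filter
    {"text": "a long but low-quality caption", "image": "img3", "score": 0.40},
]
print(len(refine(raw)))  # -> 1: only the first sample survives both stages
```

Ordering the stages this way is what makes a 2-billion-to-10-million funnel tractable: the expensive model-based scorer only ever sees samples that already passed the cheap rules.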
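The four-module architecture described above (text tokenizer, visual encoder, MLP alignment layer, language model backbone) can be shown as a shape-level data flow. Every module body and dimension here is a placeholder standing in for the real networks:

```python
# Shape-level sketch of a classic VLM forward pass: image patches are encoded,
# projected into the language model's embedding space by an MLP, and consumed
# by the backbone alongside text tokens. All internals are placeholders.
def vision_encoder(image_patches: list) -> list:
    """Map image patches to visual embeddings (stub: 8-dim per patch)."""
    return [[float(p)] * 8 for p in image_patches]

def mlp_projector(visual_embeds: list) -> list:
    """Align visual embeddings to the language embedding space (stub linear map)."""
    return [[2.0 * x for x in v] for v in visual_embeds]

def tokenize(text: str) -> list:
    """Toy whitespace tokenizer standing in for the text tokenizer."""
    return text.split()

def llm_backbone(tokens: list, projected: list) -> str:
    """The backbone consumes interleaved text tokens and projected visual tokens."""
    return f"plan over {len(tokens)} text tokens + {len(projected)} visual tokens"

out = llm_backbone(tokenize("move the box to shelf A"),
                   mlp_projector(vision_encoder([1, 2, 3, 4])))
print(out)  # -> plan over 6 text tokens + 4 visual tokens
```

The key design point this illustrates is that the MLP layer is the only piece translating between modalities: the vision encoder and language backbone can each be pretrained separately and joined through it.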

Thinker’s project address

Thinker application scenarios

  • Industrial intelligent manufacturing : Thinker can drive humanoid robots to complete tasks such as box handling and workpiece sorting on the factory production line. Walker S2 has achieved an operation accuracy of 99.99%, effectively solving the problem of insufficient flexibility of traditional automation equipment.
  • Warehousing and logistics operations : Thinker supports robots to carry out cargo identification, path planning and precise grabbing in a dynamic warehouse environment, adapting to the logistics needs of diversified SKUs and high-frequency changes.
  • Commercial service scenario : Thinker empowers robots to provide guidance, explanation and interactive services in public places such as shopping malls and exhibition halls, achieving natural human-computer interaction through visual language understanding.
  • Complex operational tasks : Thinker enables robots to perform operations that require long-range planning and fine spatial awareness, such as equipment inspections, parts assembly, and multi-step experimental processes.
  • Swarm intelligence collaboration : Thinker serves as a cognitive base supporting UBTECH's group-brain network and the collaborative-intelligence Co-Agent, enabling task distribution, collaborative decision-making, and autonomous evolution across multiple robots.