Solaris - A multi-user video world generation model open-sourced by Xie Saining's research team

Solaris is the first multiplayer video world generation model: it simultaneously generates consistent first-person views for two players in Minecraft. Breaking the single-player limitation of existing world models, it keeps the players' perspectives spatially consistent: when one player builds or moves, the other player's view reflects the change synchronously. The team built its own SolarisEngine data system to collect 12.6 million frames of multiplayer gameplay, and introduced the Checkpointed Self Forcing training method to overcome the memory bottleneck of long-sequence generation.

Main features of Solaris

  • Synchronous multi-view generation: Solaris generates consistent first-person video for two players simultaneously, keeping their perspectives spatially consistent. When one player performs an action, the change appears in the other player's view in real time.
  • Stable long-sequence generation: Through Checkpointed Self Forcing, Solaris generates stable video sequences of up to 224 frames (11.2 seconds), avoiding the visual degradation caused by accumulated autoregressive error.
  • Action-conditioned control: The model accepts the full Minecraft action input (movement, camera, digging, placement, etc.), and the generated video strictly follows the given action sequence.
  • Complex dynamic simulation: Solaris simulates complex game dynamics such as inventory-state synchronization, weather changes, physical construction and destruction, and PvP battles.
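
Action-conditioned control means the generator consumes a fixed-length track of per-frame control inputs. A minimal sketch of what such a track might look like, assuming a hypothetical `PlayerAction` schema (the field names are illustrative, not Solaris's actual API):

```python
from dataclasses import dataclass

@dataclass
class PlayerAction:
    """One frame of per-player control input (hypothetical schema)."""
    move: tuple = (0.0, 0.0)    # forward/strafe in [-1, 1]
    camera: tuple = (0.0, 0.0)  # yaw/pitch deltas
    dig: bool = False
    place: bool = False

def action_sequence(num_frames, per_frame):
    """Build a fixed-length action track the generator must follow."""
    return [per_frame(t) for t in range(num_frames)]

# Example: one player walks forward while panning the camera right,
# for the 224 frames (11.2 s) that Solaris can generate stably.
track = action_sequence(224, lambda t: PlayerAction(move=(1.0, 0.0),
                                                    camera=(0.5, 0.0)))
```

In the two-player setting, one such track would be supplied per player, and the model is expected to render both resulting views consistently.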

Technical principles of Solaris

  • Multiplayer DiT architecture: Building on the single-player diffusion Transformer of MatrixGame 2.0, Solaris expands the action space to support the full Minecraft input, introduces a cross-player self-attention layer so the two players' views can exchange information, and adds a player-ID embedding to distinguish the two perspectives. The remaining modules (cross-attention, FFN) keep their single-player settings.
  • Four-stage progressive training: Starting from single-player pre-training weights, the model is first fine-tuned on the VPT dataset to adapt to the Minecraft action space, then switched to multiplayer data to train a bidirectional model as a teacher, then converted into a causal sliding-window generator, and finally stabilized for long sequences via Checkpointed Self Forcing.
  • Checkpointed Self Forcing: To overcome the memory bottleneck of sliding-window autoregression, the method first generates and caches clean frames and noise states without gradients, then strictly reproduces the sliding-window dependencies in a single parallel recomputation with a customized attention mask. This substantially reduces activation memory and also supports gradient backpropagation through the KV cache, improving generation quality.
  • SolarisEngine data system: Because existing frameworks lack multiplayer support, the team built an architecture that separates the controller (based on Mineflayer) from the camera (the official Minecraft client), synchronized state in real time through a server plugin, and used Docker containerization for parallel scaling and automatic fault recovery, ultimately collecting 12.6 million frames of action-labeled multiplayer gameplay.
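
The cross-player attention idea described above can be sketched in a few lines of PyTorch: tokens from both players' views are tagged with a learned player-ID embedding, concatenated along the sequence axis, and attended jointly so every token can see both perspectives. All shapes and hyperparameters here are illustrative, not Solaris's actual configuration:

```python
import torch
import torch.nn as nn

class CrossPlayerAttention(nn.Module):
    """Sketch of joint self-attention over two players' view tokens."""
    def __init__(self, dim=64, heads=4, players=2):
        super().__init__()
        self.player_emb = nn.Embedding(players, dim)  # distinguishes views
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, views):
        # views: (batch, players, tokens, dim)
        b, p, t, d = views.shape
        ids = torch.arange(p, device=views.device)
        x = views + self.player_emb(ids)[None, :, None, :]  # tag each view
        x = x.reshape(b, p * t, d)    # one joint sequence over both views
        out, _ = self.attn(x, x, x)   # every token attends to both players
        return out.reshape(b, p, t, d)

x = torch.randn(2, 2, 16, 64)         # batch of 2, two players, 16 tokens
y = CrossPlayerAttention()(x)
```

Keeping the other modules (cross-attention, FFN) in their single-player form, as the architecture bullet notes, confines the multiplayer change to this one added layer plus the embedding.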
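
The memory trade-off behind Checkpointed Self Forcing is the classic one of gradient checkpointing: discard intermediate activations in the forward pass and recompute them during backward. A toy sliding-window rollout showing that trade-off, using step-by-step `torch.utils.checkpoint` rather than Solaris's single parallel recomputation with a custom attention mask (the step function and window logic are illustrative only):

```python
import torch
from torch.utils.checkpoint import checkpoint

# Toy autoregressive step: next state = f(condensed window of past states).
step = torch.nn.Linear(8, 8)

def rollout(x0, steps, window=4):
    """Sliding-window rollout with per-step gradient checkpointing:
    activations are dropped in the forward pass and recomputed on
    backward, trading compute for memory."""
    states = [x0]
    for _ in range(steps):
        ctx = torch.stack(states[-window:]).mean(0)  # condense the window
        states.append(checkpoint(step, ctx, use_reentrant=False))
    return torch.stack(states)

traj = rollout(torch.randn(8, requires_grad=True), steps=32)
traj.sum().backward()  # gradients flow through the recomputed steps
```

The real method differs in that the recomputation happens once, in parallel across all steps, with an attention mask that exactly reproduces the sliding-window dependencies, and gradients also flow through the KV cache.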

Solaris project address

Solaris application scenarios

  • Embodied intelligence training and evaluation: As a multi-agent world simulator, Solaris provides synthetic training data for robots and game AI, supporting policy learning, inference-time planning, and safety evaluation while avoiding the high cost of trial and error in real environments.
  • Multi-agent collaboration research: Solaris simulates multiplayer collaborative tasks (such as joint construction and team combat), useful for training the collaboration and communication abilities of AI agents and for studying emergent behavior and social intelligence.
  • Vision-language-action model development: The model can generate large-scale multi-view video-action-language aligned data, supporting pre-training and fine-tuning of VLA models and compensating for the scarcity of real multi-person interaction data.
  • 3D scene understanding and spatial reasoning benchmark: As a controllable testbed, Solaris evaluates models on core 3D understanding capabilities such as perspective consistency, object persistence, and spatial memory.