InternVL-U - An open-source unified multimodal model from Shanghai AI Lab and partner universities

InternVL-U is a lightweight, 4B-parameter unified multimodal model open-sourced by Shanghai Artificial Intelligence Laboratory together with several leading universities. It is described as the first model to close the end-to-end loop of understanding, reasoning, generation, and editing. It combines three core designs (unified context modeling, modality-specific modularization, and decoupled visual representations) to address the high training cost and uneven capabilities of earlier unified models. The model is reported to outperform 14B-scale models in complex scenarios such as text rendering, scientific reasoning, and spatial modeling, and its score of 22.9 on the GenExam scientific image generation benchmark is the highest among open-source unified models. It offers an efficient, flexible multimodal option for research and education, smart office work, creative content, and similar scenarios.

Main functions of InternVL-U

Multimodal understanding: Accurately analyzes the visual information in images and answers complex user questions about them.
Logical reasoning: Uses chain-of-thought reasoning to break abstract natural-language instructions down into concrete, executable steps.
Image generation: Generates high-fidelity, semantically accurate, and aesthetically pleasing images from text descriptions.
Image editing: Precisely modifies a specified region of an image while preserving the original background texture and lighting.
Text rendering: Renders Chinese, English, numbers, and mathematical symbols accurately, avoiding glyph distortion and spelling errors.
Scientific visualization: Draws molecular structures, algorithm flowcharts, and other professional research illustrations that follow the conventions of each discipline.
Spatial modeling: Performs three-dimensional geometric operations, CAD multi-view conversion, and rotation of three-dimensional objects to arbitrary angles.
Fun content creation: Quickly generates playful, creative content suited to online sharing, such as emoticons and memes.

Technical principles of InternVL-U

Decoupled visual representations: InternVL-U uses an asymmetric visual representation strategy. For understanding tasks, a pre-trained ViT extracts high-level semantic features so that complex scenes are understood accurately. For generation tasks, a separate VAE compresses images into a latent space that preserves pixel-level detail. This avoids optimization conflicts between semantic understanding and image reconstruction, allowing the model to maintain leading performance on both understanding and generation benchmarks.
Dual-stream MMDiT generation head: The visual generation head uses a dual-stream structure that processes multimodal context features and image latent features separately. A sigmoid-gated attention mechanism adjusts the weights to alleviate performance degradation in long-context scenarios, unified MSRoPE three-dimensional positional encoding preserves spatial structure, and multi-resolution generation from 512 to 1024 pixels is supported to avoid stitching artifacts at high resolutions.
Three-stage progressive training: The model is trained in three stages: pre-training, continued pre-training, and fine-tuning. In the first stage, the backbone is frozen and only the generation head is trained, so that it learns to condition on multimodal context. In the second stage, the backbone stays frozen while the model is trained for multi-resolution generation on samples selected for high aesthetic quality. In the third stage, the full model is unfrozen and trained with chain-of-thought data so that understanding, reasoning, and generation work together.
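
The three-stage schedule described above can be sketched as a small freeze plan. This is only an illustration under stated assumptions, not the project's training code: the module names `backbone` and `generation_head`, the `Stage` class, and the helper functions are hypothetical, and the data descriptions simply paraphrase the text above.

```python
from dataclasses import dataclass

# Hypothetical module names: the article only distinguishes the backbone
# from the visual generation head.
MODULES = ("backbone", "generation_head")


@dataclass(frozen=True)
class Stage:
    name: str
    trainable: frozenset  # modules whose weights are updated in this stage
    data: str             # data emphasis, paraphrased from the article


# Assumed mapping of the three stages to trainable modules.
STAGES = (
    Stage("pre-training", frozenset({"generation_head"}),
          "generation conditioned on multimodal context"),
    Stage("continued pre-training", frozenset({"generation_head"}),
          "multi-resolution generation, high-aesthetic samples"),
    Stage("fine-tuning", frozenset(MODULES),
          "chain-of-thought data"),
)


def frozen_modules(stage: Stage) -> list:
    """Return the modules that stay frozen during the given stage."""
    return [m for m in MODULES if m not in stage.trainable]


def describe(stages=STAGES) -> list:
    """Summarize each stage as (name, trainable modules, frozen modules)."""
    return [(s.name, sorted(s.trainable), frozen_modules(s)) for s in stages]


if __name__ == "__main__":
    for name, trainable, frozen in describe():
        print(f"{name}: train={trainable} frozen={frozen}")
```

In a framework such as PyTorch, freezing a module usually means setting `requires_grad = False` on its parameters, so a plan like this would be applied once at the start of each stage.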

InternVL-U project address

GitHub repository: https://github.com/OpenGVLab/InternVL-U
HuggingFace model library: https://huggingface.co/InternVL-U/InternVL-U
arXiv technical paper: https://arxiv.org/pdf/2603.09877

Application scenarios of InternVL-U

Scientific research and education: Provides researchers and students with professional visual content such as molecular structures, algorithm flowcharts, and force analysis diagrams, for use in teaching demonstrations and paper illustrations.
Smart office: Automates document generation, batch poster editing, and simultaneous modification of text in multiple regions, speeding up the production of business documents and marketing materials.
Creative design: Helps designers quickly produce high-fidelity concept drawings, stylized images, and multi-resolution visual assets, lowering the barrier to professional design.
Content operations: Helps new-media operators generate playful content such as emoticons and memes with one click, suited to sharing on social media.
Industrial manufacturing: Performs CAD multi-view conversion, three-dimensional geometric operations, and three-dimensional object rotation to assist engineering design and product prototype visualization.
