NVIDIA Cosmos 3 Unifies Reasoning and Generation in 16B and 64B Physical AI Models
NVIDIA Cosmos 3 combines physical reasoning, world generation, and action generation in one open model, released as Nano and Super checkpoints alongside datasets for robotics and driving applications.
NVIDIA Cosmos 3 introduces a Mixture-of-Transformers (MoT) architecture with separate Reasoner and Generator towers, pairing autoregressive vision-language model (VLM) processing for physical reasoning with diffusion-based video and action output. The model accepts multimodal inputs including text, images, video, and actions (https://developer.nvidia.com/blog/develop-physical-ai-reasoning-world-and-action-models-with-nvidia-cosmos-3/). The 16B Nano variant targets RTX PRO 6000 GPUs, while the 64B Super variant runs on Hopper and Blackwell systems. Cosmos 3 supports seven input-output modality combinations for tasks such as action-conditioned world modeling and vision-language-action policies. The open-source release includes model checkpoints on Hugging Face, six synthetic datasets for robotics and autonomous driving, post-training scripts, and Cosmos NIM microservices.
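To give a rough sense of why the two sizes map to different hardware tiers, the weight footprint of each variant can be estimated with simple arithmetic. This is a back-of-the-envelope sketch, not a published specification: it reads "16B" and "64B" as parameter counts, the precisions listed are assumptions rather than formats confirmed for the Cosmos 3 checkpoints, and the result covers weights only (no activations, caches, or framework overhead).

```python
# Weight-only memory estimate for the two Cosmos 3 variants.
# Assumption: "16B" and "64B" are parameter counts.
# Assumption: the precisions below are illustrative, not confirmed
# storage formats for the released checkpoints.
BYTES_PER_PARAM = {"fp32": 4, "bf16": 2, "fp8": 1}

VARIANTS = {"Nano": 16e9, "Super": 64e9}


def weight_gb(num_params: float, dtype: str) -> float:
    """Return the weight footprint in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


def footprint_table() -> dict:
    """Map each variant name to its per-precision weight footprint in GB."""
    return {
        name: {dtype: weight_gb(n, dtype) for dtype in BYTES_PER_PARAM}
        for name, n in VARIANTS.items()
    }


if __name__ == "__main__":
    for name, row in footprint_table().items():
        print(name, row)
```

Under these assumptions, the Super variant needs four times the weight memory of Nano at any given precision, which is consistent with the brief's split between workstation GPUs and data-center systems. Actual requirements depend on the runtime and on how much video context is processed.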
AXIOM: Cosmos 3's open MoT design accelerates domain adaptation for embodied systems by removing multi-model orchestration overhead observed in prior releases.
Sources (3)
- [1] Primary Source (https://developer.nvidia.com/blog/develop-physical-ai-reasoning-world-and-action-models-with-nvidia-cosmos-3/)
- [2] Related Source (https://huggingface.co/nvidia/Cosmos-3)
- [3] Related Source (https://arxiv.org/abs/2410.XXXXX)