Building a working robot today means stitching together at least five separate AI models. One sees the world. One reasons about it. One predicts what happens next. One figures out the actions. One generates training data to start over. The handoff between each model adds latency, complexity, and a thousand ways for things to break.

NVIDIA just collapsed all five into a single model. Cosmos 3, launched at GTC Taipei on May 31, is the first open omnimodel that natively handles vision reasoning, world generation, and robot action prediction in one architecture. No more piping outputs between specialist models. No more debugging which model in the chain silently corrupted the reasoning.


The two-tower architecture that changes everything

Cosmos 3 uses a Mixture-of-Transformers design with two towers that talk to each other through shared attention:

The Reasoner Tower is an autoregressive vision-language model. It takes camera feeds, text instructions, or both, and builds a structured understanding of the scene. Where is the object? What does the robot need to do? What obstacles exist?

The Generator Tower is a diffusion-based transformer. It takes the Reasoner's output and produces future video frames, synthetic worlds, or joint-angle trajectories for a robot arm to execute.
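To make "two towers that talk through shared attention" concrete, here is a minimal NumPy sketch of one joint attention step. It is an illustration under stated assumptions, not NVIDIA's implementation: the function names, the single attention head, the weight shapes, and the exact masking pattern (causal among Reasoner tokens, Generator tokens seeing everything) are all hypothetical choices made for this example.

```python
import numpy as np

def softmax(scores):
    """Row-wise softmax that tolerates -inf entries produced by masking."""
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    return weights / weights.sum(axis=-1, keepdims=True)

def init_tower(rng, dim):
    """Hypothetical per-tower weights: each tower owns its own projections."""
    scale = dim ** -0.5
    names = ("wq", "wk", "wv", "wo")
    return {name: rng.normal(scale=scale, size=(dim, dim)) for name in names}

def build_mask(n_reasoner, n_generator):
    """True where a query token (row) may attend to a key token (column)."""
    n = n_reasoner + n_generator
    mask = np.zeros((n, n), dtype=bool)
    # Assumption: Reasoner tokens attend causally, and only to each other.
    causal = np.tril(np.ones((n_reasoner, n_reasoner), dtype=bool))
    mask[:n_reasoner, :n_reasoner] = causal
    # Assumption: Generator tokens see every Reasoner and Generator token.
    mask[n_reasoner:, :] = True
    return mask

def two_tower_attention(reasoner_x, generator_x, params, mask):
    """Per-tower projections, then one attention over the joint sequence."""
    towers = ((reasoner_x, params["reasoner"]), (generator_x, params["generator"]))
    q = np.concatenate([x @ p["wq"] for x, p in towers], axis=0)
    k = np.concatenate([x @ p["wk"] for x, p in towers], axis=0)
    v = np.concatenate([x @ p["wv"] for x, p in towers], axis=0)
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = np.where(mask, scores, -np.inf)  # block disallowed pairs
    mixed = softmax(scores) @ v
    n_reasoner = reasoner_x.shape[0]
    # Each tower maps its own slice back out with its own output weights.
    return (mixed[:n_reasoner] @ params["reasoner"]["wo"],
            mixed[n_reasoner:] @ params["generator"]["wo"])
```

With this mask the information flow is one-directional: changing the Generator's tokens leaves the Reasoner's output untouched, while changing the Reasoner's tokens changes what the Generator produces.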

The key constraint: the Generator is strictly conditioned on the Reasoner. Because it has to build on the Reasoner's grounded understanding first, its outputs are steered toward physically plausible results instead of being generated unconstrained. Both towers share a multi-dimensional (3D) rotary position embedding (mRoPE), so spatial-temporal consistency is enforced at the architecture level, not bolted on as a post-processing step.
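For readers who have not met rotary embeddings, the sketch below shows the general idea of a 3D variant in plain Python. It is not taken from the Cosmos 3 code. The equal three-way split of the feature vector across time, height, and width, the base of 10000, and the function names are assumptions chosen for illustration.

```python
import math

def rope_1d(vec, pos, base=10000.0):
    """Rotate consecutive pairs of `vec` by angles proportional to `pos`."""
    dim = len(vec)
    out = []
    for i in range(0, dim, 2):
        theta = pos * base ** (-i / dim)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out += [x * c - y * s, x * s + y * c]
    return out

def rope_3d(vec, pos_thw, base=10000.0):
    """Apply 1D rotary embedding per axis on three equal chunks (assumed split)."""
    dim = len(vec)
    if dim % 6 != 0:
        raise ValueError("dimension must split into three even-sized chunks")
    chunk = dim // 3
    out = []
    for axis, pos in enumerate(pos_thw):  # axis 0 = time, 1 = height, 2 = width
        out += rope_1d(vec[axis * chunk:(axis + 1) * chunk], pos, base)
    return out

def dot(a, b):
    """Plain dot product, standing in for an attention score."""
    return sum(x * y for x, y in zip(a, b))
```

The useful property is that the score between a rotated query and a rotated key depends only on their offset in (time, height, width), not on their absolute positions. Two tokens one frame apart relate the same way at the start of a clip as at the end.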

This is a meaningful departure from Cosmos 2, which was primarily a world generation model. Cosmos 3 adds native action generation, closing the loop from perception to physical execution inside a single model.

Five modes, one set of weights

The same Cosmos 3 model can be configured for five different use cases without changing its parameters:

  1. Vision Language Mode: Feed it video and text, get scene descriptions and reasoning about what's happening.
  2. World Model Mode: Generate physically plausible video sequences from text prompts or reference images.
  3. Forward Dynamics Mode: Given a current image and a proposed action, predict what the next frame will look like.
  4. Inverse Dynamics Mode: Watch a demonstration video and extract the exact action trajectories that produced it.
  5. Policy Mode: The full loop. Give it a goal, and it outputs both a predicted world rollout and the specific joint-angle commands a robot needs to execute.
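One way to keep the five modes straight is to write them down as input and output signatures. The table below is a reading of the list above, not an API: the `Mode`, `ModeSpec`, and `modes_producing` names and the modality labels are hypothetical, and the real model's interface may look quite different.

```python
from dataclasses import dataclass
from enum import Enum

class Mode(Enum):
    VISION_LANGUAGE = "vision_language"
    WORLD_MODEL = "world_model"
    FORWARD_DYNAMICS = "forward_dynamics"
    INVERSE_DYNAMICS = "inverse_dynamics"
    POLICY = "policy"

@dataclass(frozen=True)
class ModeSpec:
    inputs: tuple   # modalities the mode accepts
    outputs: tuple  # modalities the mode produces

# Hypothetical summary of the five modes described in the article.
MODE_SPECS = {
    Mode.VISION_LANGUAGE: ModeSpec(("video", "text"), ("text",)),
    Mode.WORLD_MODEL: ModeSpec(("text", "image"), ("video",)),  # text or image prompt
    Mode.FORWARD_DYNAMICS: ModeSpec(("image", "action"), ("image",)),
    Mode.INVERSE_DYNAMICS: ModeSpec(("video",), ("action",)),
    Mode.POLICY: ModeSpec(("goal",), ("video", "action")),
}

def modes_producing(modality):
    """Return the modes whose outputs include the given modality."""
    return [mode for mode, spec in MODE_SPECS.items() if modality in spec.outputs]
```

Calling `modes_producing("action")` returns only the Inverse Dynamics and Policy modes, the two that emit robot actions.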

That last mode is the one that matters most for anyone building physical AI. A single model that can go from "pick up the red cup" to motor commands, without any intermediate handoffs, makes for a completely different development workflow.
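If forward dynamics, inverse dynamics, and policy still blur together, a toy example helps. The sketch below swaps the learned model for a one-dimensional point that moves at a commanded velocity. Everything in it, including the function names, the time step, and the speed limit, is an assumption for illustration and says nothing about how Cosmos 3 computes these quantities.

```python
DT = 0.1  # assumed time step in seconds

def forward_dynamics(state, action, dt=DT):
    """Given the current position and a velocity command, predict the next position."""
    return state + action * dt

def inverse_dynamics(states, dt=DT):
    """Given an observed trajectory, recover the velocity commands that produced it."""
    return [(nxt - cur) / dt for cur, nxt in zip(states, states[1:])]

def policy(state, goal, steps, max_speed=1.0, dt=DT):
    """Return both the commands and the predicted rollout toward a goal position."""
    actions, rollout = [], [state]
    for _ in range(steps):
        # Move toward the goal as fast as allowed without overshooting it.
        wanted = (goal - rollout[-1]) / dt
        action = max(-max_speed, min(max_speed, wanted))
        actions.append(action)
        rollout.append(forward_dynamics(rollout[-1], action, dt))
    return actions, rollout
```

The policy returns two things at once, the commands and the predicted rollout, which is the same pairing Policy Mode is described as producing. Running `inverse_dynamics` on that rollout recovers the commands, which is the relationship a demonstration-to-action extractor relies on.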

Three sizes for three hardware tiers

Tier  | Parameters      | Target Hardware               | Status
Super | 64B (32B + 32B) | Hopper / Blackwell datacenter | Available now
Nano  | 16B (8B + 8B)   | RTX PRO 6000 workstation      | Available now
Edge  | 4B              | Jetson embedded devices       | Coming soon

The Nano tier is the one most teams should prototype with. At 16B parameters, it runs on a single workstation GPU and covers the full five-mode functionality. The Super tier is for large-scale synthetic data generation where you need massive batch throughput. The Edge tier at 4B is aimed at real-time inference on embedded devices, but it has not shipped yet and NVIDIA has not given a release date.
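A quick back-of-the-envelope check shows why the tiers map to those hardware classes. The sketch below counts only the memory needed to hold the weights. The precisions listed are common choices, not formats NVIDIA has confirmed for these checkpoints, and activations, caches, and framework overhead are left out, so treat the results as lower bounds.

```python
# Bytes needed to store one parameter at common precisions.
BYTES_PER_PARAM = {"fp32": 4, "bf16": 2, "fp8": 1}

# Parameter counts per tier, as stated in the article.
TIERS = {
    "Super": 64_000_000_000,
    "Nano": 16_000_000_000,
    "Edge": 4_000_000_000,
}

def weight_gb(n_params, dtype="bf16"):
    """Memory for the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return n_params * BYTES_PER_PARAM[dtype] / 1e9

for tier, n_params in TIERS.items():
    print(f"{tier}: {weight_gb(n_params):.0f} GB of weights at bf16")
```

At 2 bytes per parameter, Nano's 16B parameters come to 32 GB of weights and Super's 64B come to 128 GB. That gap is the difference between a single workstation card and datacenter hardware.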

Benchmark performance

NVIDIA claims Cosmos 3 leads among open models across multiple physical AI benchmarks. Their numbers (vendor-stated, not independently verified):

Benchmark           | What it measures               | Cosmos 3 result
Physics-IQ          | Understanding of physical laws | First among open models
PAI-Bench           | Physical AI reasoning          | First among open models
Artificial Analysis | World generation accuracy      | First among open models
RoboLab / RoboArena | Action policy quality          | First among open models
VANTAGE-Bench       | Vision understanding           | First among open models

"First among open models" is doing a lot of work in those claims. NVIDIA is comparing against other open-weight models, not against proprietary systems like Google's internal robotics models or OpenAI's vision APIs. That is the relevant comparison for the target audience (teams building with open-source tooling), but the framing matters.

NVIDIA also introduced a Cosmos Human Evaluation framework for cases where automated benchmarks struggle to differentiate high-performing models. This suggests they are aware the leaderboard numbers alone do not tell the full story.

The open-source play is the real story

The technical architecture is interesting. The licensing and ecosystem strategy is what actually moves the industry.

The OpenMDW-1.1 license from the Linux Foundation permits commercial use, modification, and redistribution. Products using Cosmos 3 must display "Built on NVIDIA Cosmos," but there are no revenue-sharing requirements, no usage caps, and no API lock-in. Compare this to Google's robotics models (internal only) or OpenAI's vision APIs (per-token pricing).

Six synthetic datasets shipped alongside the model: Embodied-Robot-Scenes, Physical-Interaction-Scenes, Spatial-Reasoning, Digital-Human-Scenes, Autonomous-Driving-Scenarios, and Warehouse-Operations-Scenes. These are generated by NVIDIA's own Cosmos pipeline and released under the same open license. For teams that cannot afford to build world simulation infrastructure from scratch, these datasets are the real on-ramp.

Cosmos Coalition members include Agile Robots, Black Forest Labs, Generalist, LTX, Runway, and Skild AI. The coalition is structured to share evaluation techniques and training tools through NVIDIA DGX Cloud. This is not just a partner list for a press release. These are companies that are actively contributing to and building on the Cosmos architecture.

Production adopters: LG Electronics, Samsung, Doosan Robotics, Li Auto, and Skild AI are named as current users. These are not proof-of-concept partnerships. Li Auto is using it for autonomous driving simulation. Samsung and LG are integrating it into robotics workflows.

Don't confuse it with earlier Cosmos releases

The Cosmos lineage is getting crowded. Here is what shipped when:

  • Cosmos 1 (January 2025): World generation model. Text-to-video for physical AI simulation. No action generation.
  • Cosmos Predict 2 / 2.5 (2025): Improved world prediction. Added image-to-video. Still no native action output.
  • Cosmos Reason (2025): Vision-language model for scene understanding. Separate from the generation pipeline.
  • Cosmos 3 (May 2026): Unified omnimodel. Merges reasoning, generation, AND action into one architecture. This is the first version where a single model can output robot motor commands.

If someone tells you "we tried Cosmos and it did not work for robotics," ask them which version. Cosmos 1 and 2 were world generation tools. Cosmos Reason was a perception model. Cosmos 3 is the first version that actually closes the perception-to-action loop.

Pricing and deployment

Cosmos 3 is free to download from Hugging Face. NVIDIA NIM microservices provide managed deployment on Azure, CoreWeave, Baseten, Nebius, and Deep Infra for teams that do not want to self-host. You can also try it GPU-free at build.nvidia.com.

The toolchain includes Cosmos Curator (data filtering), Cosmos Evaluator (scoring), NVIDIA TAO 7 (fine-tuning), and the Cosmos Cookbook (recipes and examples). All open source.


What surprised me

The most interesting thing about Cosmos 3 is not any single benchmark number. It is that NVIDIA chose to release this at all.

World models for physical AI have been among the most expensive, hardest-to-build AI systems in existence. They require massive simulation infrastructure, petabytes of physical interaction data, and teams of specialized researchers. By open-sourcing the model, the training recipes, and six synthetic datasets, NVIDIA is essentially giving away the infrastructure that used to be a competitive moat.

The reason makes sense if you think about NVIDIA's actual business. They sell GPUs. The more teams building physical AI, the more Hopper and Blackwell clusters they sell. Cosmos 3 is not charity. It is a market expansion strategy that happens to benefit everyone who wants to build robots without a $100M compute budget.

The Cosmos Coalition formalizes this. Every member company contributes back to the ecosystem while also being locked into NVIDIA's hardware and DGX Cloud infrastructure. It is a classic platform play: give away the software, sell the compute.

What remains to be seen is whether the "first among open models" performance translates to production quality. The benchmarks are vendor-reported, and automated evaluation of physical AI is notoriously unreliable. NVIDIA's introduction of their own Human Evaluation framework suggests they know this. The real test is whether teams like Li Auto and Samsung can actually deploy Cosmos 3-powered systems in production, not just run benchmark suites.

For anyone building robotics or autonomous vehicle pipelines today: the five-model stack is officially on life support. Cosmos 3 does not guarantee that the single-model approach works, but it guarantees that the alternative exists. And the alternative is free.