US Open-Weight AI Finally Has a Fast One

Every US open-weight model released this year has had the same problem: it's either smart or fast, never both. Chinese labs like DeepSeek and Moonshot shipped models that run at 50-100 tokens per second. US equivalents crawled at half that. NVIDIA just fixed it.

Nemotron 3 Ultra is a 550B-parameter model that serves over 300 tokens per second. It's the fastest US open-weight model at its intelligence level, and it's not close.

Architecture: Why Mamba Changes Everything for Agents

The standard Transformer has a dirty secret for agent workloads: every new token has to attend to everything that came before it. Your agent calls a tool, gets 2,000 tokens back, and every later decode step pays for those tokens again through a larger KV cache. As context grows, the cost of each new token grows linearly, and the cost of the whole conversation grows quadratically.

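Nemotron 3 Ultra uses a hybrid Mamba-Attention stack. Mamba's state-space layers scale linearly with sequence length: they carry a fixed-size state, so their per-step decode cost stays constant as the context grows. The remaining attention layers still pay for the full context, but there are fewer of them than in a pure Transformer of the same depth. This is why throughput gains widen on long, decode-heavy workloads.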
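To make the scaling argument concrete, here is a toy cost model, not a measurement of Nemotron itself. The function names and the `state_dim=128` value are assumptions for illustration; only the 8192 model dimension comes from the spec. It counts rough multiply-adds for one decode step in a single attention layer versus a single state-space layer.

```python
def attention_decode_cost(context_len: int, d_model: int = 8192) -> int:
    """Rough per-token work for one attention layer during decode:
    one score and one weighted sum against every cached position."""
    return 2 * context_len * d_model


def ssm_decode_cost(context_len: int, d_model: int = 8192, state_dim: int = 128) -> int:
    """Rough per-token work for one state-space layer during decode:
    update and read a fixed-size state. context_len is accepted only
    to make the point that it does not enter the cost."""
    return 2 * d_model * state_dim


for n in (1_000, 10_000, 100_000, 1_000_000):
    ratio = attention_decode_cost(n) / ssm_decode_cost(n)
    print(f"{n:>9} tokens in context: attention/SSM cost ratio = {ratio:,.1f}")
```

In this simplified picture the ratio grows in step with the context length, which is the shape of the argument above. Real kernels, batching, and the attention layers that remain in the hybrid stack all change the absolute numbers.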

The full spec:

550B total parameters, 55B active per token
108 layers, 8192 model dimension
512 experts per layer, top 22 activated (about 4% of experts per token; 90% sparsity by parameter count)
64 query heads, 2 key-value heads
1 million token context window
Pre-trained on 20 trillion tokens

The MoE routing means the model is big in knowledge but small in compute per token. 55B active is comparable to a large standalone model, but the full 550B gives it far more specialized knowledge across domains.
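The numbers in the spec can be checked with a few lines of arithmetic, and top-k routing itself is simple to sketch. This is an illustration under assumptions: the `route` helper and the random router scores are hypothetical, and the real router is a learned layer whose details are not described here.

```python
import heapq
import random

TOTAL_PARAMS_B = 550   # from the spec above
ACTIVE_PARAMS_B = 55   # from the spec above
NUM_EXPERTS = 512      # from the spec above
TOP_K = 22             # from the spec above

active_param_fraction = ACTIVE_PARAMS_B / TOTAL_PARAMS_B
active_expert_fraction = TOP_K / NUM_EXPERTS


def route(router_scores, k=TOP_K):
    """Return the indices of the k highest-scoring experts for one token."""
    return heapq.nlargest(k, range(len(router_scores)), key=router_scores.__getitem__)


random.seed(0)
scores = [random.random() for _ in range(NUM_EXPERTS)]  # stand-in for learned router outputs
chosen = route(scores)

print(f"active parameter fraction: {active_param_fraction:.1%}")
print(f"active expert fraction:    {active_expert_fraction:.1%}")
print(f"experts used for this token: {len(chosen)} of {NUM_EXPERTS}")
```

The two fractions differ because the active parameter count also includes everything that runs for every token, such as attention, Mamba layers, and shared experts.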

Training: Ten Teachers, One Student

NVIDIA didn't just throw compute at this model. They used Multi-teacher On-Policy Distillation (MOPD), training 10+ domain-specialized teacher models. The student generates rollouts across domains, and the teachers score every token, giving dense guidance rather than sparse rewards.
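The difference between dense per-token guidance and a sparse reward is easier to see in code. The sketch below is a guess at the general shape, not NVIDIA's implementation: it assumes the per-token signal is a reverse KL divergence between student and teacher distributions, which is a common choice for on-policy distillation but is not confirmed by anything above. All function names are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def per_token_reverse_kl(student_logits, teacher_logits):
    """KL(student || teacher) at one position of a student-generated rollout."""
    p = softmax(student_logits)
    q = softmax(teacher_logits)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def dense_distillation_signal(student_rollout, teacher_rollout):
    """One score per token (dense), plus the mean as a scalar loss.
    A sparse reward would instead give a single number for the whole rollout."""
    per_token = [per_token_reverse_kl(s, t) for s, t in zip(student_rollout, teacher_rollout)]
    return per_token, sum(per_token) / len(per_token)


# Three positions, vocabulary of four tokens; the student disagrees at position 2.
student = [[2.0, 0.1, 0.1, 0.1], [0.1, 2.0, 0.1, 0.1], [0.1, 0.1, 2.0, 0.1]]
teacher = [[2.0, 0.1, 0.1, 0.1], [0.1, 2.0, 0.1, 0.1], [0.1, 0.1, 0.1, 2.0]]
per_token, loss = dense_distillation_signal(student, teacher)
print([round(x, 3) for x in per_token], round(loss, 3))
```

The per-token list pinpoints where the student drifted from the teacher, which a single end-of-rollout reward cannot do.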

The training hit two loss divergences, one at 8 trillion tokens and another at 16 trillion. The first was fixed by reverting gradient reduction from BF16 to FP32. The second required early learning rate annealing. Both are documented in the technical report, which is unusual transparency for a model release.
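Why would the precision of gradient reduction matter? A toy example helps, with the caveat that it illustrates a general BF16 property and is not a reconstruction of what went wrong in this training run. BF16 keeps only 8 significant bits, so small contributions added to a larger running total can be rounded away entirely. The helpers below are hypothetical and emulate BF16 rounding with the standard library.

```python
import struct


def to_fp32(x: float) -> float:
    """Round a Python float to the nearest float32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def to_bf16(x: float) -> float:
    """Emulate BF16 by rounding a float32 to its top 16 bits (round to nearest even)."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits = (bits + 0x7FFF + ((bits >> 16) & 1)) & 0xFFFF0000
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def reduce_sum(values, round_fn):
    """Sequentially sum values, rounding the running total after every add."""
    total = 0.0
    for v in values:
        total = round_fn(total + round_fn(v))
    return total


# One large gradient contribution followed by many small ones.
grads = [1.0] + [0.001] * 1000
bf16_sum = reduce_sum(grads, to_bf16)
fp32_sum = reduce_sum(grads, to_fp32)
print(f"BF16 reduction: {bf16_sum}")
print(f"FP32 reduction: {fp32_sum}")
```

In the BF16 run the thousand small values never register, because each one is smaller than half the spacing between representable numbers near 1.0. Real reductions use tree or ring patterns rather than a sequential loop, so the effect is less extreme in practice.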

Post-training includes Supervised Fine-Tuning, Reinforcement Learning with Verifiable Rewards (RLVR), and the MOPD pass. The result is a model that maintains consistent behavior across different agent frameworks, from OpenHands to custom pipelines.

Benchmarks: Smart Enough, Fast Enough

Nemotron 3 Ultra scores 48 on the Artificial Analysis Intelligence Index. That makes it the most intelligent US open-weight model. It's behind Kimi K2.6 at 54, but ahead of Gemma 4 31B at 39 and Nemotron 3 Super at 36.

The speed numbers are where it pulls away:

Metric	Nemotron 3 Ultra	Typical Chinese Open Models
Tokens/second	300+	50-100
Active params	55B	Varies
Context window	1M tokens	128K-1M
Hardware requirement	Single 8-GPU H100 node	Multi-node

SWE-Bench Verified: 71.9. RULER at 1M context: 94.7. IOI 2025 competitive programming: 570.0, competitive with top-3 human-level results.

The model supports three reasoning modes. "Medium-effort" uses 2.5x fewer tokens with only a 7% accuracy trade-off. For cost-sensitive agent loops, this is the mode you'd actually use.
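Here is what the token reduction means for wall-clock time per agent step. The 10,000-token budget is a made-up figure for illustration, and the function name is hypothetical; the 2.5x reduction and the 300 tokens per second come from the numbers above.

```python
def task_latency_seconds(output_tokens: float, tokens_per_second: float = 300.0) -> float:
    """Decode time for one response, ignoring prefill and network overhead."""
    return output_tokens / tokens_per_second


FULL_EFFORT_TOKENS = 10_000   # hypothetical reasoning budget for one agent step
TOKEN_REDUCTION = 2.5         # medium-effort figure quoted above

medium_tokens = FULL_EFFORT_TOKENS / TOKEN_REDUCTION
full_latency = task_latency_seconds(FULL_EFFORT_TOKENS)
medium_latency = task_latency_seconds(medium_tokens)

print(f"full effort:   {FULL_EFFORT_TOKENS:>6.0f} tokens, {full_latency:5.1f} s")
print(f"medium effort: {medium_tokens:>6.0f} tokens, {medium_latency:5.1f} s")
```

Across an agent loop with hundreds of steps, that per-step saving is what decides whether a run finishes in minutes or hours.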

Deployment: Single Node, No Compromise

The model ships as a single NVFP4 checkpoint. On Blackwell GPUs, it runs with native FP4 math. On Hopper, it runs as W4A16 (4-bit weights, 16-bit activations). The precision mix is NVFP4 routed experts, FP8 shared experts and Mamba linears, and BF16 attention layers.

At 5.03 bits per element, the whole thing fits on a single 8-GPU H100 node. No multi-node setup needed for inference. This matters because most teams don't have clusters to spare for a single model.
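The single-node claim is easy to sanity-check. This back-of-the-envelope calculation assumes 80 GB H100s and counts weights only; KV cache, Mamba state, and activations need room on top of that.

```python
TOTAL_PARAMS = 550e9      # from the spec
BITS_PER_ELEMENT = 5.03   # average across the mixed-precision checkpoint
GPUS_PER_NODE = 8
GB_PER_GPU = 80           # assumption: 80 GB H100 variant

weight_gb = TOTAL_PARAMS * BITS_PER_ELEMENT / 8 / 1e9
node_gb = GPUS_PER_NODE * GB_PER_GPU
headroom_gb = node_gb - weight_gb

print(f"weights:  {weight_gb:.1f} GB")
print(f"node:     {node_gb} GB")
print(f"headroom: {headroom_gb:.1f} GB for KV cache, state, and activations")
```

For comparison, the same parameter count in BF16 would need 1,100 GB for weights alone, which is why the low-precision checkpoint is what makes single-node serving possible.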

The OpenMDW-1.1 license from the Linux Foundation allows commercial use with modification and redistribution. The only requirement is a "Built on NVIDIA Cosmos" attribution on products.

The Nemotron Family

This isn't NVIDIA's first Nemotron 3 release, but it's the one that matters:

Model	Parameters	Active	Target
Nemotron 3 Nano	30B	3B	Edge, local dev
Nemotron 3 Super	120B	12B	Cloud, workstation
Nemotron 3 Ultra	550B	55B	Datacenter, production agents

The Nano runs at 66.6 tokens per second on a Strix Halo. The Ultra is for teams that need the full 550B knowledge base running at production speed.

What Surprised Me

The benchmark gap between US and Chinese open-weight models is real but shrinking fast. Kimi K2.6 at 54 vs Nemotron 3 Ultra at 48 is a 6-point gap, about 11% of Kimi's score. Six months ago, that gap was 30%+. NVIDIA is closing it, but not by making a smarter model. They're making a faster one.

That's the right bet. The agent market doesn't care about GPQA Diamond scores. It cares about tokens per second and cost per task. At 300+ tokens per second and 30% lower cost than comparable open models, Nemotron 3 Ultra is the first US open-weight model you could actually put in production for agent workloads.

The CodeRabbit team found a high retry rate in strict-output workflows. That's a real limitation. The model benefits from good harness design and retry logic, not blind deployment. But for teams building agentic coding pipelines where the model is one part of a larger loop, the speed advantage compounds across thousands of calls.
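What does "good harness design and retry logic" look like? A minimal sketch, assuming a strict JSON output contract: validate each reply, and on failure re-prompt with the rejection reason. The `call_model` callable is a placeholder for whatever client you use, and `flaky_model` is a stub standing in for a real endpoint.

```python
import json
from typing import Callable, Sequence


def call_with_retries(
    call_model: Callable[[str], str],
    prompt: str,
    required_keys: Sequence[str],
    max_attempts: int = 3,
) -> dict:
    """Call the model until it returns a JSON object with all required keys."""
    current_prompt = prompt
    last_error = "no attempts made"
    for _ in range(max_attempts):
        raw = call_model(current_prompt)
        try:
            parsed = json.loads(raw)
        except json.JSONDecodeError as exc:
            last_error = f"invalid JSON: {exc.msg}"
        else:
            if not isinstance(parsed, dict):
                last_error = "top-level value is not a JSON object"
            else:
                missing = [k for k in required_keys if k not in parsed]
                if not missing:
                    return parsed
                last_error = f"missing keys: {missing}"
        # Feed the rejection reason back so the next attempt can correct it.
        current_prompt = (
            f"{prompt}\n\nYour previous reply was rejected ({last_error}). "
            "Reply with a single JSON object and nothing else."
        )
    raise ValueError(f"no valid output after {max_attempts} attempts: {last_error}")


# Stub model: fails twice in different ways, then complies.
replies = iter([
    "Sure! Here is the JSON you asked for:",
    '{"file": "app.py"}',
    '{"file": "app.py", "patch": "fix off-by-one"}',
])


def flaky_model(prompt: str) -> str:
    return next(replies)


result = call_with_retries(flaky_model, "Propose a patch as JSON.", ["file", "patch"])
print(result)
```

At 300+ tokens per second, a retry or two costs little wall-clock time, which is part of why a fast model tolerates this kind of loop better than a slow one.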

The honest assessment: if you need the absolute smartest model and don't care about speed, Kimi K2.6 still wins. If you need a model that runs fast enough for production agents on US soil with a clean license, Nemotron 3 Ultra is the only option right now.
