Enterprise-grade AI that actually runs on premises, not just in theory

Cohere just shipped something that makes me wonder why more labs don't do this. A 218-billion-parameter model that runs on two H100s and comes with no strings attached. Not a "contact sales" license. Not a "you can use it but we get to peek" model. Apache 2.0. Weights on Hugging Face. Go.

The pitch is simple and it lands hard: if your enterprise AI strategy involves sending sensitive data to some API endpoint in a jurisdiction you don't control, you have a problem that gets worse every quarter. Command A+ is Cohere's answer: a frontier-ish model you can run in your own VPC, on prem, or fully air gapped, with capabilities in the same class as the smaller hosted models from OpenAI or Anthropic.

Architecture

Command A+ is a Sparse Mixture-of-Experts Transformer: 218B total parameters, only 25B active per token. That 8.7x sparsity ratio is the whole game. You get the capacity of a 218B model at per-token compute costs closer to a 25B one, though all 218B parameters still have to fit in GPU memory.
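To make the ratio concrete, here is a small back-of-the-envelope sketch. The parameter counts come from the figures above. The estimate of roughly 2 FLOPs per parameter per token is a common rule of thumb for a forward pass, not a number Cohere has published, and the function names are purely illustrative.

```python
# Back-of-the-envelope numbers for the sparsity claim.
TOTAL_PARAMS = 218e9    # total parameters (from the article)
ACTIVE_PARAMS = 25e9    # parameters active per token (from the article)


def sparsity_ratio(total: float, active: float) -> float:
    """Ratio of total parameters to parameters used per token."""
    return total / active


def forward_flops_per_token(params: float) -> float:
    """Assumption: about 2 FLOPs per parameter per token (rule of thumb)."""
    return 2.0 * params


ratio = sparsity_ratio(TOTAL_PARAMS, ACTIVE_PARAMS)
sparse_flops = forward_flops_per_token(ACTIVE_PARAMS)
dense_flops = forward_flops_per_token(TOTAL_PARAMS)

print(f"sparsity ratio: {ratio:.1f}x")
print(f"active fraction: {ACTIVE_PARAMS / TOTAL_PARAMS:.1%}")
print(f"approx. FLOPs/token, sparse: {sparse_flops:.2e}")
print(f"approx. FLOPs/token, dense equivalent: {dense_flops:.2e}")
```

The rule of thumb ignores how attention cost grows with sequence length, so treat the output as an order-of-magnitude guide.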

The MoE layers use 128 routed experts, 8 of which are active per token, plus one shared expert applied to every token. The attention layers interleave sliding-window attention (with RoPE) and global attention in a 3:1 ratio. Context window: 128K tokens of input, 64K tokens of generation.
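For readers who have not looked inside an MoE layer, here is a minimal sketch of the two ideas in that paragraph. It is not Cohere's implementation. The gating scheme (softmax, pick the top 8, renormalize) is one common choice and is an assumption here, as are the order of layers within each group of four, the layer count, and the toy scalar "experts".

```python
import math
import random

NUM_EXPERTS = 128   # routed experts (from the article)
TOP_K = 8           # routed experts active per token (from the article)


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def route_token(router_logits, top_k=TOP_K):
    """Pick the top_k experts for one token and renormalize their weights.

    Returns a list of (expert_index, gate_weight) pairs.
    Assumption: softmax-then-top-k gating; the article does not specify it.
    """
    probs = softmax(router_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    kept = sum(probs[i] for i in top)
    return [(i, probs[i] / kept) for i in top]


def moe_output(token, router_logits, routed_experts, shared_expert):
    """Shared expert output plus the weighted outputs of the chosen experts."""
    out = shared_expert(token)
    for idx, weight in route_token(router_logits):
        out += weight * routed_experts[idx](token)
    return out


def layer_pattern(num_layers, local_per_global=3):
    """Three sliding-window layers, then one global layer, repeated."""
    return ["global" if (i + 1) % (local_per_global + 1) == 0 else "sliding"
            for i in range(num_layers)]


# Toy demo: scalar "experts" stand in for feed-forward blocks.
random.seed(0)
logits = [random.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]
experts = [lambda x, s=i: x * (s + 1) * 0.01 for i in range(NUM_EXPERTS)]
chosen = route_token(logits)
y = moe_output(1.0, logits, experts, lambda x: x)

print("experts used:", sorted(i for i, _ in chosen))
print("gate weights sum to:", round(sum(w for _, w in chosen), 6))
print("layer types:", layer_pattern(8))  # hypothetical 8-layer stack
```

Only the 8 chosen experts and the shared expert run for a given token, which is where the gap between total and active parameters comes from.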

What impressed me most is the quantization story. Cohere uses NVFP4 W4A4 for the MoE experts while keeping the attention pathways at full precision. They compensate with Quantization-Aware Distillation (QAD) and claim the quality loss is "imperceptible." At W4A4, Command A+ runs on a single B200 or two H100s. That is genuinely impressive for a 218B model.

Weight precision	B200 GPUs	H100 GPUs
BF16 (16-bit)	4×	8×
FP8 (8-bit)	2×	4×
W4A4 (4-bit)	1×	2×
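The GPU counts in the table line up with a simple weight-memory estimate. The sketch below assumes 80 GB of memory per H100 and about 180 GB per B200 (assumed figures, not from the article). It also treats every parameter as stored at the listed precision, which slightly understates the W4A4 footprint because attention stays at higher precision. KV cache and activations are ignored.

```python
TOTAL_PARAMS = 218e9  # total parameters (from the article)

# Bytes per parameter at each weight precision.
BYTES_PER_PARAM = {"BF16": 2.0, "FP8": 1.0, "W4A4": 0.5}

# Assumed usable memory per GPU in GB; these are not from the article.
GPU_MEMORY_GB = {"B200": 180, "H100": 80}

# GPU counts as listed in the table above.
TABLE_GPU_COUNT = {
    "BF16": {"B200": 4, "H100": 8},
    "FP8": {"B200": 2, "H100": 4},
    "W4A4": {"B200": 1, "H100": 2},
}


def weight_gb(precision: str) -> float:
    """Approximate size of the weights alone, in GB."""
    return TOTAL_PARAMS * BYTES_PER_PARAM[precision] / 1e9


def headroom_gb(precision: str, gpu: str) -> float:
    """Memory left for KV cache and activations at the table's GPU count."""
    capacity = TABLE_GPU_COUNT[precision][gpu] * GPU_MEMORY_GB[gpu]
    return capacity - weight_gb(precision)


for precision in BYTES_PER_PARAM:
    for gpu in GPU_MEMORY_GB:
        count = TABLE_GPU_COUNT[precision][gpu]
        print(f"{precision:5s} on {count}x {gpu}: "
              f"weights ~{weight_gb(precision):.0f} GB, "
              f"headroom ~{headroom_gb(precision, gpu):.0f} GB")
```

Under these assumptions every configuration in the table leaves some headroom. The real limit in practice is usually the KV cache at long context lengths.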

The model also uses a new tokenizer that cuts token counts by 20% for Arabic, 18% for Japanese, and 16% for Korean compared to the previous generation. That's meaningful for enterprises serving non-English markets, since fewer tokens for the same text means lower inference cost and more text per context window.
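A quick sketch of what those percentages mean in practice. The reduction figures are the ones quoted above. The one-million-token workload is a hypothetical number chosen for illustration.

```python
# Token reductions reported for the new tokenizer (from the article).
TOKEN_REDUCTION = {"Arabic": 0.20, "Japanese": 0.18, "Korean": 0.16}


def tokens_after(old_tokens: float, reduction: float) -> float:
    """Token count for the same text under the new tokenizer."""
    return old_tokens * (1.0 - reduction)


def text_capacity_gain(reduction: float) -> float:
    """How much more text fits in a fixed token budget."""
    return 1.0 / (1.0 - reduction)


OLD_TOKENS = 1_000_000  # hypothetical workload under the old tokenizer

for language, reduction in TOKEN_REDUCTION.items():
    print(f"{language}: {tokens_after(OLD_TOKENS, reduction):,.0f} tokens, "
          f"{text_capacity_gain(reduction):.2f}x text per context window")
```

A 20% drop in tokens works out to 25% more text in the same window, so the gain in usable context is a little larger than the headline percentage.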

Benchmarks & Performance

Command A+ unifies four prior models (Command A, Reasoning, Vision, Translate) into one architecture. The benchmark results tell the story:

Agentic and math tasks saw the biggest jumps:

τ²-Bench Telecom: 37% → 85% (+48 points)
Terminal-Bench Hard: 3% → 25% (+22 points)
AIME 25 (Math): 57% → 90% (+33 points)

Multimodal reasoning:

MMMU Pro: 63%
MMMU: 75.1%
MathVista: 80.6%

On the Artificial Analysis Intelligence Index, Command A+ scores 37, on par with Claude 4.5 Haiku and above NVIDIA Nemotron 3 Super and Gemini 3.1 Flash-Lite. Its non-hallucination reliability is 86%. That is interesting because it suggests Cohere optimized for "knowing when to say I don't know" rather than maximizing raw accuracy at any cost, and the model's AA-Omniscience profile points the same way.

Speed is competitive: roughly 281 output tokens per second on Cohere's API (faster than GPT-5.4 nano and Claude 4.5 Haiku, slightly slower than Gemini 3.1 Flash-Lite at 304 tok/s). With speculative decoding enabled, you get an additional 1.5-1.6x speedup on the MoE architecture.
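To put those throughput numbers in wall-clock terms, here is a rough sketch. The 281 tok/s and 1.5-1.6x figures are from the paragraph above. The response length is hypothetical, and the article does not say whether the API figure already includes speculative decoding, so the speedup is applied here to an assumed baseline of the same rate. Time to first token is ignored.

```python
API_TOKENS_PER_SECOND = 281.0      # from the article
SPEC_DECODE_SPEEDUP = (1.5, 1.6)   # from the article


def generation_seconds(num_tokens: float, tokens_per_second: float,
                       speedup: float = 1.0) -> float:
    """Seconds to stream num_tokens at a steady decode rate."""
    if tokens_per_second <= 0 or speedup <= 0:
        raise ValueError("rate and speedup must be positive")
    return num_tokens / (tokens_per_second * speedup)


RESPONSE_TOKENS = 4_000  # hypothetical long agentic response

baseline = generation_seconds(RESPONSE_TOKENS, API_TOKENS_PER_SECOND)
with_spec = [generation_seconds(RESPONSE_TOKENS, API_TOKENS_PER_SECOND, s)
             for s in SPEC_DECODE_SPEEDUP]

print(f"baseline: {baseline:.1f} s")
print(f"with speculative decoding: {with_spec[1]:.1f} to {with_spec[0]:.1f} s")
```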

Community Reaction

The r/LocalLLaMA thread came with a personal touch. Nick Frosst, Cohere co-founder, posted the announcement himself, referencing an earlier comment from co-founder Aidan Gomez promising more powerful open-weight models. The thread is less "wow this is amazing" enthusiasm and more "let's see the numbers" pragmatism, which is probably the right reaction for an enterprise model.

Some notable points from the discussion:

The Apache 2.0 license is the real headline. Previous Cohere models were CC-BY-NC, which limited commercial use. This is their first fully permissive release.
The W4A4 quantization on 2 H100s means mid-size enterprises can actually run this without a GPU cluster budget.
One recurring question: how does it compare to Qwen 3.5 at similar active parameter counts? Cohere hasn't published those direct comparisons yet.

Sources

Cohere Blog: https://cohere.com/blog/command-a-plus
Hugging Face Model: https://huggingface.co/CohereLabs/command-a-plus-05-2026-bf16
VentureBeat Analysis: https://venturebeat.com/technology/cohere-cracks-lossless-quantization-and-native-citations-with-first-full-apache-2-0-licensed-open-model-command-a
MarkTechPost: https://www.marktechpost.com/2026/05/21/cohere-releases-command-a-a-218b-sparse-moe-model-for-agentic-workflows-that-runs-on-as-few-as-two-h100-gpus
Artificial Analysis: https://artificialanalysis.ai/articles/cohere-launches-open-weights-model-command-a-more-than-a-year-since-the-command-a-release
BusinessWire: https://www.businesswire.com/news/home/20260520121796/en/Cohere-Releases-Command-A-An-Open-Source-Enterprise-AI-Model-Built-for-Sovereign-Critical-Infrastructure
Reddit (Nick Frosst announcement): https://www.reddit.com/r/LocalLLaMA/comments/1tizmar/re_what_ever_happened_to_coheres_commanda_series/
Las Vegas Sun: https://lasvegassun.com/news/2026/may/20/cohere-releases-command-a-an-open-source-enterpris

So What

This release matters for a specific reason that has nothing to do with benchmark scores. Cohere is the first major Western lab to bet its enterprise strategy on full openness. Apache 2.0 on a 218B MoE model that runs on two H100s is a real product, not a research demo.

The "sovereign AI" pitch is timely. Europe has the EU AI Act. Financial services and healthcare have data residency requirements. Defense contractors have air-gap mandates. These customers often cannot route their core workflows through OpenAI, Anthropic, or Google Cloud APIs. They need models that ship inside their perimeter. Command A+ is the most credible answer to that need I've seen from a Western vendor.

The open question is ecosystem. Cohere has $1.6B in funding and the "Attention Is All You Need" pedigree, but they lack the community momentum of Llama, the distribution of Google, or the developer mindshare of OpenAI. Apache 2.0 helps. But tooling, fine-tuning recipes, and community adapters matter more than license terms when you're building production systems.

Worth watching: whether Cohere's recent merger with Aleph Alpha accelerates their European go-to-market. German sovereign AI infrastructure plus a Canadian open-source model is a stronger story than either alone.
