That 3B model matched Claude Opus at math. Benchmarks broken?

Your coding agent keeps failing on long context. Or maybe it is the math problems that trip it up. A team at Sina Weibo just posted a 14-page paper claiming a 3B parameter model matches Claude Opus 4.5 on hard math. Half the AI community is losing its mind. The other half is calling it benchmark theater.

VibeThinker-3B is a compact dense model built on Qwen2.5-Coder-3B, released under the MIT license by researchers at Sina Weibo, the Chinese social media company better known for its microblogging platform than for cutting-edge AI research. The numbers are hard to dismiss: 94.3 on AIME26, 80.2 on LiveCodeBench v6, and a 96.1% acceptance rate on 128 unseen LeetCode problems from April-May 2026. For context, Claude Opus 4.5 scored 95.1 on AIME26; DeepSeek V3.2, a 671B parameter model, scored roughly 94; and Gemini 3 Pro hit 91.7.

The model fits on a single GPU with 6GB of memory. The team claims that post-training its previous 1.5B model cost $7,800, compared with $294,000 for larger competitors. Nine researchers authored the paper, and the weights are available on Hugging Face.
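The 6GB figure lines up with a back-of-the-envelope estimate for 16-bit weights. The sketch below is only that arithmetic: it assumes exactly 3 billion parameters and counts the weights alone, so the KV cache and activations for long reasoning traces would come on top. The function name and the precision labels are illustrative, not taken from the paper.

```python
def weight_memory_gb(n_params: float, bytes_per_param: float) -> float:
    """Memory for the weights alone, in decimal gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bytes_per_param / 1e9


# Assumption: a round 3 billion parameters; the real count may differ slightly.
N_PARAMS = 3e9

# Bytes per parameter for common storage precisions.
footprints = {
    "fp32": weight_memory_gb(N_PARAMS, 4),
    "bf16": weight_memory_gb(N_PARAMS, 2),
    "int8": weight_memory_gb(N_PARAMS, 1),
    "int4": weight_memory_gb(N_PARAMS, 0.5),
}

for precision, gb in footprints.items():
    print(f"{precision}: {gb:.1f} GB")
```

Under these assumptions the 16-bit weights come to 6.0 GB, which leaves no headroom on a 6GB card, so in practice a quantized variant or a larger GPU is the more likely setup for long outputs.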

Benchmark	VibeThinker-3B	Claude Opus 4.5	DeepSeek V3.2	Gemini 3 Pro
AIME26	94.3 (97.1 w/CLR)	95.1	~94	91.7
LiveCodeBench v6	80.2	-	-	-
HMMT25	89.3 (95.4 w/CLR)	-	-	-
GPQA-Diamond	70.2	~90+	~85+	~88+
IFEval	93.4	-	-	-

Architecture

VibeThinker-3B uses a four-stage post-training pipeline called the Spectrum-to-Signal framework. The core idea: build a broad space of valid reasoning paths first, then amplify the correct ones.

Curriculum SFT: Two-stage supervised fine-tuning. First pass covers broad STEM and coding data. Second pass filters out easy problems and trains only on high-difficulty, long-horizon samples. This forces the model to focus its limited parameters on hard reasoning patterns.
MaxEnt-Guided Policy Optimization (MGPO): Reinforcement learning with a 64K-token context window, a stage the team calls Multi-domain Reasoning RL. Training concentrates on the model's "capability boundary", the problems where it is still uncertain, rather than on problems it already solves confidently.
Offline Self-Distillation: Merges RL checkpoints into a single student model. This compresses the reasoning gains from multiple RL runs into one coherent model.
Instruct RL: A final round of RL for instruction following, intended to make the model follow user instructions while preserving the reasoning gains from the first three stages.
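The article does not spell out the MGPO objective, so the following is only an illustration of the "capability boundary" idea, not the team's method. It assumes one hypothetical way to find that boundary: estimate each problem's pass rate from sampled rollouts, score it by the entropy of that pass/fail outcome, and sample training problems in proportion to the score. The names `binary_entropy`, `boundary_weights`, and the example pass rates are all made up for the sketch.

```python
import math


def binary_entropy(p: float) -> float:
    """Entropy in bits of a pass/fail outcome with pass rate p (peaks at p = 0.5)."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))


def boundary_weights(pass_rates: dict[str, float]) -> dict[str, float]:
    """Turn per-problem pass rates into sampling weights that favor uncertain problems."""
    scores = {pid: binary_entropy(p) for pid, p in pass_rates.items()}
    total = sum(scores.values())
    if total == 0:
        # Every problem is always solved or always failed: fall back to uniform.
        return {pid: 1 / len(scores) for pid in scores}
    return {pid: s / total for pid, s in scores.items()}


# Hypothetical pass rates estimated from sampled rollouts.
pass_rates = {"easy": 1.0, "borderline": 0.5, "hard": 0.125, "impossible": 0.0}
weights = boundary_weights(pass_rates)

for pid, w in weights.items():
    print(f"{pid}: {w:.3f}")
```

In this toy setup, problems the model always solves or always fails get zero weight, and the coin-flip problem gets the most, which is the intuition behind training at the frontier rather than on easy wins.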

The model also introduces Claim-Level Reliability Assessment (CLR), a parameter-free test-time scaling method. It generates 32 trajectories per problem, extracts 5 decision-relevant claims per trajectory, has the model verify its own claims using binary verdicts, and weights answers by reliability. Higher-reliability answers get more weight in the final output. With CLR, AIME26 jumps to 97.1 and HMMT25 reaches 95.4.
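A minimal sketch of reliability-weighted voting, assuming the simplest possible aggregation: a trajectory's reliability is the fraction of its claims the model verified, and each answer's score is the sum of reliabilities of the trajectories that produced it. The article only says answers are weighted by reliability, so this scoring rule, the `Trajectory` type, and the toy data are assumptions. Claim extraction and verification are model calls in the real method and are replaced here by precomputed verdicts.

```python
from collections import Counter, defaultdict
from dataclasses import dataclass


@dataclass
class Trajectory:
    answer: str
    verdicts: list[bool]  # one binary verdict per extracted claim


def reliability(traj: Trajectory) -> float:
    """Fraction of claims verified as correct (assumed aggregation rule)."""
    if not traj.verdicts:
        return 0.0
    return sum(traj.verdicts) / len(traj.verdicts)


def clr_vote(trajectories: list[Trajectory]) -> str:
    """Pick the answer with the highest summed reliability."""
    scores: dict[str, float] = defaultdict(float)
    for traj in trajectories:
        scores[traj.answer] += reliability(traj)
    return max(scores, key=scores.get)


def majority_vote(trajectories: list[Trajectory]) -> str:
    """Baseline: pick the most frequent answer, ignoring reliability."""
    return Counter(t.answer for t in trajectories).most_common(1)[0][0]


# Toy data: three shaky trajectories agree on "42", two well-supported ones say "17".
samples = [
    Trajectory("42", [True, False, False, False, False]),
    Trajectory("42", [True, False, False, False, False]),
    Trajectory("42", [True, False, False, False, False]),
    Trajectory("17", [True, True, True, True, True]),
    Trajectory("17", [True, True, True, True, True]),
]

print("majority:", majority_vote(samples))
print("reliability-weighted:", clr_vote(samples))
```

With this toy data the plain majority picks "42", while the reliability-weighted vote picks "17" because its supporting trajectories have all five claims verified. That is the kind of case where weighting by self-verification can change the final answer.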

The Benchmark Debate

Here is where it gets interesting. The model scores 70.2 on GPQA-Diamond, a science knowledge benchmark, while flagship models score roughly 85 to 90 or higher. The researchers argue this is by design: reasoning with verifiable answers can be compressed into small models, while open-domain factual knowledge requires broad parameter coverage. They call this the Parametric Compression-Coverage Hypothesis.

The hypothesis splits intelligence into two types. Parameter-Dense intelligence covers reasoning tasks like math and coding where answers can be verified. Parameter-Expansive intelligence covers open-domain factual knowledge that requires broad coverage and inherently demands larger parameter counts.

Critics are not buying it. On X, users report the model fails to recognize common developer tools like uv scripts and struggles with multi-turn conversation consistency. The concern: if a model scores 94.3 on AIME26 but cannot handle a basic coding workflow, what exactly are these benchmarks measuring?

"I genuinely don't know if this is a breakthrough or if the benchmarks are broken." - @orcus108 on X

"The benchmarks are literal pattern matching single file coding. It has no relation to actual coding work." - @BigMoonKR on X

The data contamination question also looms. The authors claim strict decontamination, but skeptics note the model avoids standard industry benchmarks like DeepSWE in favor of math-heavy evaluation sets. VentureBeat described the community reaction as a split between viewing this as a paradigm shift and dismissing it as "benchmaxxing", meaning models optimized specifically to score high on static tests while failing in real-world use.

What Surprised Me

The cost asymmetry is what caught my attention: $7,800 versus $294,000, a gap of roughly 38x. The $7,800 figure is the team's claim for the earlier 1.5B model, not the 3B one, so treat the comparison as indicative rather than exact. Even if these benchmarks are narrow, the implication is that specialized reasoning capabilities are becoming cheap to train. The real question is not whether a 3B model replaces Claude Opus. It is whether we have been conflating two different kinds of intelligence this whole time. Math and coding with verifiable answers may be compressible. General knowledge may not be.

If that distinction holds, the industry might split into small specialized reasoning engines paired with large knowledge-rich models. That would change the economics of AI deployment significantly. You could run a reasoning model locally on consumer hardware for math and coding tasks, while offloading knowledge-intensive work to larger cloud-based models.

The VibeThinker team put it best in their technical report: "The development of compact models is no longer merely a passive compromise... it emerges as a promising research trajectory." Whether the benchmarks prove it or not, the cost signal is real. And cost signals tend to move the industry more than benchmark scores.

Sources

arXiv: https://arxiv.org/abs/2606.16140
HuggingFace: https://huggingface.co/WeiboAI/VibeThinker-3B
VentureBeat coverage: https://venturebeat.com/technology/why-weibos-tiny-vibethinker-3b-has-the-ai-world-arguing-over-benchmarks-again
MarkTechPost: https://www.marktechpost.com/2026/06/19/vibethinker-3b-a-3b-dense-reasoning-model-built-on-qwen2-5-coder-3b-with-the-spectrum-to-signal-post-training-pipeline
HN discussion: https://news.ycombinator.com/item?id=48639240
