A Benchmark Caught AI Faking Answers to Broken Problems

Your AI coding assistant just gave you a confident, well-reasoned answer to a problem that has no solution. You would never know, because the explanation sounds perfectly plausible.

A team of 64 mathematicians just built a benchmark to catch exactly this behavior. It is called Soohak, and the results are uncomfortable for anyone deploying frontier models in production.

The Benchmark

Soohak (arXiv 2605.09063) contains 439 original problems written from scratch by 64 mathematicians, including 38 professors and 5 IMO medalists. No AI assistance was used during problem creation. The dataset is currently locked under embargo to prevent training contamination, with a public release planned for late 2026.

The problems split into two categories:

Subset	Count	Purpose
Challenge	340	Graduate-level and research-adjacent math problems
Refusal	99	Intentionally flawed problems with contradictions or missing assumptions

There is also a companion set called SOOHAK-Mini with 702 questions covering high-school olympiad through early graduate material.

The motivation: existing benchmarks like MATH, GSM8K, and even FrontierMath have been saturated. Models scoring 90 percent on those tests tell us very little about what happens when you push past the competition-math ceiling into actual research territory.

Research-Level Math Performance

On the Challenge subset, no model came close to mastery:

Model	Accuracy
Gemini 3 Pro	30.4%
GPT-5	26.4%
Claude Opus 4.5	10.4%
Kimi 2.5 (best open-weight)	13.9%
Qwen3 235B	under 15%

Across three attempts per model, 124 of the 340 Challenge problems were never solved by any model. These are problems a skilled graduate student with a whiteboard could eventually work through, and the best frontier models simply cannot touch them.
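The unsolved count is a "never solved by any model in any attempt" tally. A minimal sketch of how such a tally could be computed is below. The data layout, the function name `unsolved_by_all`, and the toy results are assumptions for illustration, not the paper's evaluation code.

```python
from typing import Dict, List, Set

# Hypothetical layout: attempts[model][problem_id] -> one flag per attempt.
Attempts = Dict[str, Dict[str, List[bool]]]


def unsolved_by_all(attempts: Attempts) -> Set[str]:
    """Return the problem ids that no model solved in any attempt."""
    all_problems: Set[str] = set()
    solved: Set[str] = set()
    for per_problem in attempts.values():
        for problem_id, flags in per_problem.items():
            all_problems.add(problem_id)
            if any(flags):  # solved at least once by this model
                solved.add(problem_id)
    return all_problems - solved


# Toy data (made up): two models, three problems, three attempts each.
toy = {
    "model_a": {
        "p1": [False, True, False],
        "p2": [False, False, False],
        "p3": [False, False, False],
    },
    "model_b": {
        "p1": [False, False, False],
        "p2": [False, False, False],
        "p3": [True, True, False],
    },
}

print(sorted(unsolved_by_all(toy)))  # ['p2']

# Share of the Challenge subset that stayed unsolved, from the reported counts.
print(f"{124 / 340:.1%}")  # 36.5%
```

Put differently, a little over a third of the Challenge subset was out of reach for every model tested.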

Performance scales roughly linearly with both training compute and test-time reasoning budget. Spending more tokens on thinking produces more correct answers. That part tracks with everything we already knew about scaling laws.

The Refusal Gap

This is where things get interesting. The Refusal subset tests something most benchmarks ignore: can a model recognize when a problem is broken and decline to answer?

The results are stark:

Model	Refusal Accuracy
GLM-5	49.5%
Gemini 3 Pro	under 50%
GPT-5	under 50%
Qwen3 family	under 3%

No model exceeded 50 percent. The authors put it directly: more compute makes models better at solving. It does not make them better at admitting a problem has no answer.

GLM-5 is the outlier here. The open-weight model from Zhipu AI nearly hit 50 percent on refusal, outperforming every closed-source model tested. The Qwen3 family collapsed to under 3 percent on refusal, confidently generating answers for problems that were mathematically incoherent.
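To make the metric concrete, here is a minimal sketch of a refusal-accuracy score, assuming each response to a flawed problem is graded with a single yes/no label for whether the model flagged the problem as broken. The function name `refusal_accuracy` and the toy labels are hypothetical; the paper's actual grading protocol may differ.

```python
from typing import Iterable


def refusal_accuracy(flagged_as_broken: Iterable[bool]) -> float:
    """Fraction of flawed problems where the model declined or flagged the flaw."""
    flags = list(flagged_as_broken)
    if not flags:
        raise ValueError("no responses to score")
    return sum(flags) / len(flags)


# Toy labels (made up): 99 flawed problems, 49 flagged correctly.
# 49 of 99 is one count consistent with a 49.5% score; the article
# does not state the exact count.
toy_labels = [True] * 49 + [False] * 50
print(f"{refusal_accuracy(toy_labels):.1%}")  # 49.5%
```

Under this scoring, a model that answers everything confidently scores zero no matter how polished its answers look, which is the behavior the Refusal subset is built to expose.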

This is not just a math problem. It looks like a general failure mode. If a model cannot recognize a contradictory or missing assumption in a math problem, there is little reason to expect it to recognize one in your production code, your legal contract analysis, or your medical triage pipeline.

The Human Baseline

The authors recruited 25 humans ranging from IMO medalists to PhD researchers and tested them on a 79-problem sample. The aggregated human team covered 50.6 percent of problems.

Gemini 3 Pro hit 61 percent on the same sample, outperforming the human group. But there is a catch: contest-trained mathematicians with IMO experience crushed the PhD researchers. The benchmark rewards short solution paths suited to time-constrained environments, not the slow, specialized depth that actual research mathematicians bring to a problem.

This means Soohak measures a specific slice of mathematical ability, and a different benchmark might rank models differently.
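One more caveat on the comparison: an aggregated team score is not an individual score. The sketch below assumes "covered" means a problem was solved by at least one participant, which this article does not define precisely. The names and toy data are hypothetical.

```python
from typing import Dict, Iterable, Set


def team_coverage(solved_by_person: Dict[str, Set[str]],
                  all_problems: Iterable[str]) -> float:
    """Fraction of problems solved by at least one team member."""
    problems = set(all_problems)
    if not problems:
        raise ValueError("empty problem set")
    covered: Set[str] = set()
    for solved in solved_by_person.values():
        covered |= solved & problems  # ignore ids outside the sample
    return len(covered) / len(problems)


# Toy data (made up): five problems, two participants.
problems = ["q1", "q2", "q3", "q4", "q5"]
solved = {
    "contest_specialist": {"q1", "q2"},
    "phd_researcher": {"q2", "q3"},
}

print(team_coverage(solved, problems))  # 0.6
best_individual = max(len(s) for s in solved.values()) / len(problems)
print(best_individual)  # 0.4
```

Under that assumption, the team figure is an upper bound on what any single participant solved, so a model beating the aggregate is beating the union of 25 people, not one of them.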

Community Reaction

The math community has already started discussing the implications. The consensus among researchers is that refusal capability should become an explicit optimization target, not an emergent property we hope scales with compute.

From a practical standpoint, anyone running AI models on tasks where a wrong answer costs money needs to know: bigger models do not give fewer confident wrong answers on broken problems. Solving ability scales with compute. Refusal does not.

"Solution rates climb almost linearly with bigger models and longer reasoning budgets. Refusal does not follow the same pattern."

That quote is from the paper itself. The authors are not hedging.

Why This Matters

We have spent two years chasing benchmarks that measure what models can do. Soohak measures what they cannot refuse to do. And on the evidence so far, scale alone is not closing that gap.

The dataset embargo means model makers cannot train on it yet. When it releases in late 2026, we will see whether labs treat refusal as a real training objective or just another leaderboard to optimize away.

Sources

arXiv paper: https://arxiv.org/abs/2605.09063
HuggingFace Daily Papers: https://huggingface.co/papers/2605.09063
The Decoder coverage: https://the-decoder.com/new-math-benchmark-reveals-ai-models-confidently-solve-problems-that-have-no-solution
AI Research Roundup video: https://www.youtube.com/watch?v=HX4wnJU6kr8
