91.9% coding score. The government decides who uses it.

GPT-5.6 Sol benchmark comparison from OpenAI's deployment safety system card

OpenAI shipped GPT-5.6 yesterday. It's their best model yet. And almost nobody can use it.

The number that matters: 91.9%. That's GPT-5.6 Sol Ultra on Terminal-Bench 2.1, the benchmark that tests whether an AI can plan, iterate, and coordinate tools across complex command-line workflows. For context, Claude Mythos 5 scored 84.3%. GPT-5.5 got 88.0%. The jump from 88.0% to 91.9% doesn't sound dramatic until you realize these are tasks where most models fail hard on the planning step, not the execution step. Sol Ultra doesn't just write better code. It sequences better.

But here's the part that actually matters: only about 20 companies get to find out.

A three-tier model with a government lock

OpenAI didn't ship one model. They shipped three, and they named them like a sci-fi crew: Sol (the flagship), Terra (the mid-tier workhorse), and Luna (cheap and fast). The pricing tells the story of where OpenAI wants the market to go:

Model	Role	Input / Output (per 1M tokens)
Sol	Flagship	$5.00 / $30.00
Terra	Balanced daily work	$2.50 / $15.00
Luna	High-volume	$1.00 / $6.00

Terra is positioned as "GPT-5.5 performance at half the cost." Luna is aimed at bulk classification and high-throughput workflows. Sol is the one that scored 91.9% and matched Anthropic's Mythos on ExploitBench while using one-third of the output tokens.

The pricing is interesting on its own. Sol holds the same rate card as GPT-5.5 ($5/$30), which means OpenAI is treating this as a generational replacement, not a premium tier. If you were paying $5/M for GPT-5.5 input tokens, you'll pay the same for Sol. The performance jump is the upgrade, not the price.
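To make the rate card concrete, here is a minimal sketch that prices one workload across the three tiers. The rates are copied from the table above. The function name `workload_cost` and the 200M-input / 40M-output monthly workload are assumptions chosen purely for illustration.

```python
# Per-1M-token rates in USD, copied from the pricing table: (input, output).
RATES = {
    "sol": (5.00, 30.00),
    "terra": (2.50, 15.00),
    "luna": (1.00, 6.00),
}


def workload_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Return the USD cost of a workload at the listed per-1M-token rates."""
    input_rate, output_rate = RATES[model]
    return (input_tokens * input_rate + output_tokens * output_rate) / 1_000_000


if __name__ == "__main__":
    # Hypothetical monthly workload: 200M input tokens, 40M output tokens.
    for name in RATES:
        cost = workload_cost(name, 200_000_000, 40_000_000)
        print(f"{name}: ${cost:,.2f}")
```

On that assumed workload, Sol comes to $2,200, Terra to $1,100, and Luna to $440. Every tier charges six times as much for output as for input, so output volume drives most of the bill.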

But the real story isn't the benchmarks or the pricing. It's the rollout.

The permission slip nobody asked for

Two weeks ago, the White House sent Anthropic an export control directive that forced the company to take its most advanced models offline for all customers. Anthropic's Mythos and Fable models went dark. The company's own employees were barred from using their own flagship products. The dispute is still unresolved.

Now OpenAI has confirmed that the Trump administration asked them to delay GPT-5.6's public release. Instead of launching to everyone, OpenAI is starting with a small group of "trusted partners" whose identities have been shared with the government for pre-approval. The company says it does not have full visibility into the criteria used for these approvals.

OpenAI's own blog post is worth quoting directly: "We don't think this kind of government access process should become the long-term default. It keeps the best tools from users, developers, enterprises, cyber defenders, and global partners who need them."

Read that again. The company that built the model is publicly saying the government's process is harmful. And they're complying anyway.

The executive order Trump signed in early June established a "voluntary" 30-day review process for AI labs. But there's no formal framework yet. What exists is a de facto licensing regime that OpenAI and Anthropic are navigating in real time, with no clear rules and no appeals process. One HN commenter put it bluntly: "So where are the champions of capitalism?" Another replied: "This has ALWAYS been capitalism."

What's actually new in the model

Beyond the political theater, there are genuine technical advances in GPT-5.6 worth understanding.

Sol introduces a new "maximum reasoning effort" mode. This isn't just a longer chain-of-thought. The model gets additional compute budget to think through multi-step problems before producing output. It's similar in spirit to what Anthropic does with extended thinking, but OpenAI's implementation apparently scales differently.

The bigger architectural shift is "Ultra mode," which uses subagents. Instead of one model call handling everything, Sol Ultra can spawn subsidiary agents to handle subtasks in parallel. This is the first time OpenAI has shipped this capability in a production model. It explains the Terminal-Bench jump: the planning failure mode that plagues most coding agents gets partially sidestepped when you can delegate subtasks to agents that each handle a narrower scope.
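The delegation pattern described above can be sketched in a few lines. This is not OpenAI's implementation or API, which has not been published in detail. `run_subagent` and `orchestrate` are hypothetical names, and the subagent is a stub that stands in for a real model call.

```python
import asyncio


async def run_subagent(subtask: str) -> str:
    """Hypothetical stand-in for one narrowly scoped agent.

    A real subagent would call a model here; this stub just yields
    control once to simulate I/O-bound work and echoes its subtask.
    """
    await asyncio.sleep(0)
    return f"done: {subtask}"


async def orchestrate(subtasks: list[str]) -> dict[str, str]:
    """Run all subtasks concurrently and collect results keyed by subtask."""
    # gather() returns results in the same order as its arguments.
    results = await asyncio.gather(*(run_subagent(s) for s in subtasks))
    return dict(zip(subtasks, results))


if __name__ == "__main__":
    # Assumed plan: the orchestrator has already split the job into three parts.
    plan = ["write migration", "update tests", "fix lint errors"]
    for subtask, result in asyncio.run(orchestrate(plan)).items():
        print(f"{subtask} -> {result}")
```

The point of the sketch is the shape, not the stub: each subagent sees only its own subtask, so a planning mistake in one branch does not have to derail the others.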

There are also unconfirmed reports of a 1.5 million token context window, up from GPT-5.5's 1 million. That's a 50% increase. The engineering behind it reportedly involves FlashAttention-4 kernels optimized for NVIDIA Blackwell GPUs, grouped-query attention to reduce KV cache size, and ring attention to distribute context across multiple GPU nodes. But here's the catch that nobody wants to talk about: research on long-context models has repeatedly found a "lost in the middle" effect, where information placed in the middle of a long context is recalled less reliably than information near the beginning or end. A bigger window doesn't mean better utilization of that window.
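To see why grouped-query attention matters at this scale, here is a back-of-the-envelope KV cache estimate using the standard sizing formula (keys plus values, per layer, per KV head, per token). GPT-5.6's architecture is not public, so every model dimension below (80 layers, 64 query heads, 8 KV heads, head dimension 128, 2-byte values) is a hypothetical chosen for illustration only.

```python
def kv_cache_bytes(
    seq_len: int,
    n_layers: int,
    n_kv_heads: int,
    head_dim: int,
    bytes_per_value: int = 2,
) -> int:
    """Estimate KV cache size in bytes for one sequence.

    The leading factor of 2 counts one key tensor and one value tensor
    per layer; bytes_per_value=2 assumes 16-bit storage.
    """
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_value


if __name__ == "__main__":
    # Hypothetical dimensions; not GPT-5.6's real configuration.
    layers, query_heads, kv_heads, head_dim = 80, 64, 8, 128
    tokens = 1_500_000

    # Standard multi-head attention keeps one KV head per query head.
    mha = kv_cache_bytes(tokens, layers, query_heads, head_dim)
    # Grouped-query attention shares each KV head across a group of query heads.
    gqa = kv_cache_bytes(tokens, layers, kv_heads, head_dim)

    print(f"multi-head:    {mha / 1e9:,.1f} GB")
    print(f"grouped-query: {gqa / 1e9:,.1f} GB")
    print(f"reduction:     {mha // gqa}x")
```

Under these assumed dimensions, sharing each KV head across eight query heads cuts the cache by a factor of eight. The cache also grows linearly with sequence length, so a 1.5 million token window needs 1.5 times the cache of a 1 million token one, which is part of why splitting context across nodes becomes attractive.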

The safety story is equally interesting. OpenAI dedicated over 700,000 A100-equivalent GPU hours to automated red-teaming. The model includes real-time misuse classifiers, account-level behavioral monitoring, and human-in-the-loop red-teaming. By OpenAI's own assessment, Sol does not cross the "Cyber Critical" threshold in its Preparedness Framework. It can find vulnerabilities and produce pieces of exploits, but it cannot carry out autonomous end-to-end attacks against hardened targets.

That distinction matters. The government restricted access partly because of cybersecurity concerns. But OpenAI's own testing shows the model isn't capable of the kind of autonomous cyberattacks that would justify that level of restriction. The gap between "can find a vulnerability" and "can exploit it autonomously" is enormous, and Sol sits firmly on the less dangerous side of that line.

The community isn't buying it

The Reddit and HN reactions are split, but not in the way you'd expect.

On r/codex, one user reported that GPT-5.6 was "worse for my usage" compared to 5.4 and 5.5, calling it "a shift in how it behaves." Multiple commenters noted that the model seems to overthink simple tasks, likely due to the new reasoning mode being enabled by default. The consensus: wait for GA and the ability to tune reasoning effort.

On HN, the political angle dominated. The thread about government-restricted access hit the front page within minutes. The sentiment is overwhelmingly negative about the precedent, with several commenters drawing parallels to arms export controls and noting the irony of a "voluntary" process that isn't voluntary at all.

The benchmark community is more measured. Terminal-Bench 2.1 scores are impressive, but independent verification is pending. OpenAI chose which benchmarks to publish, and the ones they picked happen to be the ones where Sol looks best. SWE-bench Verified and FrontierMath Tier 4 results aren't available yet. Until independent evaluators run their own tests, the 91.9% number is a marketing data point, not a scientific finding.

The pricing angle is getting less attention than it should. Terra at $2.50/$15 is positioned as a direct Claude Fable 5 competitor at roughly one-third the cost. If that pricing holds at GA, it changes the economics of every production deployment that's currently paying $15/M for Fable output tokens. That's not a marginal improvement. That's a structural shift in who can afford to build with frontier models.

What happens next

OpenAI says general availability will arrive in the "coming weeks." They're working with the administration to establish a "repeatable framework" for future releases. CEO Sam Altman confirmed on X that the company intends to move toward wider access as quickly as possible.

But the precedent is set. The two most capable AI labs in the world just had their flagship products restricted by government directive, with no formal legal process, no defined criteria, and no timeline for resolution. OpenAI is complying while publicly objecting. Anthropic is still offline. And the developers who were planning to benchmark these models against their production workloads are stuck waiting for permission that may never come in a form they can rely on.

The 91.9% score is real. The model is genuinely better than what came before. And the most likely path to using it involves a government approval process that nobody voted for and nobody knows how to navigate.
