Zhipu just shipped a 1M context window for coding. Where are the benchmarks?

Your coding agent just hit the context wall again. You fed it 200K tokens of codebase and it started hallucinating imports that don't exist. Zhipu just tried to fix that. GLM 5.2 ships today with a 1M token context window, five times what GLM-5 offered, and it's available on every tier of their coding plan. The catch? No benchmarks. At all.

What GLM 5.2 Actually Is

GLM 5.2 is Zhipu AI's latest model, released June 13, 2026. It's a coding-focused update to the GLM-5 family, expanding the context window to 1M tokens, up from 200K. That's a genuine technical jump. Most frontier models hover between 128K and 256K. Pushing to 1M means you can theoretically feed an entire medium-sized codebase into a single prompt.
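Whether a given codebase actually fits is easy to sanity-check before paying for a plan. The sketch below is a rough estimate only: it assumes about 4 characters per token, a common rule of thumb and not GLM 5.2's real tokenizer, and the function names, file extensions, and reserve figure are made up for illustration.

```python
import os

# Assumption: ~4 characters per token. This is a rule of thumb,
# not the tokenizer GLM 5.2 actually uses.
CHARS_PER_TOKEN = 4


def estimate_tokens(text: str, chars_per_token: float = CHARS_PER_TOKEN) -> int:
    """Rough token estimate from character count."""
    return int(len(text) / chars_per_token)


def estimate_repo_tokens(root: str, extensions: tuple = (".py", ".ts", ".go")) -> int:
    """Sum the rough token estimate over all matching source files under root."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if name.endswith(extensions):
                path = os.path.join(dirpath, name)
                with open(path, encoding="utf-8", errors="ignore") as handle:
                    total += estimate_tokens(handle.read())
    return total


def fits(tokens: int, window: int, reserve: int = 0) -> bool:
    """True if the prompt plus a reserve for the model's reply fits in the window."""
    return tokens + reserve <= window
```

Calling `fits(estimate_repo_tokens("."), 1_000_000, reserve=50_000)` at a repository root gives a first answer. Swap in the real tokenizer once the weights are out, since code often tokenizes less efficiently than prose.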

The model ships with two thinking modes: High and Max. Zhipu recommends Max for complex coding tasks. Both are available immediately on all four GLM Coding Plan tiers (Lite, Pro, Max, Team). MIT-licensed open weights are promised "within one week" but haven't dropped yet.

Here's the model lineage, because Zhipu's versioning is confusing:

Model	Context	Training Hardware	License	Notable
GLM-5 (Feb 2026)	200K	100K Huawei Ascend 910B	MIT	Trained entirely without NVIDIA
GLM-5.1 (Mar 2026)	200K	Huawei Ascend	MIT	58.4% SWE-Bench Pro, 94.6% of Opus 4.6
GLM-5.2 (Jun 2026)	1M	TBD	MIT (next week)	No benchmarks published

The predecessor, GLM-5.1, actually beat GPT-5.4 and Claude Opus 4.6 on SWE-Bench Pro (58.4% vs 57.7% and 57.3%, respectively). It scored 1530 Elo on Code Arena, placing third globally. Those are real numbers from independent verification. GLM 5.2 could be better, worse, or the same. Nobody knows, because Zhipu hasn't published a single number for it.

The No-Benchmarks Problem

This is the part that should bother you. Every major model launch in the last two years has come with benchmark numbers. OpenAI publishes them. Anthropic publishes them. Google publishes them. Even smaller labs like Mistral and Cohere publish them. Zhipu launched GLM 5.2 to paying customers with zero performance data.

The official announcement says the model is "superior to prior GLM versions on long-horizon coding." That's a claim, not a number. AICodeKing, who got early access, called it "great at one-shot wonders" and said it fine-tuned a whole local model in 30 minutes. That's an anecdote, not a benchmark.

The HN thread on GLM-5.1 (the predecessor) had some telling comments. One user noted that "one-shot performance is more impressive than its agentic abilities" and flagged "context rot" as a known issue. Another said GLM-5.1 is "closer to Gemini 3.1 and Sonnet-4.6, quite far from Opus." These are community tests on the previous model. With GLM 5.2 claiming five times the context, the context rot problem either got fixed or got five times worse. We just don't know.

Community Reaction

The r/LocalLLaMA thread on GLM 5.2 is a mix of excitement and frustration. Users badly want to test it, but the coding plan service has been unreliable:

"I really want to test this out, but I can't justify paying for their coding plans with such poor service. I have pretty poor experience with their coding plan in the past."

"It either compacts at way below its quoted context, looping, 429 errors, just shuts off with issues."

One user pointed out that "GLM-5.1 is the best local coding model so far, but already much bigger than GLM-4.7." The model itself gets respect. The infrastructure around it doesn't.

The community is currently voting on what matters most: longer context windows, MIT open weights, or maintaining current pricing. Open weights are winning. That tells you where users think the real value is.

Why This Matters

The competitive landscape for coding models is tightening fast. Claude Code, GPT-5.5 Codex, and now GLM 5.2 are all targeting the same developer workflow. The difference is pricing and openness.

GLM-5.1's API pricing was $1.00 per 1M input tokens and $3.20 per 1M output tokens. Claude Opus charges $5.00 and $25.00 respectively. That's a 5x gap on input and roughly 7.8x on output. If GLM 5.2 maintains that gap while genuinely improving long-context coding, it changes the calculus for teams running high-volume coding agents.
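To see what that gap means per agent run, here is a small sketch of the arithmetic. It uses only the per-token prices quoted above; GLM 5.2's own API pricing is not covered here, so the GLM-5.1 figures stand in as an assumption. The `Pricing` class and function names are illustrative.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    """API prices in USD per 1M tokens."""
    input_per_m: float
    output_per_m: float


# Prices as quoted in this article. GLM-5.1 is used as a stand-in,
# since GLM 5.2 API pricing is an unknown here.
GLM_5_1 = Pricing(input_per_m=1.00, output_per_m=3.20)
CLAUDE_OPUS = Pricing(input_per_m=5.00, output_per_m=25.00)


def run_cost(pricing: Pricing, input_tokens: int, output_tokens: int) -> float:
    """USD cost of one request with the given token counts."""
    total = input_tokens * pricing.input_per_m + output_tokens * pricing.output_per_m
    return total / 1_000_000


def cost_ratio(expensive: Pricing, cheap: Pricing,
               input_tokens: int, output_tokens: int) -> float:
    """How many times more the expensive model costs for the same request."""
    return (run_cost(expensive, input_tokens, output_tokens)
            / run_cost(cheap, input_tokens, output_tokens))
```

For a hypothetical run with 800K input tokens and 20K output tokens, `run_cost` comes to about $0.86 on GLM-5.1 pricing and $4.50 on Opus pricing, a ratio of roughly 5.2x. Long-context coding workloads are input-heavy, so they sit near the low end of the gap, not the 7.8x end.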

The MIT license is the real play. When the weights drop next week, anyone can fine-tune, deploy, and commercialize GLM 5.2 without asking permission. In a world where Anthropic just got sanctioned by the US government for foreign use of its top models, an MIT-licensed alternative trained on Huawei chips looks increasingly attractive to non-US developers.

What Surprised Me

The decision to ship without benchmarks is either extremely confident or extremely desperate. If the model is genuinely better than GLM-5.1, Zhipu would want to prove it. The fact that they didn't suggests either that the improvements are modest or that they're betting the 1M context number alone will drive adoption.

I keep thinking about the context rot issue. Scaling context windows to 1M tokens is easy to claim and hard to deliver. The attention mechanisms that make 200K work don't automatically scale to 1M without degradation. Every lab that's pushed context windows this far has hit quality walls. Zhipu's refusal to publish benchmarks might be hiding exactly this problem.
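Anyone with API access can probe for context rot without waiting for official numbers. One common approach is a needle-in-a-haystack test: bury a known fact at different depths in filler of different lengths and check whether the model can still retrieve it. The sketch below is a minimal version under stated assumptions: `ask` is a hypothetical callable that sends a prompt to whatever model you are testing and returns its reply, sizes are measured in characters, not tokens, and the needle, question, and filler are invented for illustration.

```python
from typing import Callable, Iterable

# Illustrative needle and question; any unique, checkable fact works.
NEEDLE = "The deploy token is 7421.\n"
QUESTION = "\nWhat is the deploy token? Answer with the number only."
ANSWER = "7421"


def build_haystack(filler: str, needle: str, total_chars: int, depth: float) -> str:
    """Return about total_chars of repeated filler with needle inserted at
    relative depth (0.0 = start of the context, 1.0 = end)."""
    if not filler:
        raise ValueError("filler must be non-empty")
    if not 0.0 <= depth <= 1.0:
        raise ValueError("depth must be between 0.0 and 1.0")
    body_len = max(total_chars - len(needle), 0)
    body = (filler * (body_len // len(filler) + 1))[:body_len]
    cut = int(body_len * depth)
    return body[:cut] + needle + body[cut:]


def run_probe(ask: Callable[[str], str],
              sizes: Iterable[int],
              depths: Iterable[float],
              filler: str = "def noop():\n    pass\n") -> dict:
    """Map each (size, depth) pair to True if the reply contains the answer.

    `ask` is a hypothetical hook for the model under test.
    """
    depths = list(depths)
    results = {}
    for size in sizes:
        for depth in depths:
            prompt = build_haystack(filler, NEEDLE, size, depth) + QUESTION
            results[(size, depth)] = ANSWER in ask(prompt)
    return results
```

A grid of `False` results clustered at large sizes or middle depths is the signature of context rot. Simple retrieval is also the easiest long-context task, so passing it says little about multi-file reasoning at 1M tokens.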

But here's what I can't dismiss: GLM-5.1 was legitimately competitive with Opus on coding benchmarks, trained entirely on Huawei chips, and released under MIT. That's not marketing. That's engineering. If GLM 5.2 is even 10% better at long-context tasks, the combination of price, openness, and context size makes it the default choice for cost-conscious coding teams.

The real test happens when the MIT weights drop and the community runs independent benchmarks. Until then, GLM 5.2 is a 1M-token promise with no receipts.
