59% SWE-Bench Pro score from a model costing $0.30 per million input tokens

Your coding agent just got a serious price cut. MiniMax dropped M3 on June 1, an open-weight model that scores 59.0% on SWE-Bench Pro, beating GPT-5.5 and Gemini 3.1 Pro. The catch: it's from a Chinese company legally required to cooperate with state intelligence. The other catch: those benchmarks are vendor-run, not independently verified. But the price is real, and the weights are real, and the community is already downloading them.

Architecture

MiniMax M3 uses a Mixture-of-Experts architecture with roughly 428 billion total parameters, but only activates about 23 billion per token. That's the efficiency play: massive capacity with modest inference cost.
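
To put the "modest inference cost" claim in numbers, here is a quick back-of-the-envelope calculation using only the two approximate figures quoted above. The function name is mine, and the result is only as precise as those rounded parameter counts.

```python
# Approximate figures quoted in the article, in billions of parameters.
TOTAL_PARAMS_B = 428
ACTIVE_PARAMS_B = 23


def active_fraction(active_b: float, total_b: float) -> float:
    """Share of the model's weights that take part in one token's forward pass."""
    return active_b / total_b


fraction = active_fraction(ACTIVE_PARAMS_B, TOTAL_PARAMS_B)
print(f"Active per token: {fraction:.1%} of all parameters")
```

That works out to a little over 5% of the weights doing work on any given token. The compute saving does not shrink the memory footprint, though: anyone self-hosting still has to keep all of the roughly 428B parameters loaded.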

The headline innovation is MiniMax Sparse Attention (MSA). Classic attention scales quadratically with context length, which makes million-token windows economically brutal. MSA fixes this with a two-stage process: a lightweight index branch identifies which blocks in the key-value cache are relevant, then the model computes full attention only on those blocks. The math works out to 1/20th the per-token compute at 1M tokens compared to the M2 generation, with input processing roughly 9.7x faster and response generation about 15.6x faster.
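
The two-stage idea is easier to see in code. What follows is a toy, single-query sketch in plain Python, not MiniMax's implementation: the article does not say how the index branch scores blocks, so scoring each block by its mean key is my assumption, and `block_sparse_attention`, `block_size`, and `top_k` are hypothetical names.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def block_sparse_attention(query, keys, values, block_size, top_k):
    """Attend only to the top_k most relevant blocks of the KV cache."""
    # Stage 1 (index branch): one cheap score per block.
    # ASSUMPTION: a block is scored by the dot product of the query with its mean key.
    n_blocks = math.ceil(len(keys) / block_size)
    block_scores = []
    for b in range(n_blocks):
        block = keys[b * block_size:(b + 1) * block_size]
        mean_key = [sum(col) / len(block) for col in zip(*block)]
        block_scores.append((dot(query, mean_key), b))
    selected = sorted(b for _, b in sorted(block_scores, reverse=True)[:top_k])

    # Stage 2: ordinary softmax attention over the real (uncompressed)
    # keys and values, restricted to the selected blocks.
    idx = [i for b in selected
           for i in range(b * block_size, min((b + 1) * block_size, len(keys)))]
    scale = 1 / math.sqrt(len(query))
    weights = softmax([dot(query, keys[i]) * scale for i in idx])
    out = [sum(w * values[i][d] for w, i in zip(weights, idx))
           for d in range(len(values[0]))]
    return out, selected


# Six cached tokens in three blocks of two; the query points along the first axis.
keys = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9], [-1.0, 0.0], [-0.9, 0.0]]
values = [[float(i)] for i in range(len(keys))]
query = [1.0, 0.0]

out, picked = block_sparse_attention(query, keys, values, block_size=2, top_k=1)
print(picked)  # [0]
```

With `top_k` equal to the number of blocks, the function reduces to dense attention. The saving comes from keeping `top_k` fixed while the context grows, so stage 2 touches a bounded number of tokens and only the cheap per-block scoring scales with length.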

Independent researcher Elie Bakouch described it as "block level selection like in CSA but attention is done on the real KV, not in the compressed dimension." That's important: MSA operates on uncompressed key-values, so you don't lose precision at long context the way latent attention approaches do.

Training ran on approximately 100 trillion tokens, with interleaved text and image data from the start (not bolted on later). MiniMax also used a simulator framework that mimics multi-turn developer collaboration, letting the model practice refining requirements across long sessions.

Benchmarks

All scores below are vendor-run on MiniMax's internal infrastructure. Independent verification from Artificial Analysis and LMArena is pending.

Benchmark	M3	Claude Opus 4.8	GPT-5.5	Gemini 3.1 Pro
SWE-Bench Pro	59.0%	69.2%	~57%	~56%
Terminal-Bench 2.1	66.0%	74.6%	—	—
BrowseComp	83.5	—	—	—
OSWorld-Verified	70.0%	83.4%	—	—

A few things jump out. The BrowseComp score of 83.5 beats Claude Opus 4.7's 79.3, which is the most impressive standalone number. SWE-Bench Pro at 59.0% beats GPT-5.5 and Gemini 3.1 Pro, but trails Opus 4.8 by about 10 points. Terminal-Bench and OSWorld show a wider gap against Opus 4.8.

There's also a versioning gotcha. MiniMax benchmarked against Claude Opus 4.7, not the 4.8 that shipped a week before M3. The comparison that matters is M3 vs Opus 4.8, and on SWE-Bench Pro that gap is 59.0% vs 69.2%. M3 is still ahead of GPT-5.5, but not by what the headline number suggests.

Real-World Demonstrations

MiniMax showed M3 running autonomous coding sessions for 24+ hours without human intervention. The demonstrations include:

CUDA kernel optimization: M3 started with a task description and a Triton skeleton that couldn't run, then optimized an FP8 matrix multiplication kernel on NVIDIA Hopper GPUs. GPU utilization went from 7.6% to 71.3% over 147 attempts. That's a roughly 9.4x gain in utilization on one of the most compute-intensive building blocks in model inference.
Paper reproduction: M3 independently reproduced an LLM fine-tuning paper over 12 hours, producing 18 commits and 23 figures with no human guidance.
Self-training: The model synthesized data and trained four base models without human intervention.

The M2.7 predecessor reportedly handled 30-50% of MiniMax's internal RL team's workflow. M3 is positioned to handle significantly more.

Pricing

Plan	Monthly	Tokens Included
Plus	$20	~1.7 billion
Max	$50	~5.1 billion
Ultra	$120	~9.8 billion
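
For a sense of what those plans cost per token, here is a small calculation over the table above. It assumes you burn the whole monthly allowance, which few people will, so treat the results as a floor on the effective price. The function name is mine.

```python
# (plan, monthly price in USD, included tokens in billions) from the table above.
PLANS = [("Plus", 20, 1.7), ("Max", 50, 5.1), ("Ultra", 120, 9.8)]


def usd_per_million_tokens(monthly_usd: float, tokens_billions: float) -> float:
    """Effective price per 1M tokens if the full allowance is used."""
    return monthly_usd / (tokens_billions * 1000)


for name, price, tokens in PLANS:
    print(f"{name}: ${usd_per_million_tokens(price, tokens):.4f} per 1M tokens")
```

On these approximate figures every plan lands at around a cent per million tokens, and Max comes out slightly cheaper per token than Ultra.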

API pricing sits at $0.30/1M input tokens and $1.20/1M output tokens during the launch promo. For comparison, Claude Opus 4.8 costs roughly $15/1M input and $75/1M output. That's a 50x cost difference on input tokens.
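
Here is the same comparison as a tiny calculator. The per-token prices are the ones quoted above (the Opus figures are approximate). The session size of 2M input and 200k output tokens is a made-up example, and `session_cost` is just an illustrative helper.

```python
# USD per 1M tokens as (input, output), as quoted in the article.
PRICES = {
    "MiniMax M3 (launch promo)": (0.30, 1.20),
    "Claude Opus 4.8 (approx.)": (15.00, 75.00),
}


def session_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Total API cost in USD for one session."""
    in_price, out_price = PRICES[model]
    return (input_tokens * in_price + output_tokens * out_price) / 1_000_000


# Hypothetical agentic session: 2M tokens in, 200k tokens out.
for model in PRICES:
    print(f"{model}: ${session_cost(model, 2_000_000, 200_000):.2f}")

input_ratio = 15.00 / 0.30   # 50x on input
output_ratio = 75.00 / 1.20  # 62.5x on output
```

The gap is wider on output tokens (62.5x) than on input tokens (50x), so output-heavy workloads see the larger relative saving.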

OpenRouter lists the model with a 50% off promo. Fireworks.ai offers a free GPU-accelerated endpoint on build.nvidia.com.

The Elephant in the Room

MiniMax is a Shanghai-based company. China's 2017 National Intelligence Law requires companies to "support, assist, and cooperate" with state intelligence work. That obligation applies to every prompt processed through the API, regardless of where the user is located.

The U.S. House Committee on Homeland Security is currently investigating MiniMax over national security risks. TechTimes flagged that developers should treat M3 as a high-risk choice for sensitive or regulated enterprise workloads.

There's also a copyright angle. Disney, Universal, and Warner Bros. Discovery have filed a lawsuit against MiniMax over training data practices.

The model weights did ship on Hugging Face on June 12 (roughly 428B parameters, ~23B activated), and the technical report came out the same day. NVIDIA congratulated the team and offered the free endpoint. The open-weight community is already downloading and testing.

Community Reaction

The LocalLLaMA response is mixed. One user reported that M3 "was not able to solve problems in both python nor java" and that "the new projects took an insane amount of retry by m3 to make them work." Another noted it's "not yet a very reliable model" compared to MiMo V2.5.

A Medium evaluation found the agentic workflow results "complicated" but acknowledged the cost advantage. The model is clearly capable in specific scenarios (the kernel optimization demo is genuinely impressive) but reliability in general-purpose coding tasks is still being validated.

The broader community concern is the vendor-run benchmark problem. Every score comes from MiniMax's own infrastructure with their own scaffolding. Artificial Analysis and LMArena haven't published independent numbers yet. History teaches us that vendor-reported benchmarks tend to look better than independent runs.

What Surprised Me

The kernel optimization result is the most interesting data point, and it's not the one in the headlines. Taking GPU utilization from 7.6% to 71.3% on an FP8 matrix multiplication kernel means M3 can do real engineering work, not just generate boilerplate. That's a meaningful capability gap over models that score well on SWE-Bench but can't optimize compute kernels.

The pricing math is also hard to ignore. At $0.30/1M input tokens, you could run an entire agentic coding session for what Opus 4.8 costs on a single file. For teams doing high-volume code generation or long-context analysis, that cost difference compounds fast.

But the reliability concerns are real. If M3 needs "an insane amount of retry" on basic Python and Java tasks, the cost advantage evaporates. You're paying less per token but burning more tokens to get the same result. The real comparison isn't price-per-token, it's price-per-working-solution.
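
Price-per-working-solution is easy to estimate once you have a measured success rate. The sketch below assumes every attempt costs the same and succeeds independently with a fixed probability, so the expected number of attempts is 1 divided by the success rate. All the numbers are invented for illustration and are not measurements of M3 or Opus.

```python
def cost_per_working_solution(cost_per_attempt: float, success_rate: float) -> float:
    """Expected spend until the first success, with independent attempts."""
    if not 0 < success_rate <= 1:
        raise ValueError("success_rate must be in (0, 1]")
    return cost_per_attempt / success_rate


def break_even_success_rate(cheap_cost: float, pricey_cost: float,
                            pricey_rate: float) -> float:
    """Success rate at which the cheap model costs the same per solution."""
    return pricey_rate * cheap_cost / pricey_cost


# Hypothetical inputs: a 50x price gap per attempt, very different reliability.
cheap = cost_per_working_solution(cost_per_attempt=0.10, success_rate=0.20)
pricey = cost_per_working_solution(cost_per_attempt=5.00, success_rate=0.90)
parity = break_even_success_rate(0.10, 5.00, 0.90)

print(f"cheap model:  ${cheap:.2f} per working solution")
print(f"pricey model: ${pricey:.2f} per working solution")
print(f"break-even success rate for the cheap model: {parity:.1%}")
```

This only counts token spend. It leaves out the engineer time spent reviewing failed attempts and the extra tokens that retries tend to consume as context grows, and those are the places where a retry-heavy model can lose its edge.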

The Chinese jurisdiction question is also genuinely complex. For personal projects and open-source work, it probably doesn't matter. For enterprise code, government contracts, or anything touching regulated data, it's a dealbreaker. The fact that the U.S. House is investigating suggests this isn't going away.

My take: M3 is the most interesting open-weight release of the month, maybe the year. The architecture is genuinely novel, the price is aggressive, and the weights are real. But the gap between "impressive demo" and "reliable daily driver" is still wide. Watch for independent benchmarks in the next two weeks. That's when we'll know if 59% on SWE-Bench Pro holds up outside MiniMax's own sandbox.
