Something unusual is happening in the Chinese AI landscape. A Shanghai-based startup called MiniMax just released M3, and it's not trying to beat GPT-5.5 on MMLU or compete with Claude on creative writing. Instead, it's targeting a specific, hard problem: making open-weight models useful for real software engineering workflows.

And the numbers suggest it might have pulled it off.

MiniMax M3 is the first open-weight model to combine a 1-million-token context window, native multimodality, and frontier-level coding performance. On SWE-Bench Pro, it scores 59.0%, ahead of both GPT-5.5 and Gemini 3.1 Pro, though still behind Claude Opus 4.7. On Terminal-Bench 2.1, it hits 66.0%. And it does all of this while consuming roughly 1/20th of the per-token compute of its predecessor at 1M context.


Don't Confuse It With the Earlier Models

A quick note on naming, because MiniMax's versioning has generated real confusion. The company's model lineage goes:

  • MiniMax-01 (January 2025): The original, with MiniMax-Text-01 and MiniMax-VL-01. Used Lightning Attention hybrid architecture. 456B parameters, 4M context.
  • MiniMax-M1: An MoE-based iteration that shifted architecture significantly.
  • MiniMax-M2 / M2.1 / M2.5 / M2.7: A rapid iteration cycle through 2025, with M2.7 reportedly involved in its own development.
  • MiniMax M3 (June 2026): The current model. Not a linear update to MiniMax-01. This is a completely new architecture with MiniMax Sparse Attention.

The "3" in M3 refers to three frontier capabilities (coding, context, multimodality) in a single model, not version 3 of the same architecture. The benchmarks below are for M3 specifically, not for any earlier model.


The Architecture: MiniMax Sparse Attention

The headline technical contribution in M3 is MSA (MiniMax Sparse Attention), a new attention mechanism designed to solve the quadratic scaling problem that makes 1M-token contexts impractical for dense attention.
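
To make the quadratic scaling problem concrete, here is a small counting sketch. It assumes standard dense causal attention, where each token scores itself and every earlier token, and it counts pairs for a single attention head in a single layer. The function name is illustrative and has nothing to do with MiniMax's code.

```python
def causal_attention_pairs(n_tokens: int) -> int:
    """Query-key pairs scored by dense causal attention over n_tokens.

    Token i attends to itself and the i - 1 tokens before it, so the
    total is 1 + 2 + ... + n = n * (n + 1) / 2.
    """
    return n_tokens * (n_tokens + 1) // 2


for n in (8_000, 128_000, 1_000_000):
    print(f"{n:>9,} tokens -> {causal_attention_pairs(n):.3e} pairs")

# Growing the context 125x (8K -> 1M) grows the pair count by roughly 125 ** 2.
growth = causal_attention_pairs(1_000_000) / causal_attention_pairs(8_000)
print(f"growth factor: {growth:,.0f}x")
```

The pair count grows with the square of the context length, which is the cost a sparse mechanism tries to avoid by scoring only a subset of keys per query.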

Instead of comparing every query against every key-value pair, MSA partitions the KV cache into blocks and attends only to the blocks relevant to the current query. The key innovation is contiguous memory access: each selected block is read sequentially rather than by jumping through memory, which MiniMax says yields arithmetic intensity 4x higher than Flash-Sparse-Attention.
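
The general idea of block-level sparse attention can be sketched in a few lines of plain Python. This is not MSA itself: the technical report is not yet public, so the block scoring rule (dot product with each block's mean key), the `top_k` selection, and every function name below are assumptions chosen for illustration.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def block_sparse_attention(query, keys, values, block_size, top_k):
    """Attend over only the top_k most relevant contiguous KV blocks.

    Block relevance is the dot product of the query with each block's
    mean key. That rule is an illustrative assumption, not MSA's.
    """
    starts = list(range(0, len(keys), block_size))

    # 1. Score each block cheaply using one summary vector per block.
    block_scores = []
    for start in starts:
        block = keys[start:start + block_size]
        mean_key = [sum(col) / len(block) for col in zip(*block)]
        block_scores.append(dot(query, mean_key))

    # 2. Keep the top_k blocks, then restore cache order so each block
    #    is read as one contiguous slice.
    ranked = sorted(range(len(starts)), key=lambda i: block_scores[i], reverse=True)
    chosen = sorted(ranked[:top_k])

    # 3. Run ordinary softmax attention over the selected slices only.
    sel_keys, sel_values = [], []
    for i in chosen:
        start = starts[i]
        sel_keys.extend(keys[start:start + block_size])
        sel_values.extend(values[start:start + block_size])

    scale = 1.0 / math.sqrt(len(query))
    weights = softmax([dot(query, k) * scale for k in sel_keys])
    dim = len(sel_values[0])
    output = [sum(w * v[d] for w, v in zip(weights, sel_values)) for d in range(dim)]
    return output, [starts[i] for i in chosen]


# Toy cache: three blocks of four keys, each block pointing in one direction.
keys = [[1.0, 0.0]] * 4 + [[0.0, 1.0]] * 4 + [[-1.0, 0.0]] * 4
values = [[float(i)] for i in range(12)]
output, block_starts = block_sparse_attention(
    [0.0, 1.0], keys, values, block_size=4, top_k=1
)
```

With `top_k=1`, the query touches 4 of the 12 cached keys plus 3 block summaries. The saving grows with context length because the number of selected blocks stays fixed while the cache keeps growing.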

The practical impact is dramatic:

| Metric | Improvement vs. Previous Gen |
| --- | --- |
| Per-token compute at 1M context | 1/20th |
| Prefilling speed | >9x faster |
| Decoding speed | >15x faster |
| Arithmetic intensity vs. Flash-Sparse-Attention | 4x higher |

This isn't just an academic improvement. It means a model with 1M context is actually usable in production without burning through your inference budget. The long-context pricing kicks in only above 512K tokens, and even then it's competitive with proprietary alternatives.


Benchmark Performance

MiniMax M3 was evaluated across software engineering, agentic tasks, and autonomous capabilities. Here's how it stacks up:

Coding & Software Engineering

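| Benchmark | MiniMax M3 | GPT-5.5 | Claude Opus 4.7 | Gemini 3.1 Pro |
| --- | --- | --- | --- | --- |
| SWE-Bench Pro | 59.0% | ~57% | ~61% | ~56% |
| Terminal-Bench 2.1 | 66.0% | | | |
| SWE-fficiency | 34.8% | | | |
| KernelBench Hard | 28.8% | | | |
| MCP Atlas | 74.2% | | | |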

M3 beats GPT-5.5 and Gemini 3.1 Pro on SWE-Bench Pro, trailing only Claude Opus 4.7. The Terminal-Bench score (66.0%) is particularly strong for an open-weight model; it indicates real capability in autonomous terminal-based workflows.

Agentic & Autonomous

On BenchLM's provisional leaderboard, M3 ranks #29 overall (76/100) and #13 in Agentic tasks with an average score of 82.4. Its BrowseComp score of 83.5 surpasses Opus 4.7 (79.3), suggesting strong autonomous web research capability.

Multimodal

M3's multimodal performance is where it's weakest, ranking #69 in multimodal and grounded tasks with an average score of 47.3. This is acceptable for an open-weight model but clearly behind specialized multimodal models like Gemini 3.1 Pro or GPT-5.5 Vision.


Real-World Autonomous Performance

Benchmarks only show so much. The real question: can the model actually work autonomously on hard problems? MiniMax ran three demonstrations that are worth paying attention to:

Paper Reproduction: M3 autonomously reproduced an ICLR 2025 award-winning paper, running for 12 hours, generating 18 commits, and producing 23 figures without human intervention.

CUDA Kernel Optimization: Given a Triton skeleton on Nvidia Hopper GPUs with no reference implementation, M3 iterated autonomously for 24 hours and 145 attempts, improving FP8 hardware utilization from 7.6% to 71.3%, a 9.4x improvement.

PostTrainBench: M3 successfully trained four base models autonomously, demonstrating that it can handle ML training workflows as multi-step agentic tasks.

These aren't cherry-picked one-shots. The model was run multiple times with consistent results, and MiniMax imposed strict constraints to prevent benchmark hacking: no external information access, monitored Bash commands, and sandboxed execution.
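
As a quick sanity check on the CUDA kernel figures above, the reported numbers can be re-derived directly. The only inputs are the values quoted in that demonstration; the variable names are illustrative.

```python
baseline_utilization = 7.6   # percent FP8 utilization before optimization
final_utilization = 71.3     # percent FP8 utilization after optimization
attempts = 145
hours = 24

utilization_gain = final_utilization / baseline_utilization
minutes_per_attempt = hours * 60 / attempts

print(f"utilization gain: {utilization_gain:.1f}x")
print(f"average time per attempt: {minutes_per_attempt:.1f} minutes")
```

The ratio rounds to the quoted 9.4x, and the attempt count works out to roughly one iteration every ten minutes over the 24-hour run.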


Pricing and Availability

M3 is available now via the MiniMax API and through OpenRouter. The weights and technical report are expected within 10 days of the announcement.

| Plan | Price | Token Allocation |
| --- | --- | --- |
| Plus | $20/month | ~1.7B tokens |
| Max | $50/month | ~5.1B tokens |
| Ultra | $120/month | ~9.8B tokens |
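
To compare the plans on equal terms, the listed prices can be converted into an effective cost per million tokens. This sketch assumes the allocations are monthly and takes the approximate figures at face value, so treat the results as rough.

```python
# (monthly price in USD, approximate token allocation) per plan, as listed above
plans = {
    "Plus": (20, 1.7e9),
    "Max": (50, 5.1e9),
    "Ultra": (120, 9.8e9),
}


def dollars_per_million_tokens(price_usd: float, tokens: float) -> float:
    return price_usd / (tokens / 1e6)


effective = {
    name: dollars_per_million_tokens(price, tokens)
    for name, (price, tokens) in plans.items()
}

for name, rate in effective.items():
    print(f"{name:<5} ${rate:.4f} per million tokens")

best_value = min(effective, key=effective.get)
print(f"lowest effective rate: {best_value}")
```

On these figures, every plan works out to roughly a cent per million tokens, and the Max tier has the lowest effective rate rather than Ultra.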

API pricing uses standard rates for contexts up to 512K tokens, with higher long-context rates above that threshold. A thinking mode can be toggled on for complex reasoning tasks or off for low-latency conversational use.
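
For budgeting, the two-tier structure can be modeled as a simple threshold function. Everything numeric here is a placeholder: the rates are invented for illustration, "512K" is taken to mean 512,000 tokens, and the sketch assumes a request over the threshold is billed entirely at the long-context rate. None of these details are confirmed by the announcement.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RateCard:
    """Hypothetical input-token rates in USD per million tokens."""
    standard_per_million: float
    long_context_per_million: float
    threshold_tokens: int = 512_000  # assumed meaning of "512K"


def input_cost(prompt_tokens: int, rates: RateCard) -> float:
    """Estimate the input cost of one request under the assumed billing rule."""
    if prompt_tokens > rates.threshold_tokens:
        rate = rates.long_context_per_million
    else:
        rate = rates.standard_per_million
    return prompt_tokens / 1_000_000 * rate


# Placeholder rates only; substitute the published price list before relying on this.
placeholder = RateCard(standard_per_million=1.0, long_context_per_million=2.0)

for tokens in (100_000, 512_000, 600_000, 1_000_000):
    print(f"{tokens:>9,} tokens -> ${input_cost(tokens, placeholder):.3f}")
```

If billing turns out to apply the higher rate only to tokens above the threshold, the function would need to split the prompt into two segments instead.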


Business Context

This release comes at a defining moment for MiniMax. The company reported 2025 revenue of approximately $79 million, up 159% year-over-year, and an ARR exceeding $150 million as of February 2026. Overseas revenue accounts for 73% of total sales, making MiniMax one of the most globally successful Chinese AI companies.

The company is preparing for an IPO on Shanghai's Star Market, complementing its existing Hong Kong listing. Recent partnerships include integration with Ant Group's Alipay for payment and subscription infrastructure.


What This Means

The open-weight ecosystem has been catching up to proprietary models all year, but M3 marks an inflection point in two specific dimensions.

First, it makes a credible case that 1M-token context is economically viable without proprietary infrastructure. The sparse attention mechanism is the key enabler, and once the weights are released, anyone will be able to verify the efficiency claims rather than taking them on faith.

Second, it establishes a clear trajectory for Chinese AI companies. DeepSeek competes on reasoning benchmarks. Qwen dominates the open-weight ecosystem. MiniMax is staking its claim on agentic coding workflows, a space that directly monetizes through developer tools and API usage.

The immediate question: can the weights and technical report deliver on the architectural promises? The benchmarks are impressive, but the real test comes when the open-source community can inspect MSA and run the model on their own infrastructure. That happens in roughly ten days.

For now, MiniMax M3 is the most interesting open-weight release of the quarter, not because it's the best at everything, but because it's the first to prove that open models don't have to trade off context, coding, and cost.