Most multi-agent AI systems are held together with duct tape and LangChain scripts. You hardcode which model handles what, pray the prompt routing works, and pray harder when a new model drops and breaks everything. Sakana AI just shipped something different: a 7B model that figures all of that out by itself.

The twist? It's not just routing. It's orchestrating. And on hard benchmarks, it beats the models it's supposed to be coordinating.


What Fugu Actually Is

Sakana Fugu is a 7-billion-parameter language model trained via reinforcement learning to do one thing well: tell bigger, smarter models what to do. You send a single API call to one endpoint. Fugu reads the task, decides which frontier models in its pool should handle which parts, generates natural language instructions for each, and stitches the results together.

It's based on two ICLR 2026 papers from Sakana AI's research team. TRINITY handles evolved coordinator logic for role assignment. The Conductor is the RL-trained model that generates the actual coordination strategies. Together, they form the backbone of what Sakana calls "dynamic orchestration."

The key difference from tools like LangChain or CrewAI: Fugu doesn't follow hardcoded pipelines. It learns coordination strategies through trial and error during training. For each new input, it generates a fresh workflow graph, deciding in real time whether a problem needs one model, a chain of models, or a recursive loop where it calls itself to check its own work.

The Benchmarks Are Hard to Ignore

Here's where the 7B model gets interesting. On SWE-Bench Pro, a real-world software engineering benchmark, Fugu Ultra scores 73.7. That's ahead of Opus 4.8 at 69.2 and GPT-5.5 at 58.6. On GPQA-Diamond, graduate-level science questions, it hits 95.5. On LiveCodeBench, 93.2.

The RL Conductor paper (the research behind Fugu) reports even more granular numbers. On AIME25, a math competition benchmark, the conductor hits 93.3%. On GPQA-Diamond, 87.5%. On LiveCodeBench, 83.93%. Average across all tasks: 77.27%.

But the real story isn't the scores. It's the efficiency. The Conductor uses an average of 1,820 tokens per question. Mixture-of-Agents, a popular multi-agent framework, uses 11,203. That's a 6x reduction in token consumption. The average workflow length is just three steps. Fugu figures out that most problems don't need a ten-step pipeline, they need one or two smart delegations.

"The depth of recursion becomes a tunable compute axis at inference time, requiring no retraining." - Sakana AI, Fugu Beta Announcement

That recursion trick is the sleeper feature. When Fugu reads its own prior output and decides it's not good enough, it spins up a corrective workflow automatically. No human intervention, no predefined retry logic. The model learns during training that checking its own work and fixing mistakes is worth the extra tokens.

Two Tiers, One API

Fugu ships in two variants. Fugu Mini is optimized for latency, handling everyday coding and chat tasks. Fugu Ultra is the full orchestration system, designed for complex multi-step problems like cybersecurity analysis, patent research, and AI research itself.

Both are accessible through a standard OpenAI-compatible API. You can swap Fugu into existing workflows with minimal code changes. The system handles model selection, delegation, verification, and synthesis server-side. Your code never sees the complexity.

Pricing for Fugu Ultra: $5 input and $30 output per million tokens. High-context requests above 272K tokens cost $10 and $45 respectively. Subscription tiers run $20/month for standard, $100 for pro, $200 for max.

The Sovereignty Pitch (and Its Asterisks)

Sakana's marketing leans hard into a geopolitical angle. Fugu delivers "frontier capability without the risk of export controls." The system routes around provider lock-in by making the model pool swappable. If one provider gets restricted, Fugu just switches to another.

It's a compelling argument on paper. But there are three problems worth flagging.

First, you're still dependent on the underlying models. You've swapped vendor lock-in for orchestrator lock-in. If Sakana changes its pricing, model pool, or terms, you're just as stuck as before.

Second, Fugu can't access the very models it's benchmarked against. It doesn't route to Anthropic's Fable 5 or Mythos. It uses whatever's in its pool, which may or may not include the frontier models you actually want.

Third, Fugu is currently unavailable in the EU andEEA due to ongoing GDPR and compliance work. For a product pitched at sovereignty and compliance, that's an awkward gap.

The People Behind It

Sakana AI is based in Tokyo and was co-founded by a co-author of the original Transformer paper. The research team includes people who've worked at Meta FAIR, Google Research, and Microsoft Research. Imanol Schlag, who co-leads Apertus (the Swiss sovereign AI model), has a different approach to the same problem, but the fact that both projects are gaining traction signals that the "one model to rule them all" era is ending.

The RL Conductor paper was accepted at ICLR 2026. The team's prior work on fast weight programmers and neural architecture innovations feeds directly into how Fugu handles dynamic task decomposition.


So What

The 7B model beating GPT-5 at its own game is the headline, but the real signal is efficiency. 1,820 tokens versus 11,203. That's not a marginal improvement. That's an entirely different cost structure for multi-agent systems.

I keep coming back to the self-recursion feature. Most agent frameworks have hardcoded retry logic: if the output fails validation, try again with a different prompt. Fugu learns during training when to retry and how to fix its own mistakes. That's closer to how senior engineers actually work, checking their own output before shipping.

The sovereignty argument is clever but premature. Being "not dependent on any single provider" sounds great until you realize you're dependent on Sakana's model pool selection and pricing. The real test will be whether Fugu can maintain its benchmark numbers when the underlying models in its pool change.

There's something genuinely exciting about a 7B model that doesn't try to be smart itself but instead learns to make other models work together better. That might be the most practical application of AI coordination we've seen so far.

Sources