Most coding agents today are the same thing: a strong model bolted onto a hand-built orchestration harness. The harness decides how to break tasks down, when to retry, which tools to call, and how to structure the rollout. It works, but it's a bottleneck. The model is only as good as the scaffold someone designed for it.

DeepReinforce's Ornith-1.0 throws that away. Instead of a fixed scaffold, the model learns to write its own. During reinforcement learning, it proposes a task-specific harness, executes a solution using that harness, and gets rewarded on both the quality of the scaffold and the final output. The harness and the solution co-evolve.

The result is a family of four open-source models, MIT-licensed, ranging from a 9B dense model that fits on a single GPU to a 397B mixture-of-experts flagship. And the flagship isn't just "good for open-source": it scores 82.4 on SWE-Bench Verified and 77.5 on Terminal-Bench 2.1, both above Claude Opus 4.7.


The self-scaffolding trick, in plain English

Think of a coding agent like a contractor. You can give them a detailed blueprint (the scaffold), or you can let them figure out the best approach for each job. Ornith-1.0 does the latter, and it does it during training.

In each RL step, the process works in two stages. First, the model reads the task and proposes a refined scaffold, essentially deciding for itself how to orchestrate the solution. Then it generates a solution using that scaffold. Rewards flow back to both stages. The model doesn't just learn to write better code; it learns to write better processes for writing code.
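The two-stage step can be sketched in a few lines of Python. This is a minimal illustration, not DeepReinforce's implementation: every name here (`Rollout`, `two_stage_step`, the `toy_*` stand-ins) is hypothetical, and how the two rewards are actually combined during training is not specified in the description above, so the sketch simply keeps them separate.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Rollout:
    scaffold: str           # stage 1 output: the model's own orchestration plan
    solution: str           # stage 2 output: the solution produced under that plan
    scaffold_reward: float  # credit assigned to stage 1
    solution_reward: float  # credit assigned to stage 2


def two_stage_step(
    task: str,
    propose_scaffold: Callable[[str], str],
    solve: Callable[[str, str], str],
    score_scaffold: Callable[[str, str], float],
    score_solution: Callable[[str, str], float],
) -> Rollout:
    """One rollout: propose a scaffold, solve with it, score both stages."""
    scaffold = propose_scaffold(task)   # stage 1
    solution = solve(task, scaffold)    # stage 2, conditioned on stage 1
    return Rollout(
        scaffold=scaffold,
        solution=solution,
        scaffold_reward=score_scaffold(task, scaffold),
        solution_reward=score_solution(task, solution),
    )


# Toy stand-ins for the policy and the reward functions (assumptions only).
def toy_propose(task: str) -> str:
    return f"plan: reproduce, patch, test [{task}]"


def toy_solve(task: str, scaffold: str) -> str:
    return f"patch for {task} via {scaffold}"


def toy_score_scaffold(task: str, scaffold: str) -> float:
    return 1.0 if "test" in scaffold else 0.0


def toy_score_solution(task: str, solution: str) -> float:
    return 1.0 if solution.startswith("patch") else 0.0


rollout = two_stage_step(
    "fix failing parser",
    toy_propose, toy_solve, toy_score_scaffold, toy_score_solution,
)
```

The point of the structure is that the scaffold is an output of the policy rather than a constant, so it receives a learning signal of its own.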

This is a meaningful departure from how most RL-trained coding models work. Typically, the harness (OpenHands, Harbor, Claude Code's toolchain) is frozen, and the model optimizes within those constraints. Ornith-1.0 optimizes the constraints themselves.

The technical paper describes a pipeline-RL strategy with staleness weighting to handle the long rollouts this approach demands. Older tokens in multi-hour trajectories get exponentially downweighted, so gradients computed from stale tokens don't corrupt training.
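To make the idea of exponential downweighting concrete, here is a small sketch. It assumes staleness is measured as the number of policy updates since a token was generated and that the weight is `decay ** staleness`; both the measure and the `decay` value are illustrative assumptions, not details taken from the paper.

```python
def staleness_weights(token_versions, current_version, decay=0.9):
    """Weight each token by decay ** (policy updates since it was sampled).

    token_versions: policy version that generated each token (hypothetical bookkeeping).
    current_version: policy version being updated now.
    """
    return [decay ** max(0, current_version - v) for v in token_versions]


def weighted_loss(token_losses, weights):
    """Weighted average of per-token losses; stale tokens contribute less."""
    total = sum(weights)
    return sum(loss * w for loss, w in zip(token_losses, weights)) / total


# Tokens sampled by policy versions 3, 4, and 5; the learner is at version 5.
weights = staleness_weights([3, 4, 5], current_version=5, decay=0.5)
loss = weighted_loss([2.0, 1.0, 0.5], weights)
```

With `decay=0.5`, a token that is two updates old counts a quarter as much as a fresh one, which is the kind of falloff that keeps multi-hour rollouts usable without letting their oldest tokens dominate the update.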


The numbers: 35B beats 397B-class competitors

The flagship 397B MoE gets the headlines, but the 35B MoE is the most interesting model in the family. It scores 64.2 on Terminal-Bench 2.1 and 75.6 on SWE-Bench Verified. For context, Qwen 3.5-397B, a model with over 10x the total parameters, scores 53.5 on the same Terminal-Bench evaluation. That's a 10.7-point gap in favor of the smaller model, or about 20% in relative terms, and it suggests the self-scaffolding approach is doing real work, not just scaling its way to competitive scores.
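For readers who want to check the comparison, the arithmetic is short. The scores and model sizes are the ones quoted above; the variable names are just for illustration.

```python
ornith_35b_tb = 64.2   # Ornith-1.0 35B MoE, Terminal-Bench 2.1
qwen_397b_tb = 53.5    # Qwen 3.5-397B, same evaluation

point_gap = ornith_35b_tb - qwen_397b_tb   # absolute gap in benchmark points
relative_gap = point_gap / qwen_397b_tb    # gap relative to the larger model's score
size_ratio = 397 / 35                      # nominal parameter counts from the model names

print(f"{point_gap:.1f} points, {relative_gap:.0%} relative, {size_ratio:.1f}x size")
```

The distinction matters because a gap quoted as a percentage can mean percentage points or a relative improvement, and the two differ by a factor of about two here.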

At the bottom of the lineup, the 9B dense model hits 43.1 on Terminal-Bench 2.1 and 69.4 on SWE-Bench Verified, matching the performance of Gemma 4-31B and Qwen 3.6-35B. A 9B model running on a single 80GB GPU performing within spitting distance of models three to four times its size is the kind of efficiency story that actually matters for deployment.

The full benchmark picture for the 397B flagship: 82.4 SWE-Bench Verified, 77.5 Terminal-Bench 2.1, 62.2 SWE-Bench Pro, 78.9 SWE-Bench Multilingual. For reference, Claude Opus 4.8 scores 87.6 and 85.0 on the first two benchmarks respectively. Ornith closes most of the gap but doesn't quite reach the current frontier ceiling.

How it stops itself from cheating

Letting a model write its own scaffolding creates an obvious attack surface. What if the model learns to game the verifier instead of actually solving the problem? DeepReinforce addresses this with three defense layers.

The first is a fixed trust boundary: the environment, tool surfaces, and test isolation are immutable and outside the model's reach. The model can propose scaffolds, but it can't modify the evaluation infrastructure. The second is a deterministic monitor that flags and zero-rewards attempts to access unauthorized paths or modify verification scripts. These trajectories are excluded from advantage computation entirely. The third is a frozen LLM judge that acts as a final veto on top of the verifier, catching intent-level gaming that deterministic checks might miss.
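A rough sketch of how the three layers could gate a reward is shown below. Everything in it is an assumption made for illustration: the protected path prefixes, the `Trajectory` fields, the boolean stand-in for the LLM judge, and the mean-reward baseline used for advantages are not taken from DeepReinforce's description.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

# Layer 1 (trust boundary): these locations live outside anything the model may edit.
# The prefixes are hypothetical.
PROTECTED_PREFIXES = ("/verifier/", "/tests/")


@dataclass
class Trajectory:
    touched_paths: List[str]  # paths the rollout tried to read or write
    verifier_passed: bool     # outcome of the fixed verifier
    judge_approves: bool      # verdict of the frozen LLM judge, stubbed as a bool


def monitor_flags(traj: Trajectory) -> bool:
    """Layer 2: deterministic check for access to protected paths."""
    return any(p.startswith(PROTECTED_PREFIXES) for p in traj.touched_paths)


def gated_reward(traj: Trajectory) -> Tuple[float, bool]:
    """Return (reward, include_in_advantage)."""
    if monitor_flags(traj):
        return 0.0, False  # zero reward and dropped from the advantage computation
    # Layer 3: the judge can veto a pass reported by the verifier.
    passed = traj.verifier_passed and traj.judge_approves
    return (1.0 if passed else 0.0), True


def advantages(trajs: List[Trajectory]) -> List[Optional[float]]:
    """Mean-baseline advantages over unflagged trajectories; None for flagged ones."""
    scored = [gated_reward(t) for t in trajs]
    kept = [r for r, keep in scored if keep]
    baseline = sum(kept) / len(kept) if kept else 0.0
    return [r - baseline if keep else None for r, keep in scored]


batch = [
    Trajectory(["/workspace/src/parser.py"], verifier_passed=True, judge_approves=True),
    Trajectory(["/workspace/src/parser.py"], verifier_passed=False, judge_approves=True),
    Trajectory(["/tests/test_parser.py"], verifier_passed=True, judge_approves=True),
]
result = advantages(batch)
```

In this toy batch the third trajectory passes the verifier only after touching a protected path, so it is flagged and left out of the baseline rather than being treated as a success.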

It's not bulletproof (no anti-reward-hacking system is), but it's a thoughtful layered defense. The fact that DeepReinforce published these details openly, rather than burying them, suggests genuine confidence in the approach.

What this means for the open-source coding agent space

The most interesting thing about Ornith-1.0 isn't the benchmark scores. It's the architectural bet: that the scaffold should be learned, not designed. Every major coding agent today (Claude Code, Cursor, Copilot Workspace) uses a human-written orchestration layer. Ornith-1.0's results suggest that approach might be leaving performance on the table.

The 35B MoE outperforming Qwen 3.5-397B is the strongest evidence. The self-scaffolding approach isn't just a curiosity. It's producing measurable efficiency gains at inference time. If you can get better performance with a 35B model than with a 397B model because the smaller model has learned to orchestrate itself more effectively, that changes the economics of running coding agents at scale.

The models are available on Hugging Face under MIT, with serving support for vLLM, SGLang, and Transformers. The 9B model is genuinely edge-deployable: it runs on a single GPU and has a 262K context window. The 35B MoE is the sweet spot for teams that want strong performance without the infrastructure overhead of a 397B flagship.
