CoT Boosts Agent Performance Up to 10x, But RL Still Wins Planning

A new benchmark just put RL agents, LLMs, vision models, and hybrid systems on the same playing field. The results are not what the LLM crowd expected, and they are not what the RL crowd expected either.

Agentick evaluates agents across 37 tasks spanning navigation, planning, reasoning, memory, generalization, and multi-agent coordination. Four difficulty levels per task. Over 90,000 episodes across 27 different agent configurations. The benchmark ships with pre-built training datasets, a coding API, and oracle reference policies so anyone can reproduce the results.

What Agentick Actually Tests

Most agent benchmarks measure one thing really well and call it done. WebArena tests web navigation. SWE-bench tests code fixes. Agentick decomposes agent capability into six dimensions and evaluates each independently. Navigation tests spatial reasoning and pathfinding. Planning tests multi-step lookahead and resource allocation. Reasoning tests logical inference and pattern matching. Memory tests information retention over long horizons. Generalization tests few-shot rule inference. Multi-agent tests coordination with and against scripted opponents.

The scoring uses Oracle-Normalized Score. Random baseline is zero. Oracle upper bound is one. Everything in between tells you how close an agent gets to the best possible performance on that task. The benchmark supports five observation modalities: ASCII, natural language, structured dictionaries, isometric pixel grids, and raw numpy arrays. This matters because some agent architectures simply cannot process certain input formats.
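The normalization described above can be sketched in a few lines. This is a minimal sketch that assumes a plain linear rescaling between the random baseline and the oracle; the function name, parameter names, and the raw returns are hypothetical and are not taken from the Agentick codebase.

```python
def oracle_normalized_score(agent_return: float,
                            random_return: float,
                            oracle_return: float) -> float:
    """Rescale a raw return so the random baseline maps to 0.0 and the oracle to 1.0.

    Assumption: the normalization is linear. The article only states the two endpoints.
    """
    span = oracle_return - random_return
    if span == 0:
        raise ValueError("oracle and random baselines must differ")
    return (agent_return - random_return) / span


# Hypothetical raw returns for a single task (illustrative numbers only).
print(oracle_normalized_score(agent_return=4.0, random_return=2.0, oracle_return=10.0))  # 0.25
```

Under this reading, a score of 0.309 would mean an agent closes roughly 31 percent of the gap between random play and the oracle, averaged over tasks.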

The Results Nobody Expected

GPT-5 mini takes the overall crown at 0.309 ONS. But scratch the surface and the story gets complicated. PPO, a reinforcement learning algorithm from an era people thought was over, dominates Planning at 0.402 and Multi-Agent at 0.432. The model that wins the aggregate score loses two of the six categories to a method that does not use a language model at all.

The reasoning harness is the real headline. Adding Chain-of-Thought prompting to LLM agents multiplies their scores by a factor of 3 to 10. Not a small bump. Not a single-digit percentage improvement. Three to ten times. An LLM without CoT prompting scores barely above random on planning tasks. Add the reasoning harness and it becomes competitive with purpose-built agents, even though PPO still holds the top Planning score.

Then there is the observation modality finding. ASCII inputs consistently outperform natural language for spatial reasoning. Compressed, token-efficient representations beat verbose descriptions when the task involves understanding layouts, positions, and movement. LLMs process fewer tokens and make better decisions. That is an uncomfortable result for anyone building agents that feed full environment descriptions into a context window.
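To make the size difference concrete, here is a small sketch that renders the same grid both ways. The grid, the symbol legend, and the sentence template are all hypothetical, and character count is used only as a rough stand-in for token count; real tokenizers will give different numbers.

```python
# Hypothetical 4x4 grid: '#' wall, '.' floor, 'A' agent, 'G' goal.
GRID = [
    "####",
    "#A.#",
    "#.G#",
    "####",
]

NAMES = {"#": "a wall", ".": "an empty floor tile", "A": "the agent", "G": "the goal"}


def to_ascii(grid):
    """Compact observation: one character per cell, one line per row."""
    return "\n".join(grid)


def to_natural_language(grid):
    """Verbose observation: one full sentence per cell."""
    sentences = []
    for r, row in enumerate(grid):
        for c, ch in enumerate(row):
            sentences.append(f"Row {r}, column {c} contains {NAMES[ch]}.")
    return " ".join(sentences)


ascii_obs = to_ascii(GRID)
text_obs = to_natural_language(GRID)

# Character counts as a crude proxy for prompt length.
print(len(ascii_obs), len(text_obs))
```

Both strings carry the same layout, but the sentence form repeats the same scaffolding for every cell, which is the overhead the ASCII modality avoids.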

Capability	Best Agent	Result (ONS where reported)
Overall	GPT-5 mini	0.309
Planning	PPO	0.402
Multi-Agent	PPO	0.432
Navigation	GPT-5 mini (Reasoner)	competitive
Reasoning	LLM + CoT harness	3-10x over baseline
Memory	varies by task	not reported

Why This Benchmark Is Different

Agentick decouples the model from the inference strategy through a composable agent harness. The Markovian preset receives the current observation and outputs a single action. The Markovian Reasoner prompts for concise chain-of-thought reasoning before selecting an action. Researchers can swap harness presets without changing the underlying model, which means you can isolate whether performance comes from the model itself or from how you ask it to think.
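The separation between model and harness can be illustrated with a toy version. This is a sketch under stated assumptions, not Agentick's actual API: the `Model` type, the prompt wording, the action set, and the stub model are all hypothetical and exist only to show how two presets can wrap the same model.

```python
from typing import Callable

ACTIONS = ("up", "down", "left", "right")  # hypothetical action set

# Assumption: a "model" is any function from prompt text to completion text.
Model = Callable[[str], str]


def parse_action(completion: str) -> str:
    """Return the last valid action word in the completion, or the first action as fallback."""
    words = [w.strip(".,:").lower() for w in completion.split()]
    valid = [w for w in words if w in ACTIONS]
    return valid[-1] if valid else ACTIONS[0]


def markovian(model: Model, observation: str) -> str:
    """Current observation in, single action out, with no reasoning step."""
    prompt = f"Observation:\n{observation}\nReply with one action from {ACTIONS}."
    return parse_action(model(prompt))


def markovian_reasoner(model: Model, observation: str) -> str:
    """Same model, but the prompt asks for brief reasoning before the action."""
    prompt = (
        f"Observation:\n{observation}\n"
        "Think briefly step by step, then end with 'Action: <action>' "
        f"using one of {ACTIONS}."
    )
    return parse_action(model(prompt))


def stub_model(prompt: str) -> str:
    """Stand-in for an LLM call so the sketch runs offline."""
    if "step by step" in prompt:
        return "The goal is below the agent. Action: down"
    return "right"


print(markovian(stub_model, "#A.#"))
print(markovian_reasoner(stub_model, "#A.#"))
```

Because both presets take the model as an argument, swapping the harness changes only the prompt and parsing, which is what lets a comparison attribute a score difference to the inference strategy rather than the model.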

The pre-built SFT datasets range from 120,000 to 500,000 episodes. Oracle reference policies are provided for all tasks using the coding API. This means the benchmark is not just an evaluation framework. It is a training ground. The authors explicitly flag RL post-training on Agentick as a future direction, using it as an environment for reinforcement learning from verifiable rewards in complex, stochastic, multi-step settings.

The procedural generation system creates reproducible task instances at four difficulty levels. Thirty-seven tasks is enough breadth to prevent gaming any single dimension while remaining narrow enough that results are interpretable.
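Reproducible procedural generation usually comes down to deriving every random choice from an explicit seed. The sketch below shows that idea for a toy grid task; the function name, the way difficulty scales grid size and wall count, and the returned dictionary layout are assumptions made for illustration and do not describe Agentick's generator.

```python
import random


def generate_instance(seed: int, difficulty: int) -> dict:
    """Build a hypothetical grid task; the same (seed, difficulty) always gives the same instance."""
    if difficulty not in (1, 2, 3, 4):
        raise ValueError("difficulty must be 1, 2, 3, or 4")
    # Private RNG keyed on seed and difficulty, so no global random state is involved.
    rng = random.Random(seed * 10 + difficulty)
    size = 4 + 2 * difficulty   # assumption: harder levels use larger grids
    n_walls = 3 * difficulty    # assumption: harder levels add more obstacles
    cells = [(r, c) for r in range(size) for c in range(size)]
    # Sampling without replacement keeps agent, goal, and walls on distinct cells.
    agent, goal, *walls = rng.sample(cells, n_walls + 2)
    return {"size": size, "agent": agent, "goal": goal, "walls": sorted(walls)}


first = generate_instance(seed=7, difficulty=2)
second = generate_instance(seed=7, difficulty=2)
print(first == second)  # True
```

Keying the generator on the seed is what allows two labs to evaluate different agents on identical task instances.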

Community Sentiment

The prevailing view on Hacker News and Reddit is that agent benchmarks are fundamentally broken. Existing evaluations reward benchmark-specific optimization rather than general capability. Agentick addresses this by making cross-paradigm comparison possible for the first time.

The finding that PPO beats GPT-5 mini on planning tasks generated the most discussion. People assumed language models had closed the gap on structured reasoning. The data says otherwise. RL agents trained specifically for sequential decision-making still hold a meaningful edge when the task requires genuine lookahead rather than pattern completion.

The ASCII versus natural language result also drew attention. Several commenters noted that this validates what some practitioners already knew from building grid-world agents: feeding an LLM a wall of text about a spatial environment wastes context window and degrades performance. A compact grid representation gives the model exactly what it needs without the noise.

What This Means for Agent Design

The 3-10x improvement from the reasoning harness is the single most actionable finding. If you are building LLM agents and not using chain-of-thought prompting for sequential tasks, you are leaving most of your model's capability on the table. This is not a subtle optimization. It is the difference between barely-above-random and genuinely competent.

The PPO result on planning and multi-agent tasks should temper expectations about LLMs replacing RL in sequential decision-making. Language models are excellent at one-shot reasoning and pattern recognition. They are not naturally built for iterative policy optimization in environments where the reward structure unfolds over hundreds of steps. PPO still owns that space.

The ASCII finding suggests a broader principle: give agents the minimal representation that preserves task-relevant information, and they perform better. Natural language is expressive but bloated for spatial reasoning. A compact ASCII grid carries the same layout in far fewer tokens. This plausibly applies beyond grid worlds to any domain where the environment has structure that text serialization obscures.

Agentick's live leaderboard means these results will evolve. The current evaluation covers 27 configurations but leaves out flagship models like GPT-5 Pro and Claude Opus. The authors plan to expand coverage. What we have now is a baseline, not a final answer. But it is a baseline built on methodology that actually lets different approaches compete on equal terms.
