On-device AI agents no longer cap out at 32K context

On-device models have always had a tradeoff. You can have a small model that fits on a laptop, or you can have one that actually remembers what you asked five minutes ago. Liquid AI just published a paper showing you no longer have to pick.

Architecture

LFM2.5-8B-A1B is a sparse Mixture-of-Experts model: 8.3 billion total parameters, 1.5 billion active per token. It has 24 layers: 18 double-gated LIV convolution blocks and 6 standard grouped-query attention (GQA) layers. The active parameter count puts per-token compute in the same class as a 1.5B dense model, while the total parameter count provides the representational capacity of something much larger.

The big architectural change from the October 2025 LFM2-8B-A1B is the context window: 32,768 tokens to 131,072. That is a 4x increase. They achieved it through a two-phase training process: 2 trillion tokens at 32K context for reasoning and tool use, then 400 billion tokens with an increased RoPE base to extend to 128K.
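
To see why raising the RoPE base helps with longer context, here is a rough sketch of the wavelength arithmetic. The head dimension and both base values are placeholders picked for illustration; the source does not state the values Liquid AI actually used.

```python
import math

def rope_wavelengths(head_dim: int, base: float) -> list[float]:
    """Wavelength, in token positions, of each rotary frequency pair.

    Standard RoPE uses theta_i = base ** (-2i / head_dim), so the
    wavelength of pair i is 2 * pi / theta_i.
    """
    return [2 * math.pi * base ** (2 * i / head_dim) for i in range(head_dim // 2)]

HEAD_DIM = 64            # hypothetical head dimension
SHORT_BASE = 10_000.0    # hypothetical base for the 32K phase
LONG_BASE = 1_000_000.0  # hypothetical increased base for the 128K phase
TARGET_CONTEXT = 131_072

for label, base in [("short", SHORT_BASE), ("long", LONG_BASE)]:
    longest = max(rope_wavelengths(HEAD_DIM, base))
    covers = longest >= TARGET_CONTEXT
    print(f"{label} base: longest wavelength {longest:,.0f} positions, "
          f"spans the 128K window: {covers}")
```

With these placeholder values, the slowest-rotating pair completes a full cycle in about 47K positions at the smaller base, which is shorter than a 131,072-token window, and in about 4 million positions at the larger base.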

The vocabulary also nearly doubled, from 65,536 to 128,000 tokens. That is not a cosmetic change. Languages such as Thai (+238%), Hindi (+120%), and Vietnamese (+90%) now tokenize with dramatically lower overhead. For multilingual tool-calling agents, this matters a lot.

Training

The training budget went from 12 trillion tokens to 38 trillion. But the interesting work is in the RL stage.

Liquid AI implemented something they call an "avg@k" reward mechanism. Instead of rewarding the model for any valid completion, the RL stage penalizes generations where the response quality does not correlate with the reward. The practical effect is that the model learns to abstain from answering when it is uncertain rather than hallucinating a confident-sounding wrong answer.
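
The source does not spell out the scoring rule, so the following is only a toy illustration of why an averaged-over-k reward can favor abstention. "avg@k" usually denotes a score averaged over k sampled completions for the same prompt; the per-sample values (+1 correct, 0 abstain, -1 wrong) are an assumption made here, not Liquid AI's published scheme.

```python
from statistics import mean

# Hypothetical per-sample scores; the source does not specify them.
SCORES = {"correct": 1.0, "abstain": 0.0, "wrong": -1.0}

def avg_at_k(outcomes: list[str]) -> float:
    """Average score over k sampled completions for one prompt."""
    return mean(SCORES[outcome] for outcome in outcomes)

# k = 4 samples from a policy that guesses and is right once.
guessing = ["correct", "wrong", "wrong", "wrong"]
# k = 4 samples from a policy that declines to answer.
abstaining = ["abstain"] * 4

print(avg_at_k(guessing))    # -0.5
print(avg_at_k(abstaining))  # 0.0
```

Under this assumed scoring, a lucky guess no longer pays off: when the model is usually wrong on a prompt, declining to answer earns more on average than guessing.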

They also added preference optimization specifically to detect and shut down "doom loops," those repetitive reasoning spirals where a model generates "Wait, let me reconsider" over and over without making progress. The model is explicitly trained to stop doing that.
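
The source does not describe how looping generations were identified when building preference data. As one plausible heuristic, not Liquid AI's method, a repeated n-gram counter can flag a response as a loop; the function names, the n-gram size, and the threshold below are all hypothetical.

```python
from collections import Counter

def max_ngram_repeats(text: str, n: int = 4) -> int:
    """Highest count of any single word n-gram in the text."""
    words = text.lower().split()
    grams = Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))
    return max(grams.values(), default=0)

def looks_like_doom_loop(text: str, n: int = 4, threshold: int = 3) -> bool:
    """Flag text where some n-gram repeats at least `threshold` times."""
    return max_ngram_repeats(text, n) >= threshold

looping = "Wait, let me reconsider. " * 5
steady = "The total is 12 because 3 times 4 equals 12."

print(looks_like_doom_loop(looping))  # True
print(looks_like_doom_loop(steady))   # False
```

A flag like this could mark the rejected side of a preference pair, or serve as a runtime guard in an agent loop.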

Benchmarks

Benchmark	LFM2-8B-A1B (Oct 2025)	LFM2.5-8B-A1B (May 2026)	Δ
AA-Omniscience (Non-Hallucination)	7.46	63.47	+56.01
IFEval (Instruction Following)	79.44	91.84	+12.40
MATH500	74.80	88.76	+13.96
BFCLv4 (Tool Calling)	25.52	48.50	+22.98
Tau² Telecom	13.60	88.07	+74.47

The BFCLv4 jump from 25.5 to 48.5 is the headline number here. That is the Berkeley Function Calling Leaderboard, and it specifically measures the ability to pick the right tool and format the arguments correctly. Nearly doubling that score in about seven months on the same architecture is meaningful.
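
For readers who want to check the table, this short script recomputes the deltas and adds the relative improvement for each benchmark. The scores are copied from the table above; the short dictionary keys and the helper name are illustrative.

```python
# (old score, new score) pairs copied from the benchmark table.
SCORES = {
    "AA-Omniscience": (7.46, 63.47),
    "IFEval": (79.44, 91.84),
    "MATH500": (74.80, 88.76),
    "BFCLv4": (25.52, 48.50),
    "Tau2 Telecom": (13.60, 88.07),
}

def improvements(scores: dict[str, tuple[float, float]]) -> dict[str, tuple[float, float]]:
    """Return (absolute delta, new/old ratio) per benchmark, rounded to 2 places."""
    return {
        name: (round(new - old, 2), round(new / old, 2))
        for name, (old, new) in scores.items()
    }

for name, (delta, ratio) in improvements(SCORES).items():
    print(f"{name}: {delta:+.2f} points, {ratio:.2f}x the old score")
```

The BFCLv4 ratio comes out at 1.90x, which is why "nearly doubling" is the accurate description.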

Speed

This is where the model gets genuinely surprising. On an Apple M5 Max, it runs at 253 tokens per second. On an AMD Ryzen AI Max+ 395, 146 tok/s. On a phone, roughly 30 tok/s. On a single H100 at high concurrency, 18,500 tok/s.

Hardware	Tokens/s
Apple M5 Max	253
AMD Ryzen AI Max+ 395	146
Mobile device	~30
NVIDIA H100 (high concurrency)	18,500

For the M5 Max number: 253 tok/s is fast enough that the tool-dispatch loop feels interactive. Ask, propose, confirm, run, repeat. Under a second per dispatch. That threshold matters because it changes the UX from "send a prompt, wait, get an answer" to "converse with a tool-enabled assistant that happens to live on your device."
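
The sub-second claim is easy to sanity-check with back-of-envelope arithmetic. The sketch below assumes a tool call of about 150 generated tokens, a number chosen for illustration, and ignores prompt processing time, so real dispatch latency will be somewhat higher.

```python
def dispatch_seconds(output_tokens: int, tokens_per_second: float) -> float:
    """Decode time for one tool call, ignoring prompt processing."""
    return output_tokens / tokens_per_second

TOOL_CALL_TOKENS = 150  # hypothetical size of one generated tool call

# Decode rates from the table above.
RATES = {
    "Apple M5 Max": 253,
    "AMD Ryzen AI Max+ 395": 146,
    "Mobile device": 30,
}

for hardware, rate in RATES.items():
    seconds = dispatch_seconds(TOOL_CALL_TOKENS, rate)
    print(f"{hardware}: {seconds:.2f} s per dispatch")
```

Under that assumption the M5 Max stays under a second per dispatch, the Ryzen system lands right around one second, and a phone needs about five.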

Tool Calling

The model uses Pythonic function calls wrapped in special tokens:

Tool call start:  <|tool_call_start|>
Tool call end:    <|tool_call_end|>
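
A minimal sketch of how an application might extract such calls from raw model output. It assumes the payload between the two tokens is either a single Python-style call or a bracketed list of calls with keyword arguments; the parser name and the `get_weather` tool are hypothetical, and the exact payload format should be checked against the model card.

```python
import ast
import re

TOOL_CALL_RE = re.compile(
    r"<\|tool_call_start\|>(.*?)<\|tool_call_end\|>", re.DOTALL
)

def parse_tool_calls(text: str) -> list[tuple[str, dict]]:
    """Return (function name, keyword arguments) for each tool call found."""
    calls = []
    for payload in TOOL_CALL_RE.findall(text):
        tree = ast.parse(payload.strip(), mode="eval")
        # Accept either one call or a list of calls.
        nodes = tree.body.elts if isinstance(tree.body, ast.List) else [tree.body]
        for node in nodes:
            if not (isinstance(node, ast.Call) and isinstance(node.func, ast.Name)):
                raise ValueError("expected a plain function call")
            if node.args:
                raise ValueError("this sketch only handles keyword arguments")
            # literal_eval only accepts literals, so no model output is executed.
            kwargs = {kw.arg: ast.literal_eval(kw.value) for kw in node.keywords}
            calls.append((node.func.id, kwargs))
    return calls

sample = '<|tool_call_start|>[get_weather(city="Paris", days=3)]<|tool_call_end|>'
print(parse_tool_calls(sample))  # [('get_weather', {'city': 'Paris', 'days': 3})]
```

Parsing with `ast` rather than `eval` means the model's output is inspected as data and never run as code.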

Day-one support for llama.cpp, MLX, vLLM, SGLang, ONNX, and Liquid's LEAP platform. The GGUF quantized version drops right into any llama.cpp-compatible runtime.

Liquid also open-sourced LocalCowork, a desktop agent demo that manages 67 tools across 13 MCP servers entirely locally. The entire tool dispatch loop runs on-device. No data leaves the laptop.

Community Reaction

On Reddit, the r/LocalLLaMA thread called it "insanely good" and users noted the 1.2B variant of the same family tied for first place in an independent tool-calling benchmark against models three times its size. On Hacker News, the top comment speculated about Apple acquiring Liquid AI, citing MacRumors reporting that on-device AI is now a key focus area for Apple.

A fine-tuning study from distil labs showed that even the 350M variant reached 96-98% tool-call equivalence after targeted fine-tuning, matching or exceeding a 120B teacher model on three multi-turn benchmarks. That suggests the architecture itself is well-suited to tool use, and the gains are not just from scale.

So What

The 128K context window on an on-device model is not just a spec bump. It changes what local agents can actually do. At 32K, tool schemas, tool outputs, and conversation history all compete for the same window, and a long agent session runs out of room quickly. At 128K, you can run multi-step tool chains where each step generates artifacts the next step references. You can load a small codebase directory into context. You can maintain a conversation across dozens of turns without truncating the earlier ones.
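
A rough token budget makes the difference concrete. The overhead and per-step figures below are placeholders chosen for illustration, not measurements from the source.

```python
def steps_that_fit(context_tokens: int, fixed_tokens: int, tokens_per_step: int) -> int:
    """Number of tool-chain steps that fit after fixed overhead is reserved."""
    return max(0, (context_tokens - fixed_tokens) // tokens_per_step)

FIXED_TOKENS = 6_000     # hypothetical: system prompt plus tool schemas
TOKENS_PER_STEP = 2_500  # hypothetical: one tool call plus its output

for label, window in [("32K", 32_768), ("128K", 131_072)]:
    steps = steps_that_fit(window, FIXED_TOKENS, TOKENS_PER_STEP)
    print(f"{label} window: about {steps} steps before anything must be dropped")
```

With these placeholder numbers the 32K window holds about 10 steps and the 128K window about 50. The gain is larger than 4x because the fixed overhead is paid only once.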

The hallucination reduction through the avg@k reward is the quieter innovation here. Small on-device models tend to be more prone to hallucination, plausibly because they have less capacity to store facts and to represent their own uncertainty. Teaching the model to abstain rather than guess is a practical fix that matters more for local deployment than it does for datacenter models that can fall back to retrieval.

What I find most interesting is the speed curve. 253 tok/s on a laptop means the bottleneck is no longer the model. It is the application layer: tool orchestration, context management, plugin dispatch. The hardware is finally fast enough that software architecture, not inference latency, determines agent responsiveness.

Sources

Liquid AI Blog: https://www.liquid.ai/blog/lfm2-5-8b-a1b
HuggingFace Model: https://huggingface.co/LiquidAI/LFM2.5-8B-A1B
HuggingFace GGUF: https://huggingface.co/LiquidAI/LFM2.5-8B-A1B-GGUF
Reddit r/LocalLLaMA: https://www.reddit.com/r/LocalLLaMA/comments/1tq8a40/liquidailfm258ba1b_hugging_face/
Hacker News: https://news.ycombinator.com/item?id=48310538
distil labs fine-tuning study: https://www.distillabs.ai/blog/fine-tuning-liquids-lfm25-accurate-tool-calling-at-350m-parameters
MarkTechPost writeup: https://www.marktechpost.com/2026/05/28/liquid-ai-releases-lfm2-5-8b-a1b-an-on-device-moe-model-with-8-3b-total-and-1-5b-active-parameters
