On-device models have always had a tradeoff. You can have a small model that fits on a laptop, or you can have one that actually remembers what you asked five minutes ago. Liquid AI just published a paper showing you no longer have to pick.
Architecture
The LFM2.5-8B-A1B is a sparse Mixture-of-Experts model with 8.3 billion total parameters, of which 1.5 billion are active per token. It has 24 layers: 18 double-gated LIV convolution blocks and 6 standard GQA attention layers. The active parameter count puts it in the same compute class as a 1.5B dense model, while the total parameter count provides the representational capacity of something much larger.
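To make the sparsity concrete, here is a small sketch that works out what share of the weights a single token touches, using only the figures quoted above. The function name is illustrative and not taken from any Liquid AI code.

```python
# Figures quoted in the text above.
TOTAL_PARAMS = 8.3e9
ACTIVE_PARAMS = 1.5e9
CONV_LAYERS = 18
ATTN_LAYERS = 6


def active_fraction(active: float, total: float) -> float:
    """Share of the weights that participate in one token's forward pass."""
    if total <= 0:
        raise ValueError("total must be positive")
    return active / total


if __name__ == "__main__":
    frac = active_fraction(ACTIVE_PARAMS, TOTAL_PARAMS)
    print(f"active fraction: {frac:.1%}")
    print(f"total layers: {CONV_LAYERS + ATTN_LAYERS}")
```

Roughly 18% of the weights are used per token, which is why per-token compute tracks the 1.5B figure while memory footprint tracks the 8.3B figure.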
The big architectural change from the October 2025 LFM2-8B-A1B is the context window, which grows from 32,768 tokens to 131,072. That is a 4x increase. They achieved it through a two-phase training process: 2 trillion tokens at 32K context for reasoning and tool use, then 400 billion tokens with an increased RoPE base to extend to 128K.
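The text says the extension to 128K relied on an increased RoPE base but does not give the values. The sketch below shows the general mechanism only: in rotary position embeddings, each frequency pair has a wavelength of 2π · base^(2i/d), so raising the base stretches the longest wavelengths. The head dimension and both base values are assumptions chosen for illustration, not Liquid AI's settings.

```python
import math


def rope_wavelengths(base: float, head_dim: int) -> list[float]:
    """Wavelength, in token positions, of each rotary frequency pair."""
    if head_dim % 2 != 0:
        raise ValueError("head_dim must be even")
    return [2 * math.pi * base ** (2 * i / head_dim) for i in range(head_dim // 2)]


# All three values below are assumptions for illustration only.
ASSUMED_HEAD_DIM = 64
ASSUMED_OLD_BASE = 10_000.0
ASSUMED_NEW_BASE = 1_000_000.0

old_longest = max(rope_wavelengths(ASSUMED_OLD_BASE, ASSUMED_HEAD_DIM))
new_longest = max(rope_wavelengths(ASSUMED_NEW_BASE, ASSUMED_HEAD_DIM))

print(f"context growth: {131_072 // 32_768}x")
print(f"longest wavelength, old base: {old_longest:,.0f} positions")
print(f"longest wavelength, new base: {new_longest:,.0f} positions")
```

With a larger base, the slowest-rotating pairs take more positions to complete a cycle, which is one common way to keep distant positions distinguishable at longer context lengths.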
The vocabulary also nearly doubled, from 65,536 to 128,000 tokens. That is not a cosmetic change. Non-Latin scripts see the largest reported tokenization gains: Thai (+238%), Hindi (+120%), and Vietnamese (+90%). For multilingual tool-calling agents, this matters a lot, because lower tokenization overhead leaves more of the context window for actual content.
Training
The training budget went from 12 trillion tokens to 38 trillion. But the interesting work is in the RL stage.
Liquid AI implemented something they call an "avg@k" reward mechanism. Instead of rewarding the model for any valid completion, the RL stage penalizes generations where the response quality does not correlate with the reward. The practical effect is that the model learns to abstain from answering when it is uncertain rather than hallucinating a confident-sounding wrong answer.
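The exact reward formulation is not spelled out here, so the following is only a toy illustration of why averaging over k sampled completions can favor abstention. The score values (+1 correct, 0 abstain, -1 wrong) and the function name are assumptions, not Liquid AI's actual reward.

```python
from statistics import mean

# Hypothetical per-sample scores; the real values are not given in the text.
SCORE = {"correct": 1.0, "abstain": 0.0, "wrong": -1.0}


def avg_at_k(outcomes: list[str]) -> float:
    """Mean score over the k sampled completions for one prompt."""
    if not outcomes:
        raise ValueError("need at least one sampled completion")
    return mean(SCORE[outcome] for outcome in outcomes)


# Two hypothetical policies on a question the model knows only some of the time.
guesser = ["correct", "wrong", "wrong", "wrong"]
abstainer = ["correct", "abstain", "abstain", "abstain"]

print(avg_at_k(guesser))    # -0.5
print(avg_at_k(abstainer))  # 0.25
```

Under these assumed scores, a policy that guesses confidently is punished for its misses across the k samples, while one that declines when unsure keeps a higher average.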
They also added preference optimization specifically to detect and shut down "doom loops," those repetitive reasoning spirals where a model generates "Wait, let me reconsider" over and over without making progress. The model is explicitly trained to stop doing that.
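How the training data for this was built is not described. As a rough sketch, a repetition heuristic like the one below could flag looping completions as the "rejected" side of a preference pair. The n-gram size, the threshold, and both function names are hypothetical.

```python
from collections import Counter


def repeated_ngram_fraction(text: str, n: int = 4) -> float:
    """Fraction of word n-grams that repeat an n-gram already seen in the text."""
    words = text.lower().split()
    ngrams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    if not ngrams:
        return 0.0
    counts = Counter(ngrams)
    repeats = sum(count - 1 for count in counts.values())
    return repeats / len(ngrams)


def looks_like_doom_loop(text: str, n: int = 4, threshold: float = 0.5) -> bool:
    """Flag text whose n-grams are mostly repeats (the threshold is an assumption)."""
    return repeated_ngram_fraction(text, n) >= threshold


looping = "Wait, let me reconsider. " * 10
healthy = "First compute the sum, then divide by the count to get the mean."

print(looks_like_doom_loop(looping))  # True
print(looks_like_doom_loop(healthy))  # False
```

A real pipeline would likely need something more robust than word n-grams, but the idea of pairing a looping trace against a concise one carries over.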
Benchmarks
| Benchmark | LFM2-8B-A1B (Oct 2025) | LFM2.5-8B-A1B (May 2026) | Δ |
|---|---|---|---|
| AA-Omniscience (Non-Hallucination) | 7.46 | 63.47 | +56.01 |
| IFEval (Instruction Following) | 79.44 | 91.84 | +12.40 |
| MATH500 | 74.80 | 88.76 | +13.96 |
| BFCLv4 (Tool Calling) | 25.52 | 48.50 | +22.98 |
| Tau² Telecom | 13.60 | 88.07 | +74.47 |
The BFCLv4 jump from 25.5 to 48.5 is the headline number here. That is the Berkeley Function Calling Leaderboard, which specifically measures the ability to pick the right tool and format the arguments correctly. Nearly doubling that score in about seven months on the same architecture is meaningful.
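The deltas in the table and the size of the BFCLv4 jump can be checked with a few lines of arithmetic. The scores are copied from the table above; the helper names are illustrative.

```python
# (old score, new score) pairs copied from the benchmark table.
SCORES = {
    "AA-Omniscience": (7.46, 63.47),
    "IFEval": (79.44, 91.84),
    "MATH500": (74.80, 88.76),
    "BFCLv4": (25.52, 48.50),
    "Tau2 Telecom": (13.60, 88.07),
}


def delta(old: float, new: float) -> float:
    """Absolute improvement in points, rounded to two decimals."""
    return round(new - old, 2)


def ratio(old: float, new: float) -> float:
    """Multiplicative improvement (new divided by old)."""
    return new / old


for name, (old, new) in SCORES.items():
    print(f"{name}: +{delta(old, new):.2f} points, {ratio(old, new):.2f}x")
```

The BFCLv4 ratio comes out at about 1.9x, and the largest absolute gain in the table is on Tau² Telecom.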
Speed
This is where the model gets genuinely surprising. Reported throughput ranges from roughly 30 tok/s on a phone to 18,500 tok/s on a single H100 at high concurrency:
| Hardware | Tokens/s |
|---|---|
| Apple M5 Max | 253 |
| AMD Ryzen AI Max+ 395 | 146 |
| Mobile device | ~30 |
| NVIDIA H100 (high concurrency) | 18,500 |
For the M5 Max number: 253 tok/s is fast enough that the tool-dispatch loop feels interactive. Ask, propose, confirm, run, repeat. Under a second per dispatch. That threshold matters because it changes the UX from "send a prompt, wait, get an answer" to "converse with a tool-enabled assistant that happens to live on your device."
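As a back-of-the-envelope check on the "under a second" claim, the sketch below converts the throughput figures into generation time for one tool call. The 60-token call length is an assumption, and the estimate ignores prompt processing and the tool's own execution time.

```python
# Throughput figures from the table above (tokens per second).
THROUGHPUT = {
    "Apple M5 Max": 253,
    "AMD Ryzen AI Max+ 395": 146,
    "Mobile device": 30,
}

# Assumed length of one generated tool call; not a figure from the source.
ASSUMED_TOKENS_PER_TOOL_CALL = 60


def generation_seconds(tokens: int, tokens_per_second: float) -> float:
    """Time to generate `tokens` at a steady rate, ignoring prefill and tool time."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return tokens / tokens_per_second


for device, rate in THROUGHPUT.items():
    seconds = generation_seconds(ASSUMED_TOKENS_PER_TOOL_CALL, rate)
    print(f"{device}: {seconds:.2f} s per tool call")
```

Under that assumption the laptop-class chips stay well below a second per call, while the phone figure lands at about two seconds.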
Tool Calling
The model uses Pythonic function calls wrapped in special tokens:
Tool call start: <|tool_call_start|>
Tool call end: <|tool_call_end|>
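A minimal parsing sketch for output in this style is shown below. It assumes the payload between the two tokens is either a single Python-style call or a list of calls with keyword arguments only. The `get_weather` tool and the `parse_tool_calls` helper are hypothetical, so check the model card for the exact format before relying on this.

```python
import ast
import re

TOOL_CALL_RE = re.compile(
    r"<\|tool_call_start\|>(.*?)<\|tool_call_end\|>", re.DOTALL
)


def parse_tool_calls(text: str) -> list[tuple[str, dict]]:
    """Extract (function_name, keyword_arguments) pairs from model output."""
    calls = []
    for payload in TOOL_CALL_RE.findall(text):
        body = ast.parse(payload.strip(), mode="eval").body
        # Accept a single call or a list/tuple of calls (an assumption).
        nodes = body.elts if isinstance(body, (ast.List, ast.Tuple)) else [body]
        for node in nodes:
            if not (isinstance(node, ast.Call) and isinstance(node.func, ast.Name)):
                raise ValueError("expected a plain function call")
            if node.args:
                raise ValueError("this sketch supports keyword arguments only")
            # literal_eval rejects anything that is not a plain literal,
            # so no model-generated code is ever executed.
            kwargs = {kw.arg: ast.literal_eval(kw.value) for kw in node.keywords}
            calls.append((node.func.id, kwargs))
    return calls


# Hypothetical model output for illustration.
output = (
    "Checking the forecast. "
    '<|tool_call_start|>[get_weather(city="Paris", days=3)]<|tool_call_end|>'
)
print(parse_tool_calls(output))
```

Parsing with `ast` instead of `eval` means the application decides which function names map to real tools, and arguments can only be literals.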
The release has day-one support for llama.cpp, MLX, vLLM, SGLang, ONNX, and Liquid's LEAP platform. The GGUF quantized version drops right into any llama.cpp-compatible runtime.
Liquid also open-sourced LocalCowork, a desktop agent demo that manages 67 tools across 13 MCP servers entirely locally. The entire tool dispatch loop runs on-device. No data leaves the laptop.
Community Reaction
On Reddit, the r/LocalLLaMA thread called it "insanely good" and users noted the 1.2B variant of the same family tied for first place in an independent tool-calling benchmark against models three times its size. On Hacker News, the top comment speculated about Apple acquiring Liquid AI, citing MacRumors reporting that on-device AI is now a key focus area for Apple.
A fine-tuning study from distil labs showed that even the 350M variant reached 96-98% tool-call equivalence after targeted fine-tuning, matching or exceeding a 120B teacher model on three multi-turn benchmarks. That suggests the architecture itself is well-suited to tool use, and the gains are not just from scale.
So What
The 128K context window on an on-device model is not just a spec bump. It changes what local agents can actually do. At 32K, a long agent session with verbose tool outputs could fill the window before the task was done, and the model started forgetting context. At 128K, you can run multi-step tool chains where each step generates artifacts the next step references. You can load an entire codebase directory into context. You can maintain a conversation across dozens of turns without degradation.
The hallucination reduction through avg@k reward is the quieter innovation here. On-device models have always been more prone to hallucination because they have less representational capacity to encode uncertainty. Teaching the model to abstain rather than guess is a practical fix that matters more for local deployment than it does for datacenter models that can fall back to retrieval.
What I find most interesting is the speed curve. 253 tok/s on a laptop means the bottleneck is no longer the model. It is the application layer: tool orchestration, context management, plugin dispatch. The hardware is finally fast enough that software architecture, not inference latency, determines agent responsiveness.
Sources
- Liquid AI Blog: https://www.liquid.ai/blog/lfm2-5-8b-a1b
- HuggingFace Model: https://huggingface.co/LiquidAI/LFM2.5-8B-A1B
- HuggingFace GGUF: https://huggingface.co/LiquidAI/LFM2.5-8B-A1B-GGUF
- Reddit r/LocalLLaMA: https://www.reddit.com/r/LocalLLaMA/comments/1tq8a40/liquidailfm258ba1b_hugging_face/
- Hacker News: https://news.ycombinator.com/item?id=48310538
- distil labs fine-tuning study: https://www.distillabs.ai/blog/fine-tuning-liquids-lfm25-accurate-tool-calling-at-350m-parameters
- MarkTechPost writeup: https://www.marktechpost.com/2026/05/28/liquid-ai-releases-lfm2-5-8b-a1b-an-on-device-moe-model-with-8-3b-total-and-1-5b-active-parameters