Your phone or laptop LLM has been capped at 32K context since day one. That is enough for a few chatbot exchanges, maybe a short document. It is not enough for real agentic work: chaining tool calls, tracking multi-turn conversations, holding a codebase in memory.
Liquid AI just changed that. Their new LFM2.5-8B-A1B runs on a MacBook CPU (Apple M5 Max) at 253 tokens per second with a 128K context window. That is four times the context of the previous generation, at the same speed, on the same hardware.
Architecture
LFM2.5-8B-A1B is a sparse Mixture-of-Experts model. It has 8.3 billion total parameters but activates only 1.5 billion per token. That sparsity is what makes it run on consumer hardware: most of the model stays dormant during any single forward pass.
The architecture combines three building blocks:
- Gated short convolution blocks (18 of 24 layers): Liquid AI's signature LIV convolution, which mixes local and global information efficiently.
- Grouped Query Attention layers (6 of 24): these keep the model compatible with existing inference stacks.
- MoE routing: each token is routed to a subset of experts, keeping active parameters at 1.5B.
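The routing idea in the last bullet can be sketched in a few lines. This is a minimal illustration of generic top-k expert routing, not Liquid AI's implementation: the expert count, the value of `top_k`, the toy scalar "experts", and the router logits are all hypothetical, since the source does not specify them.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def route_token(router_logits, top_k):
    """Pick the top_k experts for one token; weights are renormalized to sum to 1."""
    probs = softmax(router_logits)
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    norm = sum(probs[i] for i in chosen)
    return [(i, probs[i] / norm) for i in chosen]

def moe_layer(x, router_logits, experts, top_k):
    """Weighted sum of the selected experts' outputs; unselected experts never run."""
    out = 0.0
    for idx, weight in route_token(router_logits, top_k):
        out += weight * experts[idx](x)
    return out

# Hypothetical toy setup: 8 scalar "experts", 2 active per token.
NUM_EXPERTS = 8
TOP_K = 2
experts = [lambda x, scale=i + 1: scale * x for i in range(NUM_EXPERTS)]
logits = [0.1, 2.0, -1.0, 0.5, 1.5, 0.0, -0.5, 0.3]  # made-up router scores

selected = route_token(logits, TOP_K)
y = moe_layer(1.0, logits, experts, TOP_K)
print("selected experts and weights:", selected)
print("layer output:", y)

# Figures from the article: 1.5B active out of 8.3B total parameters.
print(f"active share of parameters: {1.5 / 8.3:.1%}")
```

Only the selected experts are evaluated for a given token, which is why the per-token compute tracks the 1.5B active parameters rather than the 8.3B total.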
The context window quadrupled from 32K to 128K tokens. Liquid achieved this with a two-phase strategy: continued pretraining for 2 trillion tokens at 32K context, then a final 400-billion-token phase with an adjusted RoPE base theta to extend to 128K. No architectural changes, just smart scaling of the position encoding.
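To see why changing the RoPE base theta helps with longer contexts, here is a small sketch of the standard RoPE frequency schedule. The head dimension (64) and both theta values (10,000 and 1,000,000) are assumptions chosen for illustration; the source does not state the values Liquid AI used.

```python
import math

def rope_inv_freq(head_dim, base_theta):
    """Standard RoPE inverse frequencies: theta^(-2i/d) for each dimension pair i."""
    return [base_theta ** (-2 * i / head_dim) for i in range(head_dim // 2)]

def longest_wavelength(head_dim, base_theta):
    """Positions needed for the slowest-rotating pair to complete one full turn."""
    slowest = rope_inv_freq(head_dim, base_theta)[-1]
    return 2 * math.pi / slowest

def rotate_pair(x0, x1, position, inv_freq):
    """Rotate one (x0, x1) feature pair by the angle position * inv_freq."""
    angle = position * inv_freq
    c, s = math.cos(angle), math.sin(angle)
    return x0 * c - x1 * s, x0 * s + x1 * c

HEAD_DIM = 64            # hypothetical head dimension
TARGET_CONTEXT = 131_072  # 128K tokens

for theta in (10_000, 1_000_000):  # hypothetical "before" and "after" bases
    wavelength = longest_wavelength(HEAD_DIM, theta)
    covers = wavelength > TARGET_CONTEXT
    print(f"theta={theta:>9,}: longest wavelength ~ {wavelength:,.0f} positions, "
          f"exceeds 128K: {covers}")
```

One common intuition is that a larger base slows the rotation of the low-frequency pairs, so distant positions still map to distinct angles instead of wrapping around. The rotation itself is unchanged, which is consistent with the article's point that no architectural change was needed.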
The vocabulary doubled to 128K tokens, adding coverage for non-Latin scripts. Tokenization efficiency improved 238% for Thai and 117% for Vietnamese, a practical win for global deployment that most model releases ignore.
Benchmarks
The jump from LFM2 to LFM2.5 is not marginal. It is a clean step up on every benchmark that has an LFM2 baseline:
| Benchmark | LFM2-8B-A1B | LFM2.5-8B-A1B | Delta |
|---|---|---|---|
| AA-Omniscience (Non-Hallucination) | 7.46 | 63.47 | +56.01 |
| IFEval (Instruction Following) | 79.44 | 91.84 | +12.40 |
| MATH500 | 74.80 | 88.76 | +13.96 |
| Tau² Telecom (Tool Calling) | 13.60 | 88.07 | +74.47 |
| BFCLv4 (Agentic) | — | 48.50 | — |
The non-hallucination score jump is the most interesting. Liquid added a dedicated RL stage using an avg@k-based reward that teaches the model to abstain when it does not know the answer. This is the same technique behind Claude's refusal training. Seeing it work at 1.5B active parameters is frankly surprising.
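The source does not give the reward formula, so the following is only a sketch of how an abstention-aware avg@k score could behave. The scoring rule (+1 correct, 0 abstain, -1 wrong), the abstention string, and the sample answers are all hypothetical.

```python
ABSTAIN = "I don't know"  # hypothetical abstention marker

def score_sample(answer, gold, wrong_penalty=1.0):
    """Hypothetical per-sample score: +1 correct, 0 abstain, -wrong_penalty wrong."""
    if answer == ABSTAIN:
        return 0.0
    return 1.0 if answer == gold else -wrong_penalty

def avg_at_k(samples, gold, wrong_penalty=1.0):
    """Average the per-sample score over k sampled answers to the same prompt."""
    return sum(score_sample(a, gold, wrong_penalty) for a in samples) / len(samples)

gold = "Paris"
guessing = ["Paris", "Lyon", "Nice", "Lyon"]   # right 1 time out of 4
abstaining = [ABSTAIN] * 4                     # never commits to an answer
confident = ["Paris"] * 4                      # right every time

print("guessing  :", avg_at_k(guessing, gold))
print("abstaining:", avg_at_k(abstaining, gold))
print("confident :", avg_at_k(confident, gold))
```

Under this assumed rule, answering pays off only when the model is right more than half the time across its k samples, so a policy optimized against it is pushed toward abstaining on questions it mostly gets wrong.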
The Tau² Telecom score went from 13.60 to 88.07. That signals that the tool-calling pipeline got a real overhaul. The model now chains tool calls reliably enough for production agent loops.
Inference Speed
Here is where the model gets genuinely impressive:
| Hardware | Speed | Memory |
|---|---|---|
| Apple M5 Max (CPU) | 253 tok/s | < 6 GB |
| AMD Ryzen AI Max+ 395 | 146 tok/s | < 6 GB |
| iPhone / Android (mobile) | ~30 tok/s | fits on device |
| Single H100 (GPU, high concurrency) | 18,500 tok/s | — |
253 tokens per second on a laptop CPU means the model generates responses faster than you can read them. The tool-dispatch loop (ask, propose, confirm, run, repeat) completes in well under a second per dispatch. This is interactive, not batch.
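A quick back-of-the-envelope check of those claims, using the decode speeds from the table. The reading speed, tokens-per-word ratio, and the 120-token size of a tool call are assumptions for illustration, and the estimate covers decoding only (no prompt processing or tool execution time).

```python
# Decode speeds from the table above (tokens per second).
DECODE_TOK_PER_S = {
    "Apple M5 Max (CPU)": 253,
    "AMD Ryzen AI Max+ 395": 146,
    "iPhone / Android (mobile)": 30,
}

READING_WORDS_PER_MIN = 250  # assumed adult reading speed
TOKENS_PER_WORD = 1.3        # assumed average for English text
TOOL_CALL_TOKENS = 120       # hypothetical size of one generated tool call

def decode_seconds(num_tokens, tok_per_s):
    """Time to generate num_tokens at a steady decode rate."""
    return num_tokens / tok_per_s

reading_tok_per_s = READING_WORDS_PER_MIN * TOKENS_PER_WORD / 60
speedup_vs_reading = DECODE_TOK_PER_S["Apple M5 Max (CPU)"] / reading_tok_per_s
print(f"reading speed ~ {reading_tok_per_s:.1f} tok/s, "
      f"M5 Max decodes ~ {speedup_vs_reading:.0f}x faster")

for hardware, speed in DECODE_TOK_PER_S.items():
    seconds = decode_seconds(TOOL_CALL_TOKENS, speed)
    print(f"{hardware}: {seconds:.2f} s per {TOOL_CALL_TOKENS}-token dispatch")
```

With these assumptions the laptop figures land under one second per dispatch, while the mobile figure does not, so the "well under a second" claim is best read as applying to the laptop-class hardware.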
Day-one support for llama.cpp, MLX, vLLM, SGLang, and ONNX means you do not have to wait for community ports. The GGUF quantized weights are on HuggingFace already.
Community Reaction
The HN thread is still young (31 points, 1 comment), and the Reddit reaction in r/LocalLLaMA has been positive. The lone HN comment focuses on practical deployment:
"I'm particularly excited for this one as it may allow teams to scale this architecture for VLAs, and having sparser models means more real-time actions on a locally hosted model." (comment by adityashankar on HN)
On X, TeksEdge noted: "Loving the LFM2.5-8B-A1B jump to 128K context + 1.5B active params. I wonder how well does that 1B active resist context drift as the context grows."
The skepticism is predictable. A model this small being usable for tool calling feels too good to be true. The LFM2 release from October 2025 was solid but not exceptional. LFM2.5 is the first release from Liquid that genuinely competes with larger models for agentic workloads.
Sources
- Liquid AI Blog: LFM2.5-8B-A1B: an Even Better on-Device Mixture-of-Experts
- HuggingFace Model: LiquidAI/LFM2.5-8B-A1B
- GGUF Weights: LiquidAI/LFM2.5-8B-A1B-GGUF
- Hacker News Discussion: Liquid AI reveals 8B-A1B MoE trained on 38T
- Reddit r/LocalLLaMA: Liquid AI releases LFM2.5-8B-A1B
- MarkTechPost Coverage: Liquid AI Releases LFM2.5-8B-A1B
- Unsloth Tutorial: How to Run & Fine-tune LFM2.5
So What
A 1.5B active parameter model that chains tool calls at 253 tok/s on a laptop, with 128K context, is not an incremental improvement. It changes what "on-device AI" means.
Until now, running an agent locally meant accepting either tiny context (32K) or slow speeds. You could do tool calling on-device, but you could not hold a meaningful conversation history. LFM2.5 breaks that tradeoff.
The catch is that the model is reasoning-only now. It generates explicit chain-of-thought before every answer. That adds latency for simple queries where a 1.5B forward pass would suffice. But for agentic workloads where correctness matters more than speed on individual tokens, the tradeoff makes sense.
The real signal here is not "Liquid shipped a good model." It is that sparse MoE at this scale, 8.3B total and 1.5B active, finally works for production agent loops. Anyone building local-first AI tools should take this seriously.