Your 3070 runs a 35B model at 22 tokens per second.
Most people think running a 35B-parameter AI model requires a $1,600 GPU or a cloud subscription. It doesn't. A mixture-of-experts architecture paired with llama.cpp just turned a 2020 gaming card into a local AI workstation — no API keys, no monthly bills, no data leaving your machine.
How MoE Breaks the VRAM Rule
Traditional dense models load every parameter into memory. A 35B dense model needs ~70GB of VRAM at FP16, which is three RTX 4090s. Even at Q4 quantization it still needs roughly 20GB, well beyond an 8GB card.
MoE models work differently. Qwen3.6-35B-A3B has 35 billion total parameters but only activates 3 billion per token. The "routed experts" — which make up 96% of the model — sit in system RAM. The "always active" parameters (attention layers, dense FFN, shared expert) stay on the GPU. For any given token, the computational load is that of a 3B model, not a 35B one.
This is the cheat code: you don't need to fit the whole model on your GPU. You just need to fit the active parts.
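A rough back-of-the-envelope sketch of that memory split, in Python. The 35B total and the 96% expert share come from the figures in this article; the 4.5 bits per weight for a Q4-style quant is an assumption (real GGUF quants vary by tensor), and KV cache and runtime overhead are ignored.

```python
def weight_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GB (1 GB = 1e9 bytes).

    Ignores KV cache, activations, and runtime overhead.
    """
    return params_billion * bits_per_weight / 8


TOTAL_B = 35.0        # total parameters in billions (from the article)
EXPERT_SHARE = 0.96   # share of parameters in routed experts (from the article)
Q4_BITS = 4.5         # ASSUMPTION: effective bits per weight for a Q4-style quant

dense_fp16 = weight_gb(TOTAL_B, 16)
dense_q4 = weight_gb(TOTAL_B, Q4_BITS)
gpu_resident_q4 = weight_gb(TOTAL_B * (1 - EXPERT_SHARE), Q4_BITS)
cpu_resident_q4 = weight_gb(TOTAL_B * EXPERT_SHARE, Q4_BITS)

print(f"All weights at FP16:       {dense_fp16:.1f} GB")
print(f"All weights at ~Q4:        {dense_q4:.1f} GB")
print(f"Non-expert weights (GPU):  {gpu_resident_q4:.2f} GB")
print(f"Routed experts (RAM):      {cpu_resident_q4:.1f} GB")
```

Under these assumptions the always-active weights come to well under 1GB, which leaves most of an 8GB card free for the KV cache and compute buffers.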
The llama.cpp Offloading Strategy
The key launch flags that make this work:
| Flag | What it does |
|---|---|
| --n-cpu-moe 32 | Offloads MoE expert layers to CPU/RAM |
| -ngl 99 | Pushes all non-expert layers to GPU |
| --no-mmap --mlock | Pins model in RAM, prevents disk thrashing |
The --n-cpu-moe flag is the breakthrough. With --n-cpu-moe N, llama.cpp keeps the routed expert weights of the first N layers in system RAM, while -ngl 99 places everything else (attention, shared expert) on the GPU. Since only 3B parameters fire per token, the CPU handles a lightweight workload while the GPU handles the latency-critical attention operations.
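As a minimal sketch, the flags from the table can be assembled into a single launch command. The helper name `build_llama_server_args` and the model path are hypothetical, chosen only for illustration; the flags themselves are the ones listed above plus `-m` for the model file.

```python
import shlex


def build_llama_server_args(model_path, n_cpu_moe=32, n_gpu_layers=99, pin_in_ram=True):
    """Assemble a llama-server command line from the flags discussed above.

    This only builds the argument list; it does not launch anything.
    """
    args = [
        "./llama-server",
        "-m", model_path,                 # path to the GGUF file
        "-ngl", str(n_gpu_layers),        # put all non-expert layers on the GPU
        "--n-cpu-moe", str(n_cpu_moe),    # keep expert weights of the first N layers in RAM
    ]
    if pin_in_ram:
        # Load the model fully into RAM and lock it there to avoid paging.
        args += ["--no-mmap", "--mlock"]
    return args


# HYPOTHETICAL path, substitute your own GGUF file.
cmd = build_llama_server_args("models/qwen-moe-q4_k_m.gguf")
print(shlex.join(cmd))
```

Lowering `n_cpu_moe` moves more expert layers onto the GPU, so a reasonable tuning loop is to reduce it until VRAM is nearly full.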
There's a more precise way to do this too. The HuggingFace community guide recommends using -ot (short for --override-tensor) with regex patterns to assign individual tensors to a device:
./llama-server -ngl 999 -ot "blk\.([0-9]|[1-2][0-9]|30)\.=CUDA0,exps=CPU"
This gives you fine-grained control: keep specific transformer blocks on GPU, route all expert tensors to CPU. Batch size tuning (-b 4096 -ub 4096) helps smooth out prompt processing when weights are split across the PCIe bus.
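To see what that pattern actually selects, here is a small Python sketch that applies the same two rules to a few tensor names. It assumes the rules are tried in the order written and that the first match wins; the tensor names follow the usual GGUF naming scheme but are illustrative, and `assign_device` is a hypothetical helper, not part of llama.cpp.

```python
import re

# The two rules from the -ot argument, in the order they were given.
GPU_BLOCKS = re.compile(r"blk\.([0-9]|[1-2][0-9]|30)\.")  # blocks 0 through 30
EXPERTS = re.compile(r"exps")                              # routed expert tensors


def assign_device(tensor_name: str) -> str:
    """Return the device a tensor would be assigned to under the two rules.

    ASSUMPTION: rules are checked in order and the first match wins.
    "default" means no rule matched, so normal -ngl placement applies.
    """
    if GPU_BLOCKS.search(tensor_name):
        return "CUDA0"
    if EXPERTS.search(tensor_name):
        return "CPU"
    return "default"


for name in [
    "blk.0.attn_q.weight",
    "blk.30.ffn_up_exps.weight",
    "blk.31.ffn_up_exps.weight",
    "blk.31.attn_q.weight",
]:
    print(f"{name:28s} -> {assign_device(name)}")
```

Under that assumption, blocks 0 through 30 land on the GPU in full, experts included, and only the experts in later blocks go to the CPU. Narrowing the block range in the first rule is how you trade VRAM for speed.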
Real-World Benchmarks Across Hardware
| Hardware | RAM | Model | Quantization | Tokens/sec |
|---|---|---|---|---|
| GTX 1060 6GB | 32GB | Qwen3.6-35B-A3B | Q4_K_M | 17–30 |
| RTX 3070 8GB | 16GB | Qwen3.6-35B-A3B | Q4_K_M | ~22 |
| RTX 4070 Super 12GB | 48GB | Qwen3.6-35B-A3B MTP | UD-Q4_K_XL | 75–82 |
| RTX 5080 16GB | 64GB | Qwen3.5-35B-A3B | Q4 | ~63 |
The RTX 3070 result is the interesting one. With only 8GB of VRAM and 16GB of system RAM, half the commonly recommended 32GB, it still hits about 22 t/s, and it does so on Windows rather than Linux. That's faster than the streaming speed many people see from ChatGPT.
The gap between the 3070 and the 4070 Super comes down to two things: MTP (Multi-Token Prediction) support and raw memory bandwidth. The 4070 Super has 504 GB/s vs the 3070's 448 GB/s, and MTP can nearly double throughput by drafting future tokens speculatively. A Reddit user on r/LocalLLaMA reported 80+ t/s on a 12GB RTX 4070 Super with MTP enabled, with 92% draft acceptance rate on code generation tasks.
What MTP Changes
Multi-Token Prediction is the newest speed multiplier. Instead of generating one token at a time, the model speculatively drafts 2-3 future tokens and validates them in a single forward pass. When the drafts are accepted (80-95% of the time, depending on the task), each pass yields several tokens instead of one, for up to 2-3x the throughput at little extra cost.
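A quick way to sanity-check those numbers is the standard expected-value estimate for speculative decoding. The sketch below assumes each drafted token is accepted independently with the same probability and that drafting is free, so it gives an upper bound rather than a measured speedup; the function name is made up for illustration.

```python
def expected_tokens_per_pass(acceptance_rate: float, draft_len: int) -> float:
    """Expected tokens produced by one verification pass.

    ASSUMPTIONS: each drafted token is accepted independently with probability
    `acceptance_rate`, acceptance stops at the first rejection, and the pass
    always yields one token itself. Equals 1 + p + p^2 + ... + p^n.
    """
    if not 0.0 <= acceptance_rate <= 1.0:
        raise ValueError("acceptance_rate must be between 0 and 1")
    if draft_len < 0:
        raise ValueError("draft_len must be non-negative")
    return sum(acceptance_rate ** k for k in range(draft_len + 1))


for p in (0.80, 0.92, 0.95):
    for n in (2, 3):
        print(f"acceptance={p:.2f} draft_len={n}: "
              f"{expected_tokens_per_pass(p, n):.2f} tokens/pass")
```

With two drafted tokens the ceiling is three tokens per pass, so real-world gains depend on how much the drafting and verification overhead eats into that.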
The catch: MTP requires a custom llama.cpp build (the am17an/llama.cpp fork) and MTP-compatible GGUF weights from Unsloth. It's not plug-and-play yet. But for anyone willing to compile from source, the payoff is massive.
A YouTuber demonstrated 2x speedup on Qwen3.6 27B with MTP on an RTX 5060 Ti 16GB — jumping from 22 to 42 t/s with just two extra flags (--spec-type mtp --spec-draft-n-max 2).
The HOBBIT Approach: What's Next
A recent paper called HOBBIT ("A Mixed Precision Expert Offloading System for Fast MoE Inference") pushes MoE offloading further. Instead of keeping all experts at the same precision, it dynamically replaces less critical cache-miss experts with low-precision versions. This reduces expert-loading latency while largely preserving accuracy.
The system works in three layers:
- Token-level: dynamically loads experts based on importance per token
- Layer-level: prefetches likely-needed experts before they're requested
- Sequence-level: caches experts across the full context window
HOBBIT is built on llama.cpp but the code isn't open-sourced yet. The llama.cpp community is actively discussing implementing similar functionality (GitHub Discussion #21419). Given that 96% of MoE model parameters are in experts and only 31% are activated during inference, there's enormous room for smarter caching.
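Since the HOBBIT code isn't public, the following is only a toy illustration of the token-level idea described above: on a cache miss, a low-importance expert is served from a low-precision copy instead of waiting for the full-precision load. The class name, the gate-score threshold, and the LRU eviction policy are all assumptions made for this sketch, not the paper's actual design.

```python
from collections import OrderedDict


class MixedPrecisionExpertCache:
    """Toy LRU cache of full-precision experts with a low-precision fallback.

    HYPOTHETICAL sketch: not HOBBIT's implementation.
    """

    def __init__(self, capacity: int, importance_threshold: float):
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity
        self.importance_threshold = importance_threshold
        self._cached = OrderedDict()  # expert_id -> True, ordered oldest first

    def fetch(self, expert_id: int, gate_score: float):
        """Return (precision, cache_status) for one expert request."""
        if expert_id in self._cached:
            self._cached.move_to_end(expert_id)  # mark as most recently used
            return "high", "hit"
        if gate_score < self.importance_threshold:
            # Less critical expert: use the cheap low-precision copy, skip caching.
            return "low", "miss"
        # Important expert: load full precision, evicting the oldest if full.
        if len(self._cached) >= self.capacity:
            self._cached.popitem(last=False)
        self._cached[expert_id] = True
        return "high", "miss"


cache = MixedPrecisionExpertCache(capacity=2, importance_threshold=0.3)
for expert_id, score in [(1, 0.9), (1, 0.1), (2, 0.1), (2, 0.5), (3, 0.5), (1, 0.1)]:
    print(expert_id, score, cache.fetch(expert_id, score))
```

The sketch leaves out prefetching and sequence-level caching, which is where the paper's layer-level and sequence-level mechanisms would come in.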
The Cost Math That Changes Everything
| Setup | Monthly Cost | Context Window | Privacy |
|---|---|---|---|
| ChatGPT Plus | $20 | 128K | Data sent to OpenAI |
| Claude Pro | $20 | 200K | Data sent to Anthropic |
| Local Qwen3.6-35B-A3B | $0 (electricity) | Up to 256K | Fully local |
You can run a model that matches GPT-4-level performance on coding tasks, with 256K context, for the cost of electricity. The model weights are open. Your prompts never leave your machine.
For heavy users, cloud inference for a 70B model can cost $300 to $800 per month. A used RTX 3070 costs $200 to $300. The break-even point is measured in weeks.
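The break-even claim is simple enough to check directly. This sketch uses the article's figures for hardware and cloud cost; electricity is left as a parameter defaulting to zero, which is an assumption, and the function name is made up for illustration.

```python
WEEKS_PER_MONTH = 52 / 12


def break_even_weeks(hardware_cost: float, monthly_cloud_cost: float,
                     monthly_electricity: float = 0.0) -> float:
    """Weeks until a one-time hardware purchase pays for itself.

    ASSUMPTION: the only running cost of the local setup is electricity.
    """
    monthly_saving = monthly_cloud_cost - monthly_electricity
    if monthly_saving <= 0:
        raise ValueError("local setup never breaks even at these costs")
    return hardware_cost / monthly_saving * WEEKS_PER_MONTH


for gpu_cost in (200, 300):
    for cloud_cost in (300, 800):
        weeks = break_even_weeks(gpu_cost, cloud_cost)
        print(f"${gpu_cost} GPU vs ${cloud_cost}/month cloud: {weeks:.1f} weeks")
```

Across that range the payback period runs from roughly one week to a little over four.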
What This Actually Means
The hardware floor for running competitive AI just dropped to a $300 used GPU and a gaming PC. No CUDA expertise required. No Docker setup. No API rate limits.
The 2026 local LLM boom isn't about ideology. It's about economics. When a 2020 gaming card runs a 35B model faster than most API response times, the value proposition of paying $20/month for ChatGPT collapses. Not for everyone — casual users still benefit from the convenience. But for developers, researchers, and anyone processing sensitive data, the math doesn't lie.
What Surprised Me
The 16GB RAM result caught me off guard. Every guide recommends 32GB minimum for MoE offloading. The original BlackRainLabs research used 32GB and still saw speed variance. Getting 22 t/s on half that RAM suggests the bottleneck isn't capacity but bandwidth. Only about 3B parameters are read per token, the GPU-resident layers run out of the 3070's fast GDDR6, and the per-token traffic between CPU and GPU is small enough that it doesn't tank generation speed.
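One way to reason about a bandwidth-bound setup is to divide memory bandwidth by the bytes that must be read per token. The sketch below is a deliberately crude ceiling: it treats all ~3B active parameters as if they were read from one memory pool, assumes 4.5 bits per weight, and uses example bandwidth values that are assumptions, not measurements from any of the machines in the table.

```python
def max_tokens_per_sec(active_params_billion: float, bits_per_weight: float,
                       bandwidth_gb_per_sec: float) -> float:
    """Upper bound on decode speed when reading weights is the bottleneck.

    ASSUMPTION: every active weight is read exactly once per token from a
    single memory pool; compute time and caching effects are ignored.
    """
    gb_per_token = active_params_billion * bits_per_weight / 8
    return bandwidth_gb_per_sec / gb_per_token


ACTIVE_B = 3.0   # active parameters per token in billions (from the article)
Q4_BITS = 4.5    # ASSUMPTION: effective bits per weight for a Q4-style quant

# ASSUMED example bandwidths in GB/s, for illustration only.
for bandwidth in (25.0, 40.0, 50.0):
    rate = max_tokens_per_sec(ACTIVE_B, Q4_BITS, bandwidth)
    print(f"{bandwidth:.0f} GB/s -> at most {rate:.1f} tokens/sec")
```

The useful takeaway is the shape of the relationship: at a fixed quantization, the ceiling scales with memory bandwidth and with the inverse of the active parameter count, not with total model size.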
That is an uncomfortable implication for cloud AI providers: the hardware floor is now a used GPU in a gaming PC.
The MTP results are even more striking. 80+ t/s on a 12GB card means local inference is now faster than most cloud APIs for sustained generation. The question isn't "can I run local AI?" anymore. It's "why am I still paying for API calls?"