Your 3070 runs a 35B model at 22 tokens per second

Most people think running a 35B-parameter AI model requires a $1,600 GPU or a cloud subscription. It doesn't. A mixture-of-experts architecture paired with llama.cpp just turned a 2020 gaming card into a local AI workstation — no API keys, no monthly bills, no data leaving your machine.

How MoE Breaks the VRAM Rule

Traditional dense models load every parameter into memory. A 35B dense model needs ~70GB of VRAM at FP16, which is three RTX 4090s. Even at Q4 quantization it needs roughly 20GB, more than double what an 8GB card offers.

MoE models work differently. Qwen3.6-35B-A3B has 35 billion total parameters but only activates 3 billion per token. The "routed experts" — which make up 96% of the model — sit in system RAM. The "always active" parameters (attention layers, dense FFN, shared expert) stay on the GPU. For any given token, the computational load is that of a 3B model, not a 35B one.

This is the cheat code: you don't need to fit the whole model on your GPU. You just need to fit the active parts.
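The arithmetic behind that claim can be sketched in a few lines. This is a rough estimate, not a measurement: the 4.5 bits per weight used for Q4 is an assumed average (real GGUF quantizations vary by tensor), the 96% expert share is the figure quoted above, and the function name is made up for illustration.

```python
def split_footprint_gb(total_params, bits_per_weight, expert_fraction):
    """Estimate weight storage in GB, split into (always-active, routed experts).

    Ignores the KV cache, activations, and runtime overhead.
    """
    total_gb = total_params * bits_per_weight / 8 / 1e9
    experts_gb = total_gb * expert_fraction
    return total_gb - experts_gb, experts_gb


TOTAL_PARAMS = 35e9      # from the model name
EXPERT_FRACTION = 0.96   # share of weights in routed experts, as quoted above

for label, bits in [("FP16", 16), ("Q4 (assumed ~4.5 bits/weight)", 4.5)]:
    active_gb, experts_gb = split_footprint_gb(TOTAL_PARAMS, bits, EXPERT_FRACTION)
    print(f"{label}: always-active ~{active_gb:.1f} GB, routed experts ~{experts_gb:.1f} GB")
```

Under these assumptions the always-active weights come to under 1 GB at Q4, while the routed experts take roughly 19 GB. A card with spare VRAM can hold some of the expert layers as well, which is what the launch flags in the next section control.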

The llama.cpp Offloading Strategy

The key launch flags that make this work:

Flag	What it does
`--n-cpu-moe 32`	Keeps the MoE expert weights of the first 32 layers in CPU/RAM
`-ngl 99`	Pushes all non-expert layers to GPU
`--no-mmap --mlock`	Pins model in RAM, prevents disk thrashing

The --n-cpu-moe flag is the breakthrough. It tells llama.cpp to keep the attention and shared expert weights on the GPU while the routed expert weights of the first 32 layers stay in system RAM, where the CPU computes them. Since only 3B parameters fire per token, the CPU handles a lightweight workload while the GPU handles the latency-critical attention operations.
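As a minimal sketch, the flags from the table can be assembled into a launch command from a script. The model path and the function name are hypothetical; `-m` is llama.cpp's model-path option, and the remaining flags are the ones listed above.

```python
import shlex


def build_server_command(model_path, n_cpu_moe=32, gpu_layers=99):
    """Assemble a llama-server argument list from the flags in the table above."""
    return [
        "./llama-server",
        "-m", model_path,               # path to the GGUF file (hypothetical here)
        "-ngl", str(gpu_layers),        # send all layers to the GPU by default...
        "--n-cpu-moe", str(n_cpu_moe),  # ...but keep expert weights of the first N layers on CPU
        "--no-mmap", "--mlock",         # load fully into RAM and pin it there
    ]


command = build_server_command("models/qwen-moe-q4_k_m.gguf")
print(shlex.join(command))
```

Passing `command` to `subprocess.run` would start the server. Lowering `n_cpu_moe` moves more expert layers onto the GPU, so a reasonable way to tune it is to reduce the value until VRAM is nearly full.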

There's a more precise way to do this too. The Hugging Face community guide recommends using -ot (short for --override-tensor) with regex patterns to assign tensors to devices by name:

./llama-server -ngl 999 -ot "blk\.([0-9]|[1-2][0-9]|30)\.=CUDA0,exps=CPU"

This gives you fine-grained control: keep specific transformer blocks on GPU, route all expert tensors to CPU. Batch size tuning (-b 4096 -ub 4096) helps smooth out prompt processing when weights are split across the PCIe bus.
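To see what that pattern actually selects, the two rules can be replayed with Python's `re` module. This sketch assumes that llama.cpp tries the rules in order and applies the first one that matches a tensor name; that ordering is an assumption here, not something stated in the guide. The tensor names are illustrative examples in the `blk.<index>.<name>` style.

```python
import re

# Override rules copied from the -ot argument above: (pattern, device).
OVERRIDES = [
    (r"blk\.([0-9]|[1-2][0-9]|30)\.", "CUDA0"),
    (r"exps", "CPU"),
]


def placement(tensor_name):
    """Return the device of the first matching rule, or None if no rule matches."""
    for pattern, device in OVERRIDES:
        if re.search(pattern, tensor_name):
            return device
    return None


# Illustrative tensor names, not read from a real model file.
for name in ["blk.5.attn_q.weight", "blk.30.ffn_gate_exps.weight",
             "blk.31.ffn_gate_exps.weight", "blk.31.attn_q.weight"]:
    print(name, "->", placement(name) or "default (-ngl)")
```

Under the first-match assumption, the first rule claims everything in blocks 0 through 30, experts included, so only the expert tensors of block 31 and up fall through to the CPU rule. Widening or narrowing the block range in the regex is how you trade VRAM for speed.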

Real-World Benchmarks Across Hardware

Hardware	RAM	Model	Quantization	Tokens/sec
GTX 1060 6GB	32GB	Qwen3.6-35B-A3B	Q4_K_M	17–30
RTX 3070 8GB	16GB	Qwen3.6-35B-A3B	Q4_K_M	~22
RTX 4070 Super 12GB	48GB	Qwen3.6-35B-A3B MTP	UD-Q4_K_XL	75–82
RTX 5080 16GB	64GB	Qwen3.5-35B-A3B	Q4	~63

The RTX 3070 result is the interesting one. With only 8GB of VRAM and 16GB of system RAM, half the recommended 32GB, it still hits about 22 t/s. That is faster than the response speed most people see from ChatGPT, and it was measured on Windows, not Linux.

The gap between the 3070 and the 4070 Super comes down to two things: MTP (Multi-Token Prediction) support and raw memory bandwidth. The 4070 Super has 504 GB/s vs the 3070's 448 GB/s, and MTP can nearly double throughput by drafting future tokens speculatively. A Reddit user on r/LocalLLaMA reported 80+ t/s on a 12GB RTX 4070 Super with MTP enabled, with 92% draft acceptance rate on code generation tasks.
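A quick back-of-the-envelope split of that gap, using only the numbers quoted above. It assumes throughput scales linearly with VRAM bandwidth, which is a simplification, and the two benchmark rows also differ in RAM size, VRAM capacity, and quantization, so treat the result as a rough decomposition.

```python
BANDWIDTH_GB_S = {"RTX 3070": 448, "RTX 4070 Super": 504}    # figures quoted above
TOKENS_PER_S = {"RTX 3070": 22, "RTX 4070 Super": (75, 82)}  # from the benchmark table

bandwidth_ratio = BANDWIDTH_GB_S["RTX 4070 Super"] / BANDWIDTH_GB_S["RTX 3070"]
low, high = (t / TOKENS_PER_S["RTX 3070"] for t in TOKENS_PER_S["RTX 4070 Super"])

print(f"VRAM bandwidth ratio: {bandwidth_ratio:.3f}x")
print(f"Observed throughput ratio: {low:.2f}x to {high:.2f}x")
print(f"Left for MTP and other differences: "
      f"{low / bandwidth_ratio:.2f}x to {high / bandwidth_ratio:.2f}x")
```

Bandwidth accounts for only about a 1.1x difference, so most of the roughly 3.4x to 3.7x gap has to come from MTP and the other configuration differences.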

What MTP Changes

Multi-Token Prediction is the newest speed multiplier. Instead of generating one token at a time, the model speculatively drafts 2-3 future tokens and validates them in a single forward pass. When the draft is correct (80-95% acceptance rate depending on task), you get 2-3x the throughput for free.
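The relationship between acceptance rate and throughput can be sketched with the standard speculative-decoding estimate. It assumes each drafted token is accepted independently with the same probability, which real workloads only approximate, and the function name is made up for illustration.

```python
def expected_tokens_per_pass(acceptance_rate, draft_len):
    """Expected tokens emitted per verification pass.

    Draft token i is kept only if all earlier drafts were accepted, so it
    contributes acceptance_rate ** i. The leading 1 is the token the
    verification pass itself always produces.
    """
    return 1 + sum(acceptance_rate ** i for i in range(1, draft_len + 1))


for rate in (0.80, 0.90, 0.95):
    for draft_len in (2, 3):
        print(f"acceptance={rate:.2f}, drafts={draft_len}: "
              f"{expected_tokens_per_pass(rate, draft_len):.2f} tokens per pass")
```

This is an upper bound on the speedup, because drafting and verifying extra tokens is not actually free. Measured gains will land below it.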

The catch: MTP requires a custom llama.cpp build (the am17an/llama.cpp fork) and MTP-compatible GGUF weights from Unsloth. It's not plug-and-play yet. But for anyone willing to compile from source, the payoff is massive.

A YouTuber demonstrated 2x speedup on Qwen3.6 27B with MTP on an RTX 5060 Ti 16GB — jumping from 22 to 42 t/s with just two extra flags (--spec-type mtp --spec-draft-n-max 2).

The HOBBIT Approach: What's Next

A recent paper called HOBBIT, a mixed-precision expert offloading system for MoE inference, pushes MoE offloading further. Instead of keeping all experts at the same precision, it dynamically replaces less critical cache-miss experts with low-precision versions. This reduces expert-loading latency while preserving accuracy.

The system works at three levels:

Token-level: dynamically loads experts based on importance per token
Layer-level: prefetches likely-needed experts before they're requested
Sequence-level: caches experts across the full context window
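The token-level idea can be illustrated with a toy cache. This is not HOBBIT's algorithm or code (which is not public); it is a simplified sketch in which the class, the LRU eviction policy, and the fixed importance threshold are all assumptions chosen for illustration.

```python
from collections import OrderedDict


class ExpertCache:
    """Toy LRU cache of experts resident on the GPU (illustrative only)."""

    def __init__(self, capacity, importance_threshold):
        self.capacity = capacity
        self.importance_threshold = importance_threshold
        self.resident = OrderedDict()  # expert_id -> precision currently loaded

    def fetch(self, expert_id, gate_score):
        """Return (precision, 'hit' or 'miss') for the requested expert."""
        if expert_id in self.resident:
            self.resident.move_to_end(expert_id)  # mark as most recently used
            return self.resident[expert_id], "hit"
        # Cache miss: important experts load at full precision, the rest load
        # a smaller low-precision copy that is cheaper to transfer.
        precision = "high" if gate_score >= self.importance_threshold else "low"
        if len(self.resident) >= self.capacity:
            self.resident.popitem(last=False)     # evict the least recently used
        self.resident[expert_id] = precision
        return precision, "miss"


cache = ExpertCache(capacity=2, importance_threshold=0.5)
for expert_id, score in [(1, 0.9), (2, 0.1), (1, 0.2), (3, 0.7)]:
    print(expert_id, score, cache.fetch(expert_id, score))
```

The point of the sketch is the miss path: a low router score means the expert contributes little to the output, so paying the full transfer cost for it is a poor trade.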

HOBBIT is built on llama.cpp but the code isn't open-sourced yet. The llama.cpp community is actively discussing implementing similar functionality (GitHub Discussion #21419). Given that 96% of MoE model parameters are in experts and only 31% are activated during inference, there's enormous room for smarter caching.

The Cost Math That Changes Everything

Setup	Monthly Cost	Context Window	Privacy
ChatGPT Plus	$20	128K	Data sent to OpenAI
Claude Pro	$20	200K	Data sent to Anthropic
Local Qwen3.6-35B-A3B	$0 (electricity)	Up to 256K	Fully local

You can run a model that matches GPT-4-level performance on coding tasks, with 256K context, for the cost of electricity. The model weights are open. Your prompts never leave your machine.

For heavy users, cloud inference for a 70B model can cost $300 to $800 per month. A used RTX 3070 costs $200 to $300. The break-even point is measured in weeks.
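The break-even claim is easy to check. The sketch below uses the cloud cost range from the paragraph above and a used-card price of $200 to $300 (both prices appear in this article). It ignores electricity and the cost of the rest of the PC, and the function name is made up for illustration.

```python
WEEKS_PER_MONTH = 52 / 12


def break_even_weeks(hardware_cost, monthly_cloud_cost):
    """Weeks until hardware spend equals what cloud inference would have cost.

    Ignores electricity and the rest of the PC.
    """
    return hardware_cost / monthly_cloud_cost * WEEKS_PER_MONTH


for gpu_cost in (200, 300):
    for cloud_cost in (300, 800):
        weeks = break_even_weeks(gpu_cost, cloud_cost)
        print(f"${gpu_cost} GPU vs ${cloud_cost}/month cloud: {weeks:.1f} weeks")
```

Across that whole range the card pays for itself in roughly one to four weeks.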

What This Actually Means

The hardware floor for running competitive AI just dropped to a $300 used GPU and a gaming PC. No CUDA expertise required. No Docker setup. No API rate limits.

The 2026 local LLM boom isn't about ideology. It's about economics. When a 2020 gaming card runs a 35B model faster than most API response times, the value proposition of paying $20/month for ChatGPT collapses. Not for everyone — casual users still benefit from the convenience. But for developers, researchers, and anyone processing sensitive data, the math doesn't lie.

What Surprised Me

The 16GB RAM result caught me off guard. Every guide recommends 32GB minimum for MoE offloading. The original BlackRainLabs research used 32GB and still saw speed variance. Getting 22 t/s on half that RAM suggests the bottleneck isn't capacity but bandwidth: only the roughly 3B active parameters are read for each token, so system RAM and the PCIe link have far less data to move than the 35B total implies, and the CPU-GPU split doesn't tank generation speed.
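A rough bandwidth ceiling makes the point concrete. The sketch assumes every active weight is read once per token, an average of 4.5 bits per weight for Q4, and 51.2 GB/s for system RAM (the theoretical peak of dual-channel DDR4-3200). None of these were reported with the benchmark, so the numbers are indicative only.

```python
def max_tokens_per_second(bandwidth_gb_s, active_params, bits_per_weight):
    """Upper bound on decode speed if every active weight is read once per token."""
    bytes_per_token = active_params * bits_per_weight / 8
    return bandwidth_gb_s * 1e9 / bytes_per_token


ACTIVE_PARAMS = 3e9     # active parameters per token, from the model name
BITS_PER_WEIGHT = 4.5   # assumed average for a Q4 quantization

# 448 GB/s is the 3070 figure quoted earlier; 51.2 GB/s is an assumed value for
# dual-channel DDR4-3200 system RAM, not something the benchmark reported.
for label, bandwidth in [("GPU VRAM", 448), ("system RAM (assumed)", 51.2)]:
    ceiling = max_tokens_per_second(bandwidth, ACTIVE_PARAMS, BITS_PER_WEIGHT)
    print(f"{label}: at most ~{ceiling:.0f} tokens/s")
```

Under these assumptions a RAM-bound ceiling of about 30 tokens/s sits in the same range as the measured 22 t/s, which is consistent with bandwidth being the limit.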

The uncomfortable implication for cloud AI providers: customers with a used GPU and a gaming PC can now run competitive models themselves.

The MTP results are even more striking. 80+ t/s on a 12GB card means local inference is now faster than most cloud APIs for sustained generation. The question isn't "can I run local AI?" anymore. It's "why am I still paying for API calls?"
