Your iPhone can't run Apple's new 20B AI model

Your iPhone is about to get a lot smarter. Unless you bought the base model. Then you get the leftovers.

Apple announced its third generation of foundation models at WWDC26, and the headline is a 20-billion-parameter sparse model that runs entirely on-device. No cloud round-trips, no data leaving your phone. It's called AFM 3 Core Advanced, and it's the most ambitious thing Apple has ever shipped for local AI. There's just one problem: it needs 12GB of RAM, and only three iPhones have that.

The five-model lineup

Apple split its new AI stack into two on-device models and three cloud models. The full list:

Model	Where	Size	Active Params	What it does
AFM 3 Core	On-device	3B (dense)	3B	Fast text, lightweight tasks
AFM 3 Core Advanced	On-device	20B (sparse)	1-4B	Siri, dictation, image understanding, TTS
AFM 3 Cloud	PCC	Undisclosed	--	General text and image tasks
ADM 3 Cloud	PCC	Undisclosed	--	Image generation and editing
AFM 3 Cloud Pro	PCC (GCP/NVIDIA)	Undisclosed	--	Complex reasoning, agentic tool use

The 3B Core model runs on any iPhone 16 or later, or iPhone 15 Pro. The 20B Core Advanced requires 12GB of unified memory, which means iPhone 17 Pro, iPhone 17 Pro Max, and iPhone Air. That's it. Three devices get the full on-device AI experience. Everyone else falls back to the 3B model or the cloud.
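The gating rule is simple enough to write down. Below is a minimal Python sketch of it. The function name and the `core_eligible` flag are hypothetical, and the RAM figures are the ones quoted in this article, not pulled from any Apple API.

```python
from typing import Optional

# Hypothetical helper; not an Apple API. Thresholds come from the article.
ADVANCED_MIN_RAM_GB = 12  # AFM 3 Core Advanced needs 12GB of unified memory


def on_device_model(ram_gb: int, core_eligible: bool) -> Optional[str]:
    """Pick the largest on-device AFM 3 model a phone can run.

    core_eligible: True for iPhone 15 Pro and iPhone 16 or later, the
    devices the article lists for the 3B Core model (hypothetical flag).
    Returns None when neither on-device model applies.
    """
    if not core_eligible:
        return None
    if ram_gb >= ADVANCED_MIN_RAM_GB:
        return "AFM 3 Core Advanced"  # 20B sparse, 1-4B active
    return "AFM 3 Core"  # 3B dense


# RAM per the article: base iPhone 17 has 8GB; Pro, Pro Max and Air have 12GB.
phones = {"iPhone 17": 8, "iPhone 17 Pro": 12, "iPhone 17 Pro Max": 12, "iPhone Air": 12}
for name, ram in phones.items():
    print(f"{name}: {on_device_model(ram, core_eligible=True)}")
```

RAM alone isn't the whole test: the 3B Core model also has a device-generation cutoff, which is why eligibility is modeled as a separate flag.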

Instruction-Following Pruning: the real trick

The 20B model doesn't actually load 20 billion parameters into memory at once. Apple developed a technique it calls Instruction-Following Pruning (IFP). The full 20B weights sit in NAND flash storage. A lightweight dense block analyzes each prompt and activates only 1-4 billion parameters, dynamically selecting which "experts" to load into DRAM for that specific request.
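Why does keeping most of the weights in flash matter so much on a 12GB phone? Back-of-the-envelope weight sizes make it clearer. The announcement doesn't say what precision AFM 3 stores its weights at, so the bit widths below are assumptions, and the math ignores the KV cache, activations, and everything else the OS needs RAM for.

```python
# Back-of-the-envelope weight memory. Bit widths are assumptions; the
# article doesn't state AFM 3's precision. Ignores KV cache and activations.
GB = 1e9  # decimal gigabytes


def weight_bytes(params: float, bits_per_weight: float) -> float:
    """Bytes needed to store `params` weights at `bits_per_weight` bits each."""
    return params * bits_per_weight / 8


for bits in (16, 4, 2):
    full = weight_bytes(20e9, bits) / GB  # all 20B parameters
    lo = weight_bytes(1e9, bits) / GB     # smallest active slice
    hi = weight_bytes(4e9, bits) / GB     # largest active slice
    print(f"{bits:>2}-bit: full 20B = {full:.2f} GB, active 1-4B = {lo:.2f}-{hi:.2f} GB")
```

The exact numbers depend on the assumed precision, but the ratio doesn't: loading only a 1-4B slice cuts the DRAM cost of the weights by 5-20x compared with loading all 20B.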

This is different from static sparsity, where the same subset of parameters is always active. IFP routes per-prompt. A coding question activates different experts than a dictation request or an image understanding task. The model reselects experts periodically during generation, not just at the start.
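To make per-prompt routing with periodic reselection concrete, here is a toy Python sketch. It is not Apple's IFP: the expert names, the fixed routing vectors, the cosine-similarity scoring, and the greedy fill of an active-parameter budget are all illustrative stand-ins for what would be a learned router in a real sparse model.

```python
import math
from typing import Dict, List, Sequence, Tuple

# Toy expert pool (hypothetical): name -> (routing vector, parameter count).
# A real sparse model learns its router; these vectors are hand-picked.
EXPERTS: Dict[str, Tuple[List[float], float]] = {
    "code": ([1.0, 0.0, 0.0], 1.0e9),
    "dictation": ([0.0, 1.0, 0.0], 1.0e9),
    "vision": ([0.0, 0.0, 1.0], 1.5e9),
    "general": ([0.5, 0.5, 0.5], 1.0e9),
}


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    if na == 0 or nb == 0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (na * nb)


def select_experts(context: Sequence[float], budget: float) -> List[str]:
    """Rank experts by similarity to the context, then greedily fill the parameter budget."""
    ranked = sorted(EXPERTS, key=lambda n: cosine(context, EXPERTS[n][0]), reverse=True)
    chosen: List[str] = []
    used = 0.0
    for name in ranked:
        size = EXPERTS[name][1]
        if used + size <= budget:
            chosen.append(name)
            used += size
    return chosen


def route_during_generation(
    contexts: List[Sequence[float]], reselect_every: int, budget: float
) -> List[List[str]]:
    """Re-run routing every `reselect_every` steps instead of once per prompt."""
    return [
        select_experts(ctx, budget)
        for step, ctx in enumerate(contexts)
        if step % reselect_every == 0
    ]


# A context that starts code-like and drifts toward dictation-like text.
steps = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.1, 0.9, 0.0], [0.0, 1.0, 0.0]]
print(route_during_generation(steps, reselect_every=2, budget=2.5e9))
```

Because the context drifts, the two reselection points pick different expert sets (code-heavy first, dictation-heavy later). A static-sparsity model would return the same set every time.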

The result: Apple claims the 3B-active footprint achieves performance comparable to a 9B dense model. On a phone. Without cloud. That's a real engineering achievement, even if the benchmarks come with asterisks.

The Google elephant in the room

Here's the part Apple didn't put in big letters: their most capable cloud model, AFM 3 Cloud Pro, runs on NVIDIA GPUs inside Google Cloud. And the entire model family was refined using outputs from Google's Gemini frontier models as teacher signals during distillation.

Amar Subramanya, Apple's AI VP, put it this way: "All of these are custom builds for Apple Silicon, trained using proprietary data, and refined using outputs from Gemini frontier models."

So Apple's "privacy-first, Apple Silicon everywhere" story has an asterisk. The on-device models genuinely run locally. But the cloud tier, the one handling the hard problems, depends on Google's infrastructure and was shaped by Google's models. Apple built the architecture and the privacy layer (Private Cloud Compute with cryptographic attestation), but the heavy compute lives on NVIDIA hardware in Google's data centers.

This isn't necessarily bad. It's pragmatic. Apple apparently couldn't build out its own server fleet fast enough to match what NVIDIA GPUs deliver. But it undercuts the "we do everything ourselves" narrative Apple usually sells.

What actually improved

Apple's benchmarks are human preference evaluations against their own 2025 models, not MMLU or SWE-bench. Take them with that grain of salt.

That said, the numbers that matter:

AFM 3 Cloud: 64.7% human preference rate over the 2025 baseline (only 8.7% preferred the old model). That's a real jump.
AFM 3 Core Advanced dictation: 44.7% preference improvement across formatting and comprehension dimensions.
Text-to-speech: Mean Opinion Score of 4.24 for conversational voice quality, up from 3.82. On the MOS scale, a 0.1 increase is considered highly noticeable to users. This is a 0.42 jump.
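These figures are easy to sanity-check. One assumption: the article gives only the "prefer new" and "prefer old" shares, so this sketch treats whatever is left over as ties or no preference.

```python
# Figures quoted in the article.
PREFER_NEW, PREFER_OLD = 64.7, 8.7  # % of raters, AFM 3 Cloud vs 2025 baseline
MOS_OLD, MOS_NEW = 3.82, 4.24       # conversational TTS Mean Opinion Score
NOTICEABLE_MOS_STEP = 0.1           # article: a 0.1 MOS gain is highly noticeable


def tie_share(prefer_new: float, prefer_old: float) -> float:
    """Share left over for ties/no preference (assumes only three outcomes)."""
    return 100.0 - prefer_new - prefer_old


def noticeable_steps(old: float, new: float, step: float = NOTICEABLE_MOS_STEP) -> float:
    """How many 'highly noticeable' MOS steps the improvement spans."""
    return (new - old) / step


print(f"ties/no preference: {tie_share(PREFER_NEW, PREFER_OLD):.1f}%")
print(f"net margin: {PREFER_NEW - PREFER_OLD:.1f} points")
print(f"MOS gain: {MOS_NEW - MOS_OLD:.2f} = {noticeable_steps(MOS_OLD, MOS_NEW):.1f} noticeable steps")
```

Under that assumption, roughly a quarter of raters (26.6%) had no preference, the net margin is 56 points, and the TTS gain spans 4.2 of the "highly noticeable" 0.1-point steps.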

The TTS improvement is the sleeper hit here. A 0.42 MOS improvement means Siri will sound noticeably more natural in conversational contexts. That's the kind of change users will feel immediately.

The RAM tax is real

The 12GB requirement for AFM 3 Core Advanced creates a two-tier iPhone experience. The base iPhone 17 has 8GB of RAM. It gets the 3B Core model. The Pro, Pro Max, and Air get 12GB and the full 20B experience.

Features locked behind the RAM wall: voice customization (expressiveness, pace control) and advanced systemwide dictation. If you want Siri to speak faster, slower, or with more enthusiasm, you need the Pro.

Apple has done hardware-gated features before (ProMotion, Always-On Display, Action Button), but gating the best AI model behind a RAM requirement is new. It means the "Apple Intelligence" experience is materially different depending on which iPhone you bought, and most people buy the base model.

The privacy architecture is genuinely interesting

Despite the Google dependency, Apple's Private Cloud Compute setup is technically sophisticated. For AFM 3 Cloud Pro running on Google's infrastructure:

A cryptographically verifiable, append-only ledger tracks all Google Cloud hardware in the PCC fleet.
Dual roots of trust: software attestation is rooted in at least two separate, independent vendor roots of trust.
Inference software is recycled with a short TTL (time-to-live), and encryption keys are held in a dedicated, isolated confidential VM.
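Two of those mechanisms, the append-only ledger and dual-root attestation, follow well-known patterns. Here is a minimal Python sketch of each. It assumes a simple SHA-256 hash chain for the ledger and uses HMAC keys as stand-ins for vendor signatures; real attestation uses asymmetric signatures and hardware-backed measurements, and none of these names come from Apple's PCC.

```python
import hashlib
import hmac
import json
from typing import Dict, List

GENESIS = "0" * 64  # hash that the first ledger entry chains from


def entry_hash(prev_hash: str, record: dict) -> str:
    """Hash a ledger record together with the previous entry's hash."""
    payload = prev_hash + json.dumps(record, sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()


def append(ledger: List[dict], record: dict) -> None:
    """Append-only: new entries commit to everything before them."""
    prev = ledger[-1]["hash"] if ledger else GENESIS
    ledger.append({"record": record, "hash": entry_hash(prev, record)})


def verify_ledger(ledger: List[dict]) -> bool:
    """Recompute the chain; editing or removing any entry before the latest breaks it."""
    prev = GENESIS
    for entry in ledger:
        if entry["hash"] != entry_hash(prev, entry["record"]):
            return False
        prev = entry["hash"]
    return True


def attest(measurement: bytes, key: bytes) -> str:
    """Stand-in for a vendor signature: HMAC-SHA256 over a software measurement."""
    return hmac.new(key, measurement, hashlib.sha256).hexdigest()


def verify_dual_root(
    measurement: bytes, sigs: Dict[str, str], roots: Dict[str, bytes], required: int = 2
) -> bool:
    """Accept only if at least `required` independent roots vouch for the same measurement."""
    valid = sum(
        1
        for vendor, key in roots.items()
        if vendor in sigs and hmac.compare_digest(sigs[vendor], attest(measurement, key))
    )
    return valid >= required


# Hypothetical usage: two hosts in the ledger, two independent vendor roots.
ledger: List[dict] = []
append(ledger, {"host": "node-001"})
append(ledger, {"host": "node-002"})
roots = {"vendor_a": b"key-a", "vendor_b": b"key-b"}
build = b"inference-build-123"
sigs = {v: attest(build, k) for v, k in roots.items()}
print(verify_ledger(ledger), verify_dual_root(build, sigs, roots))
```

A production ledger would also publish its latest hash (or a Merkle root) so that truncating the newest entries is detectable too; the chain alone only protects history behind the head. Requiring two roots means a single compromised vendor can't vouch for modified inference software on its own.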

The claim is that even though the compute happens on Google's hardware, Apple can cryptographically verify that no one (including Google) can access the raw inference data. It's a strong claim, and it's the kind of thing that matters for sensitive queries.

Not available everywhere

Siri AI and the AFM 3 features are not launching in the EU (DMA compliance issues) or mainland China (pending regulatory approval). If you're in those markets, you're stuck with the current Apple Intelligence features for now.

What surprised me

The benchmark situation is what caught my attention. Apple chose human preference evaluations over standardized benchmarks. Other frontier labs routinely publish MMLU, GPQA, and MATH scores. Apple published "we asked humans which response they liked better, and they liked ours 64.7% of the time against our own old model."

That's... not nothing. But it's also not comparable to what other labs publish. You can't take Apple's numbers and put them on a chart next to Claude or GPT scores. And that seems intentional. Apple is sidestepping the benchmark race entirely and betting that user experience matters more than leaderboard position.

The honest reading: Apple's models are likely competitive in their specific domains (dictation, Siri, on-device tasks) but probably not in general reasoning or coding. The 64.7% preference rate is strong for an upgrade comparison, but it doesn't tell you how AFM 3 Cloud Pro stacks up against Claude Opus or GPT-5 on a coding benchmark.

The sparse on-device architecture is the real story here. Keeping 20B parameters in flash and activating only 1-4B of them per prompt is clever engineering. If the quality claims hold up in real usage, it means your iPhone could handle tasks that currently require API calls. That's a cost and privacy win for developers building on Apple's framework.

But the RAM gatekeeping is going to sting. Most iPhone users don't have the Pro. Apple just told a huge chunk of their user base that their phone is the legacy version.
