PrismML released an image generation model that weighs less than a gigabyte and runs on an iPhone. Not a truncated, watered-down version. A 4-billion-parameter diffusion transformer that generates 1024x1024 images in about 16 seconds. The weights are 0.93GB for the 1-bit variant, 1.21GB for the ternary one. Either way it's smaller than a single episode of a Netflix show.

I ran across this on the PrismML blog and had to double-check the numbers. This is FLUX.2 Klein 4B, Black Forest Labs' current open flagship, compressed through binary and ternary quantization until the transformer fits in the space most image models use for their text encoder alone.


How the compression works

The core trick is extreme weight quantization. Normal models store each weight as a 16-bit float (FP16). The 1-bit Bonsai variant stores each weight as either -1 or +1, a single bit. The ternary variant adds a third state, 0, which takes log2(3), or about 1.58, bits per weight. About 5% of the model stays in FP16 (the projection layers) to preserve accuracy. Everything else gets binarized or ternarized.
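To make the idea concrete, here is a minimal sketch of binary and ternary weight quantization in plain Python. It is not PrismML's method, which the blog post doesn't spell out. The mean-absolute-value scale and the 0.7 threshold ratio are assumptions borrowed from common practice in the quantization literature, and the function names are made up for illustration.

```python
import math

def binarize(weights):
    """1-bit: keep only the sign of each weight, plus one shared scale."""
    scale = sum(abs(w) for w in weights) / len(weights)  # mean absolute value (assumed)
    signs = [1 if w >= 0 else -1 for w in weights]
    return signs, scale

def ternarize(weights, threshold_ratio=0.7):
    """Ternary: weights near zero snap to 0, the rest to -1 or +1."""
    mean_abs = sum(abs(w) for w in weights) / len(weights)
    threshold = threshold_ratio * mean_abs  # 0.7 is an assumed heuristic
    codes = [0 if abs(w) < threshold else (1 if w > 0 else -1) for w in weights]
    kept = [abs(w) for w, c in zip(weights, codes) if c != 0]
    scale = sum(kept) / len(kept) if kept else 0.0
    return codes, scale

def dequantize(codes, scale):
    """Reconstruct approximate weights from the codes and the shared scale."""
    return [c * scale for c in codes]

weights = [0.42, -0.07, -0.91, 0.03, 0.55, -0.38]
signs, binary_scale = binarize(weights)
codes, ternary_scale = ternarize(weights)

print(signs)   # [1, -1, -1, 1, 1, -1]
print(codes)   # [1, 0, -1, 0, 1, -1]
print(round(math.log2(3), 2))  # 1.58 bits needed to encode three states
```

Note how the ternary version zeroes out the two smallest weights instead of forcing them to a full-magnitude +1 or -1. That extra state is a plausible reason the ternary variant retains more quality.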

The result is a diffusion transformer that goes from 7.75GB (FLUX.2 Klein 4B FP16) down to:

| Variant | Transformer size | Reduction vs FP16 | Quality retention |
|---|---|---|---|
| 1-bit (binary) | 0.93 GB | 8.3x | 88% |
| Ternary | 1.21 GB | 6.4x | 95% |
| FLUX.2 Klein 4B (FP16) | 7.75 GB | 1.0x | baseline |
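As a sanity check on those sizes, the back-of-the-envelope estimate below derives the parameter count from the 7.75GB FP16 figure, keeps 5% of the weights at 16 bits, and stores the rest at 1 or log2(3) bits. It assumes decimal gigabytes and ignores scale factors and packing overhead, neither of which the blog post breaks down.

```python
import math

FP16_BITS = 16
FP16_SIZE_GB = 7.75    # FP16 transformer size quoted in the article
FP16_FRACTION = 0.05   # share of weights kept in FP16, per the article

def estimated_size_gb(low_bits):
    """Estimate transformer size when most weights use `low_bits` bits each."""
    params = FP16_SIZE_GB * 1e9 * 8 / FP16_BITS  # parameter count implied by FP16 size
    avg_bits = FP16_FRACTION * FP16_BITS + (1 - FP16_FRACTION) * low_bits
    return params * avg_bits / 8 / 1e9

binary_gb = estimated_size_gb(1.0)
ternary_gb = estimated_size_gb(math.log2(3))

print(f"{binary_gb:.2f} GB")   # 0.85 GB (published: 0.93 GB)
print(f"{ternary_gb:.2f} GB")  # 1.12 GB (published: 1.21 GB)
print(f"{7.75 / 0.93:.1f}x, {7.75 / 1.21:.1f}x")  # 8.3x, 6.4x
```

Both estimates land a little under the published sizes, which is what you'd expect if the remaining space goes to per-group scales and file metadata. The reduction ratios in the table follow directly from dividing the sizes.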

Total payload including text encoder and VAE comes to about 3.88GB for the ternary variant. That still fits comfortably on any modern phone.

Memory and speed benchmarks

The real numbers that matter for local deployment:

| Scenario | 1-bit memory | Ternary memory | FLUX.2 Klein FP16 memory |
|---|---|---|---|
| 512x512 generation | 1.5 GB | 1.96 GB | 11.74 GB |
| 1024x1024 generation | 1.95 GB | 2.38 GB | 14.39 GB |

That 14.39GB figure for the original FLUX.2 Klein is important. It rules out 8GB and 12GB cards and leaves little headroom on a 16GB one, so in practice a 24GB RTX 4090 is the comfortable choice for local FLUX. The Bonsai variants need under 2.4GB at 1024x1024, which fits within the 8GB of a standard laptop GPU and within the memory of an M4 MacBook Air or an iPhone 17 Pro Max.
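A quick way to reason about fit is to compare the peak memory figures against a device budget. The sketch below uses the numbers from the memory table. The 80% headroom factor and the helper names are my own assumptions, not something the blog post specifies.

```python
# Peak memory in GB, copied from the benchmark table
PEAK_MEMORY_GB = {
    "512x512":   {"1-bit": 1.5,  "ternary": 1.96, "fp16": 11.74},
    "1024x1024": {"1-bit": 1.95, "ternary": 2.38, "fp16": 14.39},
}

def fits(variant, resolution, budget_gb, headroom=0.8):
    """True if peak memory stays within a fraction of the device budget.

    The headroom factor leaves room for the OS and other apps (assumed value).
    """
    return PEAK_MEMORY_GB[resolution][variant] <= budget_gb * headroom

def memory_savings(variant, resolution):
    """How many times less peak memory a variant needs than FP16."""
    row = PEAK_MEMORY_GB[resolution]
    return row["fp16"] / row[variant]

for variant in ("1-bit", "ternary", "fp16"):
    print(variant, fits(variant, "1024x1024", budget_gb=8))
# 1-bit True
# ternary True
# fp16 False

print(f"{memory_savings('ternary', '1024x1024'):.1f}x")  # 6.0x
print(f"{memory_savings('1-bit', '1024x1024'):.1f}x")    # 7.4x
```

The runtime memory savings come out slightly lower than the weight-size reductions, since activations, the text encoder, and the VAE are not compressed by the same factor.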

On an iPhone 17 Pro Max, the ternary variant generates a 512x512 image in 9.4 seconds. On an M4 Pro Mac, about 6 seconds. On an RTX 3080 via CUDA, a 1024x1024 image in 4.5 seconds. On an A100, 2.8 seconds.

There's also a WebGPU demo that runs entirely in your browser. I tested it. It works. Not as fast as native but it generates coherent images without sending a single API call to anyone.

What the community is saying

The Reddit response on r/LocalLLaMA has been overwhelmingly positive. The top comment calls it "really cooked" and notes the Apache 2.0 license. On r/StableDiffusion, people are posting comparisons between the ternary Bonsai and the original FLUX.2 Klein outputs. Most can't tell the difference in a blind comparison at 512x512.

Some caveats have emerged. The 1-bit variant at 1024x1024 shows visible quality degradation on complex prompts with multiple objects. Hands and fine text are where the compression shows its limits. The ternary variant, at 95% quality retention, handles these much better. For most practical uses (YouTube thumbnails, social media graphics, concept art sketches), the ternary variant is effectively indistinguishable from the original.

A few users noted that the initial setup has a gotcha: the GPU pipeline env variable needs to be set before the first serve.sh run. The team has acknowledged this and it's documented in the setup scripts now.

What this means for local AI

This is the first time a 4B-class image generation model has run on a phone at usable speeds. What matters more than the benchmarks is what this enables. You no longer need a cloud subscription or a dedicated GPU to generate high-quality images. The models are Apache 2.0, free for commercial use, and fully offline after download.

The compression technique is also transferable. PrismML has been pushing 1-bit and ternary quantization across their model lineup including LLM variants like Bonsai 1.7B (which runs at 290MB in a browser). If this approach generalizes, we could see similar compression applied to video generation, audio models, and multi-modal architectures.

The trade-off is real but manageable. You lose 5-12% quality depending on the variant. For quick iteration on a phone, that's a non-issue. The ability to generate a dozen image variants on a plane without an internet connection changes how you think about creative tools.

What surprised me

I keep thinking about the total download size. The entire Bonsai Studio iOS app plus model weights is under 4GB. That's less than Call of Duty. A full FLUX-quality image generator that fits in the space of a single video game. That comparison would have sounded absurd six months ago.

The other thing is the WebGPU demo. Seeing image generation work in a browser tab without any server component feels like a boundary being crossed. It's not technically new (the underlying tech is years old now), but the packaging matters. When someone can share a link to a Hugging Face Space and the person on the other end generates 1024x1024 images in-browser on a laptop, that's a distribution milestone.

I'm not sure the 1-bit variant is worth the extra quality loss for most users. The ternary version at 1.21GB hits a sweet spot: 95% quality from a transformer 6.4x smaller. That 5% loss is about what you'd give up to JPEG compression on a photo you'd still post to Instagram. Good enough. The 1-bit version is impressive engineering but the ternary version is what you'd actually use.

