Your 11.9B inpainting model just got outperformed by something 50x smaller

The image inpainting community has been chasing one goal for years: fill in the missing parts of a photo without it looking like garbage. The current best solutions, FLUX.1-Fill-Dev and SD3.5 Large-Inpainting, pack roughly 8 to 12 billion parameters and need serious GPU hardware. A new paper from researchers at Huazhong University of Science and Technology and VIVO AI Lab says you don't need any of that. Their model, Moebius, reportedly matches or beats them with 0.22B parameters. That's less than 2% of FLUX's size.

What Moebius actually does

Moebius is a specialized inpainting model. It doesn't try to generate images from text, do style transfer, or handle a dozen tasks. It does one thing: take a masked region in an image and fill it with contextually correct pixels. And it does that one thing extremely well.

The key innovation is a block called LλMI (Local-λ Mix Interaction). Standard diffusion models use self-attention and cross-attention to understand spatial relationships in images, and self-attention in particular scales quadratically with the number of spatial positions, which grows quickly with resolution. LλMI compresses spatial context and global semantics into fixed-size linear matrices instead. The result: it preserves the complex latent interactions that make inpainting look natural, but without the computational overhead that normally comes with them.

Think of it like this. A normal attention mechanism is a library where every book can reference every other book. Moebius's LλMI block is a library where books are organized into sections that reference each other efficiently. You still get the connections, but the librarian isn't drowning.
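To make the "fixed-size matrices" idea concrete, here is a minimal sketch of a generic lambda-style context summary in plain Python. It is not the paper's LλMI block: the function names, the toy shapes, and the choice of a softmax over positions are all assumptions for illustration. The point is only that the summary has shape d_k x d_v no matter how many positions the image has, so no n x n attention map is ever built.

```python
import math

def transpose(m):
    """Swap rows and columns of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*m)]

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax_over_positions(keys):
    """Normalize each key channel across all positions (column-wise softmax)."""
    cols = []
    for col in zip(*keys):
        peak = max(col)
        exps = [math.exp(v - peak) for v in col]
        total = sum(exps)
        cols.append([e / total for e in exps])
    return transpose(cols)  # back to n x d_k

def compressed_context(queries, keys, values):
    """Hypothetical helper: fold n positions into a d_k x d_v summary, then read it per query."""
    weights = softmax_over_positions(keys)        # n x d_k
    summary = matmul(transpose(weights), values)  # d_k x d_v, independent of n
    return matmul(queries, summary), summary      # output is n x d_v

# Toy input (made up): 4 positions, key width 2, value width 3.
queries = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [1.0, 1.0]]
keys = [[0.1, 0.9], [0.4, 0.2], [0.7, 0.3], [0.2, 0.8]]
values = [[1.0, 2.0, 3.0]] * 4

output, summary = compressed_context(queries, keys, values)
print(len(summary), len(summary[0]))  # summary shape: 2 3
print(len(output), len(output[0]))    # output shape: 4 3
```

Doubling the number of positions doubles the work but leaves the summary at 2 x 3, which is the linear-cost behavior the article attributes to LλMI.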

How they got a 0.22B model to behave like a 10B one

The architecture alone wouldn't be enough. Trained on its own, a 0.22B model would struggle to match a 10B generalist on complex texture completion or facial plausibility. So the team paired Moebius with an adaptive multi-granularity distillation strategy.

The teacher model is PixelHacker, a large inpainting specialist from the same lab. The distillation happens entirely in latent space (not pixel space, which would be expensive), and it operates at multiple granularity levels simultaneously: microscopic intermediate features, macroscopic diffusion trajectories, and everything in between.

The trick is dynamic gradient norm balancing. During training, different loss terms compete. Some want the student to mimic the teacher's fine details, others want it to follow the teacher's overall denoising path. Moebius uses an adaptive weighting mechanism that adjusts these losses based on their gradient norms, so no single objective dominates and the student model doesn't hit representation saturation.
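The article does not spell out the exact weighting rule, so the following is only a sketch of one common way to balance losses by gradient norm: weight each loss inversely to the size of its gradient. The loss names ("feature", "trajectory"), the toy gradients, and the normalization that makes the weights sum to the number of losses are all assumptions, not the paper's method.

```python
import math

def grad_norm(grad):
    """Euclidean norm of a gradient stored as a flat list."""
    return math.sqrt(sum(g * g for g in grad))

def balanced_weights(grads, eps=1e-12):
    """Weight each loss inversely to its gradient norm; weights sum to the number of losses."""
    inverse = {name: 1.0 / (grad_norm(g) + eps) for name, g in grads.items()}
    scale = len(inverse) / sum(inverse.values())
    return {name: w * scale for name, w in inverse.items()}

def combined_gradient(grads):
    """Sum the per-loss gradients after reweighting them."""
    weights = balanced_weights(grads)
    size = len(next(iter(grads.values())))
    total = [0.0] * size
    for name, grad in grads.items():
        for i, value in enumerate(grad):
            total[i] += weights[name] * value
    return total, weights

# Toy gradients (made up): the feature loss is 100x louder than the trajectory loss.
grads = {
    "feature": [30.0, 40.0],    # norm 50
    "trajectory": [0.3, 0.4],   # norm 0.5
}
total, weights = combined_gradient(grads)

# After reweighting, both losses push on the student with the same strength.
for name, grad in grads.items():
    print(name, round(weights[name] * grad_norm(grad), 4))
```

Without the reweighting, the feature term would contribute 100 times more to each update than the trajectory term. With it, both contribute equally, which is the "no single objective dominates" behavior described above.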

The authors describe this as mapping the "architecture-distillation combination frontier." In plain language: they found the sweet spot where the architecture's design and the distillation strategy amplify each other.

Benchmark results

Moebius was evaluated on six benchmarks across natural scenes (Places2) and portrait scenes (CelebA-HQ, FFHQ). Here's how it stacks up:

Model	Parameters	Inference Time	Quality (vs FLUX)
FLUX.1-Fill-Dev	11.9B	~400ms	Baseline
SD3.5 Large-Inpainting	~8B	~350ms	Comparable
Moebius	0.22B	26ms	Matches or beats

On facial plausibility benchmarks (CelebA-HQ, FFHQ), Moebius actually outperforms both FLUX and SD3.5. On complex texture completion in natural scenes, it matches them. Going by the latencies in the table, the inference speed advantage is roughly 15x over FLUX and about 13x over SD3.5 on a single GPU.
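The headline ratios are easy to check from the table. This snippet only redoes the arithmetic on the numbers quoted there; the latencies are the approximate figures from the table, so treat the speedups as rough.

```python
# Figures copied from the table above (latencies are approximate).
flux_params_b = 11.9
moebius_params_b = 0.22
flux_ms = 400.0
sd35_ms = 350.0
moebius_ms = 26.0

size_ratio = flux_params_b / moebius_params_b       # how many times smaller Moebius is
size_fraction = moebius_params_b / flux_params_b    # Moebius as a share of FLUX
speedup_vs_flux = flux_ms / moebius_ms
speedup_vs_sd35 = sd35_ms / moebius_ms

print(f"size ratio: {size_ratio:.1f}x")             # size ratio: 54.1x
print(f"share of FLUX: {size_fraction:.2%}")        # share of FLUX: 1.85%
print(f"speedup vs FLUX: {speedup_vs_flux:.1f}x")   # speedup vs FLUX: 15.4x
print(f"speedup vs SD3.5: {speedup_vs_sd35:.1f}x")  # speedup vs SD3.5: 13.5x
```

So "50x smaller", "less than 2%", and "roughly 15x" all hold up against the table.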

That 26ms per step number is worth pausing on. It means Moebius could realistically run in real-time interactive applications on consumer hardware. Try running FLUX on a laptop and you'll understand why that matters.

The "impossible triangle"

The authors frame their work around what they call the "Impossible Triangle" of AI: low parameters, fast inference, and high quality. You traditionally pick two. Moebius claims all three.

This isn't just an academic exercise. If you're building a photo editing app, a game texture pipeline, or any system that needs inpainting at scale, the cost difference between running a 0.22B model and an 11.9B model is enormous, not just in compute but in memory, energy, and latency.
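For a rough sense of the memory side, here is a back-of-the-envelope estimate of weight storage alone. The 2 bytes per parameter (fp16 or bf16) is an assumption, since the article does not say what precision either model runs at, and the estimate ignores activations, the VAE, and any text encoder.

```python
def weight_memory_gb(params_billion, bytes_per_param=2):
    """Approximate weight storage in decimal GB (assumes 2 bytes/param unless told otherwise)."""
    return params_billion * bytes_per_param

moebius_gb = weight_memory_gb(0.22)
flux_gb = weight_memory_gb(11.9)

print(f"Moebius weights: ~{moebius_gb:.2f} GB")  # Moebius weights: ~0.44 GB
print(f"FLUX weights:    ~{flux_gb:.1f} GB")     # FLUX weights:    ~23.8 GB
```

Under that assumption, one model's weights fit comfortably on a phone-class device, while the other's need a high-end GPU before any image is even loaded.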

What surprised me

The fact that distillation from a teacher (PixelHacker) could close the capacity gap this completely is genuinely surprising. There's a common assumption in the field that once a student model is below a certain size threshold, no amount of distillation can make up for the missing parameters. Moebius suggests that threshold might be lower than we thought, at least for task-specific models.

The other thing that catches my attention: Moebius is task-specific. It only does inpainting. Generalist models like FLUX handle dozens of tasks. There might be a broader lesson here about specialization vs. generalization in model design. When you focus on one task and optimize everything around it, you can sometimes beat a model 50x your size.

Don't confuse it

Moebius is NOT a general-purpose image generator. It won't create images from text prompts. It won't do style transfer or outpainting. It fills in masked regions of existing images. If you need text-to-image generation, you still want FLUX or SD3.5. If you need inpainting specifically, Moebius is now the efficiency benchmark.

Sources

Paper: https://arxiv.org/abs/2606.19195
Project page: https://hustvl.github.io/Moebius
GitHub: https://github.com/hustvl/Moebius
HuggingFace: https://huggingface.co/hustvl/Moebius
Teacher model (PixelHacker): https://github.com/hustvl/PixelHacker
