The image inpainting community has been chasing one goal for years: fill in the missing parts of a photo without it looking like garbage. The current best solutions, FLUX.1-Fill-Dev and SD3.5 Large-Inpainting, pack roughly 8 to 12 billion parameters and need serious GPU hardware. A new paper from researchers at Huazhong University and VIVO AI Lab says you don't need any of that. Their model, Moebius, matches their quality with 0.22B parameters. That's less than 2% of FLUX's size.
What Moebius actually does
Moebius is a specialized inpainting model. It doesn't try to generate images from text, do style transfer, or handle a dozen tasks. It does one thing: take a masked region in an image and fill it with contextually correct pixels. And it does that one thing extremely well.
The key innovation is a block called LλMI (Local-λ Mix Interaction). Standard diffusion models use self-attention and cross-attention to understand spatial relationships in images, and self-attention in particular scales quadratically with the number of spatial tokens, which grows with resolution. LλMI compresses spatial context and global semantics into fixed-size linear matrices instead. The result: it preserves the complex latent interactions that make inpainting look natural, but without the computational overhead that normally comes with them.
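To make the "fixed-size linear matrices" idea concrete, here is a minimal NumPy sketch in the spirit of lambda layers and linear attention. It is not the paper's LλMI block: the function names, dimensions, and the choice to normalize keys over positions are all assumptions made for illustration. The point is only that the context summary `lam` keeps the same shape no matter how many latent tokens there are.

```python
import numpy as np

def softmax(x, axis):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def full_attention(q, k, v):
    # Standard attention: builds an (n, n) score matrix, so cost grows as n^2.
    scores = q @ k.T / np.sqrt(q.shape[1])
    return softmax(scores, axis=1) @ v

def lambda_interaction(q, k, v):
    # Lambda-style alternative (illustrative, not the paper's block):
    # summarize the whole context into one fixed-size (d_k, d_v) matrix,
    # then apply that matrix to every query. No (n, n) matrix is formed.
    lam = softmax(k, axis=0).T @ v      # shape (d_k, d_v), independent of n
    return q @ lam, lam

rng = np.random.default_rng(0)
d_k, d_v = 16, 32                       # hypothetical key/value widths
for n in (64, 256, 1024):               # number of latent tokens
    q = rng.normal(size=(n, d_k))
    k = rng.normal(size=(n, d_k))
    v = rng.normal(size=(n, d_v))
    out, lam = lambda_interaction(q, k, v)
    print(n, out.shape, lam.shape, "full attention scores would be", (n, n))
```

Going from 64 to 1024 tokens multiplies the attention score matrix by 256, while `lam` stays at 16 by 32. That is the kind of scaling behavior the block is designed around.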
Think of it like this. A normal attention mechanism is a library where every book can reference every other book. Moebius's LλMI block is a library where books are organized into sections that reference each other efficiently. You still get the connections, but the librarian isn't drowning.
How they got a 0.22B model to behave like a 10B one
The architecture alone wouldn't be enough. Trained on its own, a 0.22B model would struggle to match a 10B generalist on complex texture completion or facial plausibility. So the team paired Moebius with an adaptive multi-granularity distillation strategy.
The teacher model is PixelHacker, a large inpainting specialist from the same lab. The distillation happens entirely in latent space (not pixel space, which would be expensive), and it operates at multiple granularity levels simultaneously: microscopic intermediate features, macroscopic diffusion trajectories, and everything in between.
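As a rough picture of what "multiple granularity levels" can mean, the sketch below computes one fine-grained loss on an intermediate feature map and one coarse-grained loss on the sequence of latents along the denoising path. Everything here is an assumption: the tensor shapes, the use of plain mean squared error, and the random stand-ins for teacher and student outputs. The paper's actual loss terms may differ.

```python
import numpy as np

rng = np.random.default_rng(0)

def mse(a, b):
    # Mean squared error between two arrays of the same shape.
    return float(np.mean((a - b) ** 2))

# Hypothetical intermediate feature map: 64 channels on a 32x32 latent grid.
# A real student would likely need a projection to match the teacher's width.
teacher_feat = rng.normal(size=(64, 32, 32))
student_feat = teacher_feat + 0.1 * rng.normal(size=(64, 32, 32))

# Hypothetical denoising trajectory: a 4-channel latent at each of T steps.
T = 4
teacher_traj = rng.normal(size=(T, 4, 32, 32))
student_traj = teacher_traj + 0.5 * rng.normal(size=(T, 4, 32, 32))

feature_loss = mse(student_feat, teacher_feat)      # fine-grained term
trajectory_loss = mse(student_traj, teacher_traj)   # coarse-grained term
print("feature loss:", feature_loss)
print("trajectory loss:", trajectory_loss)
```

Both terms live in latent space, so no image ever has to be decoded to compute them. The two losses also come out on different scales, which is why they need some kind of weighting before being summed.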
The trick is dynamic gradient norm balancing. During training, different loss terms compete. Some want the student to mimic the teacher's fine details, others want it to follow the teacher's overall denoising path. Moebius uses an adaptive weighting mechanism that adjusts these losses based on their gradient norms, so no single objective dominates and the student model doesn't hit representation saturation.
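One common way to balance losses by gradient norm is to weight each term inversely to the size of its gradient, so that every term pushes on the shared parameters with equal force. The toy below illustrates that idea; it is a sketch of the general technique, not the paper's exact rule, and `balance_weights`, the two quadratic losses, and the step size are all hypothetical.

```python
import numpy as np

def balance_weights(grad_norms, eps=1e-12):
    # Weight each loss inversely to its gradient norm, then rescale so the
    # weights average to 1. Every weighted gradient ends up with the same norm.
    norms = np.asarray(grad_norms, dtype=float)
    inv = 1.0 / (norms + eps)
    return inv * len(norms) / inv.sum()

# Shared "student" parameters and two toy objectives on very different scales:
#   L_a = 100  * ||theta - target_a||^2
#   L_b = 0.01 * ||theta - target_b||^2
theta = np.array([1.0, -2.0, 0.5])
target_a = np.zeros(3)
target_b = np.ones(3)

grad_a = 200.0 * (theta - target_a)   # analytic gradient of L_a
grad_b = 0.02 * (theta - target_b)    # analytic gradient of L_b

norms = [np.linalg.norm(grad_a), np.linalg.norm(grad_b)]
w = balance_weights(norms)
weighted = [w[0] * grad_a, w[1] * grad_b]

print("raw gradient norms:     ", norms)
print("balancing weights:      ", w)
print("weighted gradient norms:", [np.linalg.norm(g) for g in weighted])

# One gradient step using the balanced sum. The weights are treated as
# constants here, not differentiated through.
theta = theta - 0.001 * (weighted[0] + weighted[1])
```

Without the weights, the first loss's gradient is thousands of times larger than the second and would drown it out. After balancing, both contribute equally to the update, and the weights can be recomputed every step as the gradients change.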
The authors describe this as mapping the "architecture-distillation combination frontier." In plain language: they found the sweet spot where the architecture's design and the distillation strategy amplify each other.
Benchmark results
Moebius was evaluated on six benchmarks across natural scenes (Places2) and portrait scenes (CelebA-HQ, FFHQ). Here's how it stacks up:
| Model | Parameters | Inference Time | Quality (vs FLUX) |
|---|---|---|---|
| FLUX.1-Fill-Dev | 11.9B | ~400ms | Baseline |
| SD3.5 Large-Inpainting | ~8B | ~350ms | Comparable |
| Moebius | 0.22B | 26ms | Matches or beats |
On facial plausibility benchmarks (CelebA-HQ, FFHQ), Moebius actually outperforms both FLUX and SD3.5. On complex texture completion in natural scenes, it matches them. The inference speed advantage is roughly 13x to 15x on a single GPU, depending on which baseline you compare against.
That 26ms per step number is worth pausing on. Total latency still depends on how many denoising steps you run, but at that per-step cost Moebius could realistically power real-time interactive applications on consumer hardware. Try running FLUX on a laptop and you'll understand why that matters.
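The ratios are easy to check from the table's own figures. The snippet below recomputes the parameter fraction and the speedups, then shows how the per-step cost turns into end-to-end latency. The step counts are hypothetical, since the article doesn't say how many steps Moebius uses, and the estimate ignores fixed overhead such as encoding and decoding the latent.

```python
# Figures copied from the comparison table above.
params_b = {"FLUX.1-Fill-Dev": 11.9, "Moebius": 0.22}
latency_ms = {
    "FLUX.1-Fill-Dev": 400.0,
    "SD3.5 Large-Inpainting": 350.0,
    "Moebius": 26.0,
}

param_fraction = params_b["Moebius"] / params_b["FLUX.1-Fill-Dev"]
speedup_vs_flux = latency_ms["FLUX.1-Fill-Dev"] / latency_ms["Moebius"]
speedup_vs_sd35 = latency_ms["SD3.5 Large-Inpainting"] / latency_ms["Moebius"]

print(f"parameter fraction of FLUX: {param_fraction:.1%}")
print(f"speedup vs FLUX: {speedup_vs_flux:.1f}x, vs SD3.5: {speedup_vs_sd35:.1f}x")

def total_latency_ms(per_step_ms, steps):
    # Per-step cost times step count; ignores any fixed overhead.
    return per_step_ms * steps

for steps in (1, 4, 20):   # hypothetical step counts, not from the paper
    t = total_latency_ms(latency_ms["Moebius"], steps)
    print(f"{steps:>2} steps: {t:.0f} ms per edit, about {1000 / t:.1f} edits per second")
```

Whether the result feels interactive depends heavily on the step count, so the per-step figure is best read as a lower bound on latency, not a promise.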
The "impossible triangle"
The authors frame their work around what they call the "Impossible Triangle" of AI: low parameters, fast inference, and high quality. You traditionally pick two. Moebius claims all three.
This isn't just an academic exercise. If you're building a photo editing app, a game texture pipeline, or any system that needs inpainting at scale, the cost difference between running a 0.22B model and an 11.9B model is enormous. Not just in compute, but in memory, energy, and latency.
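For a feel of the memory side, here is a back-of-the-envelope estimate of the space needed just to hold the weights. It assumes 2 bytes per parameter for 16-bit storage and 4 bytes for 32-bit, and it deliberately leaves out activations, text encoders, and the VAE, so real usage will be higher for both models.

```python
def weight_memory_gb(params_billion, bytes_per_param=2):
    # Weights only. 2 bytes per parameter corresponds to fp16/bf16 storage.
    return params_billion * 1e9 * bytes_per_param / 1e9

for name, p in [("Moebius", 0.22), ("FLUX.1-Fill-Dev", 11.9)]:
    half = weight_memory_gb(p)
    full = weight_memory_gb(p, bytes_per_param=4)
    print(f"{name}: {half:.2f} GB in 16-bit, {full:.2f} GB in 32-bit")
```

At 16-bit precision the smaller model's weights fit in under half a gigabyte, while the larger model's weights alone approach the full memory of a 24 GB consumer GPU.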
What surprised me
The fact that distillation from a teacher (PixelHacker) could close the capacity gap this completely is genuinely surprising. There's a common assumption in the field that once a student model is below a certain size threshold, no amount of distillation can make up for the missing parameters. Moebius suggests that threshold might be lower than we thought, at least for task-specific models.
The other thing that catches my attention: Moebius is task-specific. It only does inpainting. Generalist models like FLUX handle dozens of tasks. There might be a broader lesson here about specialization vs. generalization in model design. When you focus on one task and optimize everything around it, you can sometimes beat a model 50x your size.
Don't confuse it
Moebius is NOT a general-purpose image generator. It won't create images from text prompts. It won't do style transfer or outpainting. It fills in masked regions of existing images. If you need text-to-image generation, you still want FLUX or SD3.5. If you need inpainting specifically, Moebius is now the efficiency benchmark.
Sources
- Paper: https://arxiv.org/abs/2606.19195
- Project page: https://hustvl.github.io/Moebius
- GitHub: https://github.com/hustvl/Moebius
- HuggingFace: https://huggingface.co/hustvl/Moebius
- Teacher model (PixelHacker): https://github.com/hustvl/PixelHacker