Open image models finally got a training pipeline that works

Image models have always been the poor cousins of open-weight LLMs. You could download 70B-parameter language models that rival GPT-4, but good luck finding a competitive image model with released weights. FLUX changed that somewhat, but the gap between what proprietary tools like Midjourney produce and what you can actually run at home remained wide. Krea 2 tries to close it.

The company dropped two checkpoints this week: Krea 2 Raw and Krea 2 Turbo. Both are 12-billion-parameter diffusion transformers, trained from scratch on real images, with no synthetic data in the mix. The pitch is straightforward. Raw is the malleable base, designed for training LoRAs and fine-tunes. Turbo is the distilled production version that generates images in about 2 seconds on consumer hardware using just 8 inference steps.

This two-checkpoint approach is the real story, not the model itself. The standard workflow in open image generation has always been: download a model, hope someone made a LoRA for your use case, or fine-tune it yourself and deal with whatever inference characteristics you get. Krea designed these checkpoints to work as a pipeline. You train on Raw, then port your LoRA to Turbo for fast inference. The company claims LoRAs trained on Raw transfer cleanly to Turbo, which means you get the training flexibility of an undistilled model with the speed of a distilled one.
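
To make the Raw-to-Turbo idea concrete, here is a minimal sketch of why a LoRA can in principle be ported between two checkpoints that share a layer shape: the adapter is a low-rank update stored separately from the base weights, so the same update can be added to either one. The tiny matrices, the `apply_lora` helper, and the `alpha / rank` scaling convention are illustrative assumptions, not Krea's code. Whether the update actually behaves well on Turbo is Krea's claim, not something this sketch demonstrates.

```python
# Illustrative sketch only: tiny pure-Python matrices stand in for real weights.

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def apply_lora(base, lora_b, lora_a, alpha, rank):
    """Return base + (alpha / rank) * (B @ A) without modifying base."""
    scale = alpha / rank
    delta = matmul(lora_b, lora_a)
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(base, delta)]

# Hypothetical 2x2 weights for the same layer in each checkpoint.
raw_weight = [[1.0, 0.0], [0.0, 1.0]]
turbo_weight = [[0.9, 0.1], [0.0, 1.1]]

# A rank-1 adapter (B is 2x1, A is 1x2), imagined as trained against raw_weight.
lora_b = [[1.0], [2.0]]
lora_a = [[0.5, -0.5]]

# The same adapter is added to both bases; only the base weights differ.
raw_tuned = apply_lora(raw_weight, lora_b, lora_a, alpha=1.0, rank=1)
turbo_tuned = apply_lora(turbo_weight, lora_b, lora_a, alpha=1.0, rank=1)
```

The adapter never stores the base weights, which is what makes the port mechanically trivial. The hard part, and the part Krea says it optimized for, is keeping the two bases close enough that the update still means the same thing.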

That's a meaningful design choice. Most open image models force you to pick: either you get a base model that's great for fine-tuning but slow to run, or a distilled model that's fast but stubborn about accepting new adaptations. Krea built both and explicitly optimized them for interop.

What's inside the box

The architecture is a single-stream diffusion transformer with some clever engineering tricks. Text encoding uses Qwen3-VL with multi-layer feature aggregation, which is the same approach that's been gaining traction in multimodal models. Instead of feeding the full VLM representation into the diffusion process, they use a shallow attention layer to aggregate features across VLM layers dynamically. The model picks which layers matter for a given prompt, coarse-to-fine.
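
A rough sketch of what attention-based layer aggregation can look like, for one token: each encoder layer's feature vector gets a score, the scores are softmaxed across layers, and the output is the weighted mix. The `query` vector, the scaled dot-product scoring, and the function names are my assumptions for illustration. The article does not describe Krea's actual layer at this level of detail.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def aggregate_layers(layer_features, query):
    """Attention-weighted mix of one token's features across encoder layers.

    layer_features: list of vectors, one per encoder layer (same dimension).
    query: hypothetical learned vector used to score each layer.
    Returns (mixed_vector, per_layer_weights).
    """
    dim = len(query)
    scores = [sum(q * h for q, h in zip(query, feats)) / math.sqrt(dim)
              for feats in layer_features]
    weights = softmax(scores)
    mixed = [sum(w * feats[i] for w, feats in zip(weights, layer_features))
             for i in range(dim)]
    return mixed, weights

# Three made-up layers of 2-d features for a single token.
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
mixed, weights = aggregate_layers(features, query=[2.0, 0.0])
```

Because the scores depend on the features themselves, the mix changes from token to token and prompt to prompt, which is the "dynamic" part.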

Inside the transformer blocks, they went with SwiGLU MLPs at 4x expansion, Grouped-Query Attention for efficiency, and 3D Axial Rotary Position Embedding for spatial awareness. The interesting bit is the timestep conditioning. Instead of the standard per-block MLP that modulates features at each transformer block, Krea replaced it with tunable bias terms. This cuts modulation parameters by 20 to 30 percent, freeing up parameter budget for the attention and MLP layers that actually matter.

On the training side, they built a custom distributed framework using torchtitan and DTensor. Their data infrastructure, which they call "Krablet," uses PostgreSQL shards with a FOR UPDATE SKIP LOCKED pattern for job processing. It's the kind of unsexy engineering that determines whether your training run survives past 24 hours. And given that they report scaling instability at large GPU counts, where runs rarely lasted a full day without crashes, that infrastructure probably earned its keep.
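
For readers who haven't met it, the FOR UPDATE SKIP LOCKED pattern lets many workers pull jobs from one table without blocking each other: each worker locks the first pending row nobody else holds and skips the rest. Below is a minimal sketch. The `jobs` table, its columns, and `claim_next_job` are hypothetical names of mine, and the `%s` placeholder assumes a psycopg-style DB-API driver. Nothing here is taken from Krablet itself.

```python
# Hypothetical schema: jobs(id, status, claimed_by). Not Krea's actual tables.
CLAIM_SQL = """
UPDATE jobs
SET status = 'running', claimed_by = %s
WHERE id = (
    SELECT id FROM jobs
    WHERE status = 'pending'
    ORDER BY id
    LIMIT 1
    FOR UPDATE SKIP LOCKED
)
RETURNING id
"""

def claim_next_job(conn, worker_id):
    """Atomically claim one pending job, or return None if none is free.

    conn: a DB-API 2.0 connection to PostgreSQL (for example from psycopg).
    Rows locked by other workers are skipped instead of waited on.
    """
    cur = conn.cursor()
    try:
        cur.execute(CLAIM_SQL, (worker_id,))
        row = cur.fetchone()
        conn.commit()
    finally:
        cur.close()
    return row[0] if row else None
```

The appeal is that the queue lives in the same database as the data, with no separate broker to crash mid-run.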

The training pipeline flows through pretraining, midtraining, SFT, preference optimization, RL, and finally timestep distillation. For the RL stage, they use a multi-reward GRPO-style method with rubric-based rewards that decompose prompts into verifiable requirements. To prevent the model from gaming the reward system, they added a dedicated artifact reward model that specifically penalizes visual glitches and artifacts.
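
The "GRPO-style" part can be sketched in a few lines: sample several images for one prompt, collapse each image's rewards into a single score, and normalize that score against the rest of the group. The weights, the reward values, and the function names below are made up for illustration. Krea's actual reward models and weighting are not described in this detail.

```python
import statistics

def combine_rewards(reward_vectors, weights):
    """Weighted sum of several reward signals for each sample."""
    return [sum(w * r for w, r in zip(weights, rewards))
            for rewards in reward_vectors]

def group_relative_advantages(scores, eps=1e-8):
    """Normalize scores within one prompt's group: (score - mean) / (std + eps)."""
    mean = statistics.fmean(scores)
    std = statistics.pstdev(scores)
    return [(s - mean) / (std + eps) for s in scores]

# Four images sampled for one prompt, each with (rubric_score, artifact_score).
# rubric_score: fraction of the prompt's requirements judged satisfied.
# artifact_score: higher means a cleaner image. All values are invented.
group = [(1.0, 0.2), (0.75, 0.9), (0.5, 0.8), (0.25, 0.9)]
scores = combine_rewards(group, weights=(0.7, 0.3))
advantages = group_relative_advantages(scores)
```

In this toy group the first image satisfies every requirement but is full of artifacts, so the second image ends up with the highest advantage. That is the reward-hacking guard in miniature: ticking every rubric box does not help if the picture is visibly broken.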

The zero-synthetic data bet

Here's where Krea diverges from the industry consensus. Every major image model lab, including Stability AI and Midjourney, has used AI-generated synthetic data in their training mixes. The reasoning is practical: synthetic data scales cheaply and fills gaps in the real dataset. The problem is model collapse. When you train on AI-generated images, the model reinforces its own biases and artifacts, creating a feedback loop that narrows the output distribution.

Krea's approach is to explicitly exclude all synthetic data from pretraining. They used in-house classifiers built on DINOv3 and SigLIP-2 to identify and remove AI-generated images from their training corpus. For quality filtering, they went a different route than the standard aesthetic score approach. Traditional aesthetic filters often throw out images with motion blur, shallow depth of field, or deliberate softness because those features score low on "sharpness" metrics. Krea built a sparse autoencoder on SigLIP-2 embeddings that can distinguish between genuine artifacts (compression noise, watermark remnants) and intentional artistic choices. The result is a training set that preserves stylistic range while filtering actual defects.

Their midtraining stage uses FAISS-based hierarchical clustering to ensure broad domain coverage. The system identifies gaps in the training distribution, like underrepresented categories of rare objects or niche visual domains, and prioritizes those for the next training batch. They specifically call out prioritizing rare entities from Wikipedia to ensure the model knows what a Mamenchisaurus looks like, not just what a cat looks like.
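
The rebalancing idea is easy to show once clusters exist. The sketch below assumes each image already has a cluster label (from FAISS or anything else) and assigns sampling weights inversely proportional to cluster size, so sparse clusters get drawn more often. It uses flat labels rather than a hierarchy, and `coverage_weights` and its `power` knob are my own illustrative choices, not Krea's method.

```python
from collections import Counter

def coverage_weights(cluster_ids, power=1.0):
    """Per-sample sampling weights that upweight sparse clusters.

    cluster_ids: cluster label for each sample (from any clustering step).
    power: 1.0 gives every cluster equal total weight;
           0.0 leaves the original distribution unchanged.
    """
    counts = Counter(cluster_ids)
    raw = [1.0 / (counts[c] ** power) for c in cluster_ids]
    total = sum(raw)
    return [w / total for w in raw]

# Toy corpus: eight cat images and two dinosaur images.
labels = ["cat"] * 8 + ["dinosaur"] * 2
weights = coverage_weights(labels)
```

With `power=1.0` the two dinosaur images together carry as much sampling weight as the eight cats. Intermediate values of `power` trade coverage against fidelity to the natural distribution.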

Speed and the production question

For most users, the benchmark that matters is generation speed. Krea 2 Turbo produces images in roughly 2 seconds. That's significantly faster than Midjourney's 3 to 6 seconds and roughly two orders of magnitude faster than GPT-Image-2's 200-plus seconds. FLUX.1 schnell is faster at 0.5 seconds, but it is also a much smaller model with lower aesthetic quality.

The speed comes from Trajectory Distribution Matching, a distillation technique that operates at the trajectory level rather than matching individual denoising steps. Turbo also runs without classifier-free guidance (guidance scale set to 0.0), which avoids the second, unconditional forward pass that guidance adds at every step and that roughly doubles per-step compute in standard diffusion models.
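
To see where the guidance saving comes from, here is a small sketch of classifier-free guidance and its cost. The combination formula is the standard one. The function names and the 30-step comparison are illustrative assumptions. Note that a scale of 0.0 plugged into the formula as written would return the unconditional prediction, so I'm assuming the pipeline treats that setting as "skip guidance entirely" and runs only the conditional pass.

```python
def guided_prediction(cond, uncond, scale):
    """Classifier-free guidance: uncond + scale * (cond - uncond), elementwise."""
    return [u + scale * (c - u) for c, u in zip(cond, uncond)]

def forward_passes(num_steps, use_guidance):
    """Model evaluations per image: two per step with guidance, one without."""
    return num_steps * (2 if use_guidance else 1)

# Turbo as described: 8 steps, no guidance.
turbo_cost = forward_passes(8, use_guidance=False)

# A hypothetical undistilled sampler: 30 steps with guidance (made-up figure).
guided_cost = forward_passes(30, use_guidance=True)
```

Under those assumptions the step count and the guidance pass compound: fewer steps from distillation, and half the model evaluations per step from dropping guidance.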

The license is pragmatic but not truly open source. It's free for individuals and companies under 50 seats. Larger organizations need an enterprise license. The real constraint is the safety mandate: all users, regardless of size, must implement technical safeguards against generating illegal content, non-consensual intimate imagery, and CSAM. Krea claims no IP rights over user-generated content.

This puts Krea in an awkward middle ground. It's more open than Midjourney or DALL-E, but less open than FLUX.1 schnell's Apache 2.0 license. For most individual developers and small studios, the practical difference is negligible. For enterprises evaluating compliance, the mandatory safety implementation adds a non-trivial integration step.

The bigger picture is what this release signals about the open image model ecosystem. FLUX proved that competitive open image models exist. Krea 2 is pushing on a different axis: not just releasing weights, but designing an entire workflow around how those weights should be used. The Raw-to-Turbo pipeline is a bet that the future of open image generation isn't about having the best single model, but about having the best training-to-inference pipeline.

Whether that bet pays off depends on whether the community actually adopts the two-checkpoint workflow, or just grabs Turbo and ignores Raw entirely. Given how much of the open model ecosystem is built around quick inference rather than careful fine-tuning, I suspect most users will never touch Raw at all. But for the studios and researchers who do fine-tune, having a base model explicitly designed for that purpose is a real advantage over hacking LoRAs onto models that weren't built for it.
