AI Found 10,000 Bugs Before Humans Could Fix Them

For years, the cybersecurity industry worried that AI would make it too easy to find vulnerabilities. What nobody predicted was that finding them would become the easy part and fixing them would stay brutally hard.

The 30-Day Report Card

Last month, Anthropic launched Project Glasswing, giving 50 partner organizations access to Claude Mythos Preview, an unreleased model with superhuman code analysis capabilities. The initial results are out and they paint a strange picture.

The headline number: Over 10,000 high- or critical-severity vulnerabilities found across partner systems and 1,000+ open-source projects in 30 days. Partners report bug-finding rates increased by more than 10x.

But here is where it gets uncomfortable. Of those 10,000+ findings, only 75 have actually been patched with official advisories. The rest are sitting in triage queues, waiting for human maintainers who are already overwhelmed.

Cloudflare, one of the partners, found 2,000 bugs (400 high/critical) with a false-positive rate that beat human testers. Mozilla used the model to fix 271 vulnerabilities in Firefox 150, a 10x increase from their previous efforts with Claude Opus 4.6. IBM joined the consortium and is integrating the findings into their IBM Concert remediation pipeline.

The project scanned over 1,000 open-source projects in total and identified 23,019 issues. Six independent security research firms assessed 1,752 of the high/critical findings. Over 90% were validated as true positives. This is not a model that hallucinates bugs. It finds real ones with high precision.

The Bugs That Were Hiding in Plain Sight

Some of the findings read like security archaeology. Mythos found a 27-year-old remote crash vulnerability in OpenBSD, code that had been running untouched since before most working developers were born. A 16-year-old flaw in FFmpeg affects global streaming infrastructure. In WolfSSL, a cryptography library used by billions of devices, the model constructed an exploit that let it forge TLS certificates with a CVSS score of 9.1 (CVE-2026-5194).

These are not speculative vulnerabilities. 90.6% of the 1,752 assessed findings were confirmed as true positives. The model found vulnerabilities in every major operating system and every major web browser. It also chained Linux kernel vulnerabilities together in ways that could give an attacker complete control of a machine.

The Bottleneck Nobody Planned For

Anthropic's own update makes this explicit: "The bottleneck in fixing bugs like these is the human capacity to triage, report, and design and deploy patches for them. Finding them in the first place has become vastly more straightforward with Mythos Preview."

The numbers back this up. A high- or critical-severity bug found by Mythos takes an average of two weeks to patch. Some open-source maintainers have started asking Anthropic to slow down the disclosure pace because they cannot handle the volume of AI-generated bug reports.

This is a genuinely new problem. The security industry spent decades trying to find vulnerabilities faster. Now that an AI can, the bottleneck shifted overnight from discovery to remediation. The New York Times called the model's capabilities a "terrifying warning sign," but what is actually terrifying is how unprepared the ecosystem is to handle the output.

In one case, a Glasswing partner bank used the model to detect and prevent a fraudulent $1.5 million wire transfer. A threat actor had breached a customer's email account and made spoof phone calls. The model caught it. That is the upside. The downside is that the same capability in the wrong hands does not need a patch cycle.

What the Community Thinks

The HN reaction to the initial Mythos announcement was skeptical. "I'm getting tired of hearing about how every new iteration is going to spell doom," one top comment read. Another pointed out the more practical angle: "The patches could have been written by humans, it doesn't matter that much."

But the tone shifted when the actual numbers came out. Cloudflare's engineering team published a detailed writeup noting that the jump from general-purpose frontier models to Mythos is "not just a refinement of what came before." The model does not just find more bugs, it constructs end-to-end attack chains and writes exploits, something previous models could not do reliably. Cloudflare set up an agentic scanning harness that ran autonomously across their repositories.

The UK's AI Safety Institute independently assessed Mythos's cyber capabilities as a "step up" over other frontier models. XBOW, an autonomous offensive security platform, called it "substantially better than prior models" at analyzing code with a security mindset.

Where This Leaves Us

Anthropic is not releasing Mythos to the public. It launched a Cyber Verification Program for security researchers and is expanding the Glasswing toolset with custom scanning harnesses and threat model builders. But the decision to keep the model locked down has reopened a debate that has never been resolved: if one company holds a capability that far exceeds what the rest of the industry has, does that make the ecosystem safer or more dependent?

What is clear is that the vulnerability discovery bottleneck has been solved. The patching bottleneck, the human bottleneck, the organizational bottleneck of getting fixes deployed across billions of devices has not. The industry is shifting toward shorter patch cycles, with Oracle moving to monthly updates and Microsoft warning that its expected patch volume will "continue trending larger."

But none of that keeps pace with a model that finds 10,000 bugs in a month. AI can find bugs faster than every security team on the planet combined. It cannot make their users update.

So What

The most interesting thing about Project Glasswing is not that an AI found 10,000 bugs. It is that we now know the bug-finding problem was never the hard part. The hard part was always the slow, boring, unglamorous work of convincing people to install updates. AI does not fix that. It just makes the gap more visible.

The 30-Day Report Card

The Bugs That Were Hiding in Plain Sight

The Bottleneck Nobody Planned For

What the Community Thinks

Where This Leaves Us

So What

RELATED_ENTRIES

That 27B model was too big for a phone. Not anymore.

$4.40 per million tokens just matched the $200 tier

AI coding costs hit $2,000 per engineer and budgets blew up