Finding Bugs Got Cheap — Trusting the Fix Is the New Hard Part

Your AI agent can read a codebase it's never seen and surface a dozen plausible vulnerabilities before you've finished your coffee. A year ago that was a conference demo. Today it's a git clone and an API key.
Which means finding the bugs is no longer the hard part. The hard part is knowing which ones are real, which fix actually holds, and whether you can trust a machine's answer enough to act on it. That's not a tooling problem. It's a governance problem, and it's yours.
Last week Anthropic open-sourced a reference implementation of what they call the defender's loop — the pattern their own security teams use to find and fix vulnerabilities. It's worth reading, and not because the code is magic; it's explicitly a reference, not a product, and they sell the managed version separately. It's worth reading because it's public evidence for something we've been saying for a while.
When discovery becomes a commodity, trust becomes the bottleneck.
The number that gives it away
Anthropic frames the work as a six-step loop: threat model, sandbox, discover, verify, triage, patch. Discovery is the part everyone demos. Point enough agents at a codebase and they'll find things.
Then look at the number they published. As of May 2026, their own scanning had disclosed 1,596 vulnerabilities in open-source software. The count patched, to their knowledge: 97.
That gap is the whole story: about 6 percent of the disclosed vulnerabilities, roughly one in sixteen, had a known patch. Discovery scaled. Verifying, triaging, convincing a maintainer, and shipping a fix that doesn't break production did not.
A vulnerability nobody trusts enough to fix is just expensive noise.
If your security program still measures itself by findings-per-scan, you're optimizing the one thing that just got easy.
A machine can't grade its own homework
The smartest decision in the whole harness isn't the scanner. It's the grader.
A separate agent re-runs every finding in a clean room. It sees only the proof of concept — never the finder's reasoning — and it's told to assume the finding is wrong until proven otherwise. One agent's job is to catch everything; the other's is to be right. Ask a single model to do both and it quietly talks itself out of its best work.
And the bar for "real" is refreshingly blunt: build the exploit and run it. Not "this looks exploitable." Run it.
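Here is a minimal sketch of that separation in Python. It assumes a proof of concept is a command line and that "reproduced" means a marker string appears on stdout. The names Finding, Verdict, verify, and triage, and the EXPLOIT-OK marker, are illustrative and not taken from Anthropic's harness. A temporary directory stands in for the clean room here; a real one would need an actual sandbox.

```python
import subprocess
import sys
import tempfile
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    title: str
    reasoning: str              # the finder's argument; never shown to the verifier
    poc_argv: tuple[str, ...]   # command that is supposed to demonstrate the bug


@dataclass(frozen=True)
class Verdict:
    confirmed: bool
    detail: str


# Hypothetical oracle: the PoC must print this marker to count as reproduced.
MARKER = "EXPLOIT-OK"


def verify(poc_argv: tuple[str, ...], timeout_s: float = 10.0) -> Verdict:
    """Run only the PoC in a scratch directory; reject unless it reproduces."""
    # NOTE: a temp directory is NOT isolation. It is a placeholder for a sandbox.
    with tempfile.TemporaryDirectory() as scratch:
        try:
            result = subprocess.run(
                list(poc_argv),
                cwd=scratch,
                capture_output=True,
                text=True,
                timeout=timeout_s,
            )
        except (OSError, subprocess.TimeoutExpired) as exc:
            return Verdict(False, f"could not run PoC: {exc!r}")
    if MARKER in result.stdout:
        return Verdict(True, "marker observed")
    # Default stance: the finding is wrong until the PoC proves otherwise.
    return Verdict(False, "PoC ran but did not reproduce")


def triage(findings):
    """Pass only the PoC across the boundary; the reasoning stays behind."""
    return [(f.title, verify(f.poc_argv)) for f in findings]


if __name__ == "__main__":
    demo = Finding(
        title="demo",
        reasoning="persuasive prose the verifier never reads",
        poc_argv=(sys.executable, "-c", "print('EXPLOIT-OK')"),
    )
    for title, verdict in triage([demo]):
        print(title, verdict)
```

The design choice worth copying is the function signature: verify accepts the PoC and nothing else, so a persuasive write-up has no channel through which to influence the verdict.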
A finding you can't reproduce is a rumor with a CVE number.
This was never really about C code
Strip away the memory bugs and the sandboxes, and you're left with a general recipe for a question every team is about to face: how do you trust the output of an autonomous agent at all?
The answers are the ones we keep arriving at from the other side of the table. Separate the thing that acts from the thing that checks. Make the check adversarial. End every claim in something you can run, not something that merely reads well. And give the agent the smallest set of capabilities that does the job — the harness's patch tool has no "apply" button on purpose, because a capability that doesn't exist can't be hijacked by an instruction an attacker hid in the code you're scanning.
That detail is the whole game. The agent reading your codebase is also reading whatever someone else left in it.
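One way to picture the missing "apply" button is a sketch under assumptions, not the harness's actual code. The agent's tools live in an explicit table, the patch tool only returns a diff for review, and any tool name outside the table fails closed. The names propose_patch, TOOLS, and dispatch are hypothetical.

```python
import difflib
from typing import Callable, Dict


def propose_patch(path: str, old: str, new: str) -> str:
    """Return a unified diff for human review; nothing is written to disk."""
    diff = difflib.unified_diff(
        old.splitlines(keepends=True),
        new.splitlines(keepends=True),
        fromfile=f"a/{path}",
        tofile=f"b/{path}",
    )
    return "".join(diff)


# The full capability set. There is deliberately no "apply_patch" entry.
TOOLS: Dict[str, Callable[..., str]] = {"propose_patch": propose_patch}


def dispatch(tool_name: str, **kwargs) -> str:
    """Route an agent's tool request; unknown names fail closed."""
    tool = TOOLS.get(tool_name)
    if tool is None:
        raise PermissionError(f"no such capability: {tool_name}")
    return tool(**kwargs)


if __name__ == "__main__":
    print(dispatch("propose_patch", path="x.c", old="a\n", new="b\n"))
    try:
        # An instruction injected into scanned code might ask for this.
        dispatch("apply_patch", path="x.c", diff="...")
    except PermissionError as exc:
        print("refused:", exc)
```

However cleverly an injected instruction is phrased, it can only ask for a tool by name, and the name it needs is not in the table.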
The safest capability is the one your agent was never given.
A model can have deep context on your code and none at all on you — your trust boundaries, your blast radius, what "an acceptable risk" even means here. That gap doesn't close with a smarter model. It closes with scaffolding around the model, which is exactly what turns an impressive demo into something you'd let near production.
What to do with this on Monday
If you're running agents against your own systems — and if you ship software in 2026, you are, or you're about to be — the move isn't "buy a scanner."
Autonomy that outpaces your ability to verify it is just a faster way to be wrong.
Read the harness and steal the shape. A threat model that writes down what you actually trust. An independent verifier that has to prove its case. An oracle you can run. The smallest capability set that gets the job done, and not one permission more. Then ask the question the demo skips: who owns the fix, and how do you know it worked?
That last mile — the scaffolding, not the scanner — is the part we spend our time on with teams putting agents into production. It's less exciting than a model that finds a thousand bugs. It's also the only part that lets you believe the thousand-and-first.
The teams that win this won't be the ones whose agents find the most. They'll be the ones who can trust what their agents hand back.