Skill Was Never the Threat — Scaffolding Is

July 2, 20265 min readAtypical Tech

ai-agents security scaffolding safe-autonomy

Illustration for Skill Was Never the Threat — Scaffolding Is

Everyone is worried about what the model can do. That's the wrong question.

A language model, in isolation, is remarkably inert. It reasons, plans, generates, synthesizes — all contained within a text-in, text-out boundary. It can't touch your database without a database client. It can't send an email without an email tool. It can't execute a plan without something to execute it with. The moment you hand it those capabilities, you've crossed a line — and that crossing happens entirely in your scaffolding, not inside the model.

Anthropic's red-team findings have been valuable and appropriately unsettling. When frontier models exhibit novel behaviors under adversarial conditions, that's worth taking seriously. But there's a category error lurking in how most teams respond to those findings: they treat model capability as the thing to contain. They spend energy evaluating the model, red-teaming the model, debating the model's values. The model becomes the object of security.

That's the misdirection.

A model's capabilities are the ceiling — your scaffolding decides how close to it you deploy.

The Harness Is the Attack Surface

When you put an agent in production, you build a harness around it. That harness is responsible for a remarkable set of things.

It gives the model tools — file system access, shell execution, API credentials, database connections. It manages memory — what context persists across sessions, what user data gets surfaced in the system prompt. It defines triggers — what events cause the agent to wake up and act, and it controls flow — what can be called, in what order, with what guardrails.

None of that lives inside the model. All of it is yours to design, deploy, and secure.

An adversary who successfully prompts a model into "wanting" to exfiltrate data has accomplished nothing if the scaffolding doesn't give that model access to data worth exfiltrating. An adversary who compromises the scaffolding's memory layer — poisoning the persistent context the model reads on every invocation — has potentially influenced every downstream action without ever touching the model itself.

Prompt injection isn't really about the prompt — it's about what happens when the scaffolding trusts poisoned input.

The security conversation around AI agents has been dominated by prompt injection because it's visible and dramatic. But the deeper issue is architectural. Most scaffolding gets built capability-first and security-second, because that's how you ship features. You add a tool when you need it, wire in a memory layer when context gets unwieldy, hook up a trigger when a workflow demands it. Nobody pauses to ask what the harness looks like from the outside — until something goes wrong.

Where ROBOT Places the Boundary

Atypical Tech's Safe Autonomy framework — ROBOT — exists specifically because this distinction matters. Two of its five elements, Boundaries and Trust, are almost entirely scaffolding concerns. Not model concerns. Scaffolding concerns.

Boundaries define what an agent is allowed to do: which tools it can invoke, which resources it can access, what actions require human approval before proceeding. These aren't properties of the model. They're constraints you encode in the scaffolding — through permission systems, tool configurations, policy layers, and human-in-the-loop hooks you actually build and maintain.

Trust defines what input the agent should treat as authoritative. A model doesn't inherently know whether a retrieved document is trustworthy context or an adversarial injection designed to redirect its behavior. The scaffolding's retrieval architecture, its input validation, its source attribution logic — that's where trust decisions get made or neglected.

You can't ROBOT-harden a model — you ROBOT-harden the scaffolding that surrounds it.

When you audit an agentic deployment for safety, the right question isn't "Is this model aligned?" It's "Does this scaffolding encode the right Boundaries? Does it establish Trust at every input boundary? Does it minimize blast radius when something inevitably goes sideways?" The model is a participant in those answers, not the answer itself.

What This Changes

If you accept that the scaffolding is the attack surface, a few things shift immediately.

Security reviews for AI systems can't stop at model evaluation. They have to trace the entire harness: every tool definition, every memory layer, every trigger, every API credential the agent can reach. A model can behave impeccably in isolation and a deployment can still be catastrophically unsafe — because the scaffolding handed it keys it never should have had.

Vendor risk assessments change too. When you adopt a third-party agentic framework, you're not just adopting a model host — you're inheriting scaffolding decisions someone else made under their own constraints and priorities. What tools does it expose by default? How does it handle memory persistence? What's the default trust posture for external input? These are the questions that should drive your evaluation, not the benchmark scores.

And skill design turns out to matter more than it might seem at first. A skill — a discrete, versioned capability you give an agent — is a contract. It should declare what it does, what it touches, and what access it needs. The narrower that contract, the smaller the surface the scaffolding has to defend.

Skills should be small and honest; scaffolding should be explicit and paranoid.

The least safe agent architecture is the one where capability grew organically — tools added as needed, memory layered in, triggers wired ad hoc — and nobody ever stepped back to ask what the whole harness looked like from the outside. That's how you end up with a system that behaves predictably in demos and dangerously in production.

The model was never your biggest problem. The question was always what you built around it.

The 97% Number: What Happened When We Automated Security Triage

We automated security alert triage to a 97% noise-reduction rate — then a suppression rule went stale and nearly let a production vulnerability through unnoticed.

Why We Failed Our Agent-Readiness Audit on Purpose

An automated audit gave us nine recommendations for being 'agent-ready.' We shipped three and deliberately failed the other six — because a security firm's agent-readiness is measured by signal honesty, not checkbox coverage.

Project Glasswing: AI Finds Zero-Days Faster Than Humans Can Patch Them

Anthropic's Project Glasswing deployed Claude Mythos Preview to autonomously discover thousands of zero-days with a 72.4% exploit success rate. Less than 1% of findings have been patched. The bottleneck is no longer discovery — it's everything that comes after.