
AI agent security: the threat model for agentic systems

By Wibo · Amsterdam · Published 26 Jun 2026 · Last updated 26 Jun 2026 · ~8 min read

Short answer

AI agent security is the practice of designing an agentic system so that its ability to act through tools, on input that may be untrusted, cannot be turned against you. Agents add a new attack surface over a plain chatbot: they read external content, decide for themselves, and call tools that change real state. The dominant risks (prompt injection, direct and indirect; excessive agency; insecure tool use; sensitive-data exposure; and supply chain) are catalogued in the OWASP Top 10 for LLM Applications, and you mitigate them with least-privilege tools, bounded loops, human-in-the-loop on high-impact actions, and disciplined input/output handling.

This page is the overview — the whole threat model. For the specific attack, see prompt injection; for mitigation tooling, see AI guardrails; for the bounded loop that contains excessive agency, see the bounded agentic loop.

Why agents are a new attack surface

A chatbot returns text. An agent takes actions: it queries databases, calls APIs, writes files, sends messages — often on content it fetched from the open web or a shared inbox. Three properties combine to make this dangerous in a way a single completion never is.

It acts. The model's output is no longer just shown to a person; it drives tool calls that change state. A bad decision has consequences, not just a wrong answer.
It reads untrusted input. Anything the agent ingests — a web page, a PDF, an email, a tool result — can carry instructions. To the model, retrieved content and your system prompt are the same kind of text.
It decides for itself. The agent chooses the next step in a loop. Autonomy is the feature; it's also what lets a single hijacked instruction cascade into many actions.

Simon Willison calls the worst combination the lethal trifecta: an agent that has access to private data, is exposed to untrusted content, and can communicate externally. Hold all three and prompt injection becomes data exfiltration. Remove any one leg — scope the data, sandbox the input, or cut the outbound channel — and the attack path collapses. That framing is the fastest way to audit a design: which legs does this agent hold at once?
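The audit question above can be written down as a small check. This is a minimal sketch, not a tool from Willison's post: the `AgentCapabilities` class and its three flags are hypothetical names chosen for illustration, and deciding whether each flag is true for a real deployment is the actual work.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AgentCapabilities:
    """Hypothetical description of what one agent deployment can touch."""
    reads_private_data: bool          # e.g. inbox, CRM, internal documents
    ingests_untrusted_content: bool   # e.g. web pages, inbound email, tool results
    can_communicate_externally: bool  # e.g. HTTP requests, outbound email


def trifecta_legs(caps: AgentCapabilities) -> list[str]:
    """Return the names of the trifecta legs this agent holds."""
    legs = []
    if caps.reads_private_data:
        legs.append("private data")
    if caps.ingests_untrusted_content:
        legs.append("untrusted content")
    if caps.can_communicate_externally:
        legs.append("external communication")
    return legs


def holds_lethal_trifecta(caps: AgentCapabilities) -> bool:
    """True only when all three legs are held at once."""
    return len(trifecta_legs(caps)) == 3


# An inbox assistant that can also send mail holds all three legs.
inbox_agent = AgentCapabilities(True, True, True)
print(trifecta_legs(inbox_agent), holds_lethal_trifecta(inbox_agent))

# Cutting the outbound channel removes one leg.
summariser = AgentCapabilities(True, True, False)
print(trifecta_legs(summariser), holds_lethal_trifecta(summariser))
```

Running a check like this per deployment, rather than per model, keeps the question tied to the concrete tools and credentials an agent has been given.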

The agent threat model (OWASP LLM Top 10)

The OWASP Top 10 for LLM Applications (2025) is the shared vocabulary for these risks. Not all ten are agent-specific, but the ones below are where agentic systems are won or lost. Map each threat to a mitigation you've actually built — design for them, don't patch them on.

Threat (OWASP LLM)	What it means for an agent	Primary mitigation
Prompt injection (LLM01)	Untrusted input overrides the agent's instructions — directly from a user, or indirectly via content the agent later ingests (a page, doc, or tool result).	Treat all model input as untrusted; isolate untrusted content; constrain tools. See prompt injection.
Excessive agency (LLM06)	The agent can do more than the task needs — too many tools, too-broad permissions, or acting on high-impact operations without approval.	Least-privilege tools, scoped credentials, human-in-the-loop, a bounded loop.
Sensitive information disclosure (LLM02)	Secrets, PII, or another user's data leak into the context window and out through a response or tool call.	Minimise data in context, scope retrieval per user, redact, enforce auth downstream.
Improper output handling (LLM05)	Model output is passed unchecked into a shell, SQL query, browser, or another system — turning a suggestion into code execution.	Never trust model output as a command; validate, encode, and sandbox before it reaches a sink.
Supply chain (LLM03)	A compromised model, dependency, or third-party tool — including an MCP server — gives an attacker a foothold inside the agent.	Vet and pin models, tools, and MCP servers; least privilege for every integration.
Unbounded consumption (LLM10)	A runaway loop or adversarial input drives unbounded tool calls, token spend, or downstream load.	Turn and tool-call budgets, rate limits, cost caps, and escalation. See guardrails.

OWASP's list is descriptive, not a checklist you complete once. The point is coverage: for every action your agent can take, you should be able to name which of these threats it exposes and which control answers it.
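One way to make that coverage test concrete is to keep an inventory of actions, threats, and controls, and to flag any pair with no control. The sketch below assumes a hypothetical inventory: the action names, the threat assignments, and the control descriptions are all illustrative and not taken from OWASP.

```python
# Hypothetical inventory: each action the agent can take, the OWASP LLM
# threats it exposes, and the controls actually built for each threat.
ACTIONS = {
    "fetch_url": {
        "LLM01": ["fetched content kept out of the instruction channel"],
        "LLM10": ["per-run request budget"],
    },
    "run_sql": {
        "LLM05": ["parameterised queries"],
        "LLM02": [],  # no control built yet
    },
}


def uncovered(actions: dict[str, dict[str, list[str]]]) -> list[tuple[str, str]]:
    """Return (action, threat) pairs that have no control listed."""
    gaps = []
    for action, threats in actions.items():
        for threat, controls in threats.items():
            if not controls:
                gaps.append((action, threat))
    return sorted(gaps)


print(uncovered(ACTIONS))  # [('run_sql', 'LLM02')]
```

A check like this only proves that a control has been named, not that it works, so it complements testing rather than replacing it.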

Excessive agency and insecure tool use

Excessive agency (LLM06) is the risk most specific to agents, because it only exists once a system can act. OWASP breaks it into three root causes, and each maps to a concrete design lever:

Excessive functionality — the agent has access to tools or capabilities the task never needs (a leftover delete endpoint, a shell it rarely uses). Fix: expose the minimum set of tools, scoped to the job.
Excessive permissions — a tool's credential is broader than required: a read task holding write access, a token scoped to a whole tenant instead of one user. Fix: least-privilege, per-user credentials, and enforce authorisation in the downstream system, not in the prompt.
Excessive autonomy — the agent executes high-impact actions without a human in the loop. Fix: require approval for irreversible or sensitive operations; let the agent draft, a person commit.
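The three levers above can be sketched as three small checks around a tool registry. Everything here is an assumption made for illustration: the `Tool` shape, the scope strings such as `invoices:read`, and the `approved_by` parameter are hypothetical, and a production system would enforce scopes in the credential and the downstream service, not only in this wrapper.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class Tool:
    name: str
    run: Callable[[dict], str]
    scopes: frozenset          # permissions the tool's credential carries
    high_impact: bool = False  # irreversible or sensitive operation


class ApprovalRequired(Exception):
    """Raised when a high-impact call has no human approval."""


def select_tools(catalogue: list[Tool], needed: set[str]) -> dict[str, Tool]:
    """Excessive functionality: expose only the tools the task names."""
    return {t.name: t for t in catalogue if t.name in needed}


def check_scopes(tool: Tool, allowed: set[str]) -> None:
    """Excessive permissions: reject a credential broader than the task allows."""
    extra = tool.scopes - allowed
    if extra:
        raise PermissionError(f"{tool.name} holds unneeded scopes: {sorted(extra)}")


def call_tool(tool: Tool, args: dict, approved_by: Optional[str] = None) -> str:
    """Excessive autonomy: a high-impact tool needs a named approver."""
    if tool.high_impact and not approved_by:
        raise ApprovalRequired(tool.name)
    return tool.run(args)


catalogue = [
    Tool("read_invoice", lambda a: f"invoice {a['id']}", frozenset({"invoices:read"})),
    Tool("delete_invoice", lambda a: "deleted", frozenset({"invoices:write"}),
         high_impact=True),
]

# A read-only task sees one tool, with a credential that matches the task.
tools = select_tools(catalogue, needed={"read_invoice"})
check_scopes(tools["read_invoice"], allowed={"invoices:read"})
print(call_tool(tools["read_invoice"], {"id": 7}))  # invoice 7
```

The delete tool never reaches the agent in this task, and even when it is exposed, `call_tool` refuses to run it until a person has been recorded as the approver.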

Insecure tool use is the same problem from the tool's side. A tool is an attack surface: its inputs come from a model that may be under an attacker's influence, so it must validate arguments, refuse out-of-scope requests, and fail closed. The cardinal rule is that the model is not a security boundary — never rely on instructions in the prompt to keep an agent in line when the credential itself could simply be narrower. The MCP security guidance says the same for standardised tools: fine-grained access, mutual authentication, and sanitised tool results.
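As an illustration of a tool that validates its arguments and fails closed, here is a minimal sketch of a hypothetical `read_file` tool. The workspace root, the argument shape, and the result format are assumptions; the check is purely lexical, so a real tool would also need to resolve symlinks and rely on operating-system permissions.

```python
from pathlib import PurePosixPath

ALLOWED_ROOT = PurePosixPath("/srv/agent-workspace")  # hypothetical sandbox root


class ToolRefused(Exception):
    """The tool declined the request; the agent gets no partial result."""


def validate_read_args(args: object) -> PurePosixPath:
    """Validate model-supplied arguments for a hypothetical read_file tool."""
    if not isinstance(args, dict) or set(args) != {"path"}:
        raise ToolRefused("expected exactly one argument: path")
    raw = args["path"]
    if not isinstance(raw, str) or not raw or "\x00" in raw:
        raise ToolRefused("path must be a non-empty string")
    path = PurePosixPath(raw)
    if not path.is_absolute() or ".." in path.parts:
        raise ToolRefused("path must be absolute with no '..' segments")
    if ALLOWED_ROOT not in path.parents:
        raise ToolRefused("path is outside the workspace")
    return path


def read_file_tool(args: object) -> dict:
    """Fail closed: any validation error yields a refusal, never a best guess."""
    try:
        path = validate_read_args(args)
    except ToolRefused as exc:
        return {"ok": False, "error": str(exc)}
    # A real tool would open and read the file here.
    return {"ok": True, "path": str(path)}


print(read_file_tool({"path": "/srv/agent-workspace/notes.txt"}))
print(read_file_tool({"path": "/srv/agent-workspace/../etc/passwd"}))
print(read_file_tool({"path": "/etc/passwd"}))
```

The validator rejects anything it does not recognise, including extra arguments, instead of trying to repair the request, which is the behaviour you want when the caller may be acting on an attacker's instructions.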

Designing a secure agent (principles)

Security for agents is architectural, not a filter you bolt on at the end. Five principles cover most of the surface above:

Least-privilege tools. Each tool exposes the minimum capability, with the narrowest credential, scoped per user or session. A confused or hijacked agent can't reach further than its tools allow — so make the tools small.
Bounded loops. Cap turns and tool calls with a budget and an escalation path, so a hijacked or looping agent stops instead of running away. This is the structural answer to excessive agency and unbounded consumption — detailed in the bounded agentic loop.
Human-in-the-loop. Put a person on the high-impact, irreversible actions — sending money, deleting data, emailing customers. The agent proposes; a human approves. Autonomy is graduated, not all-or-nothing.
Input and output handling. Treat everything the model reads as untrusted and everything it emits as unvalidated. Keep untrusted content out of the instruction channel where you can, and never pass output into a shell, query, or browser without validation. This is where prompt injection is contained and guardrails earn their keep.
Isolation. Sandbox tool execution, separate per-user state and credentials, and break the lethal trifecta by design — so a single compromise can't read private data and exfiltrate it in the same breath.
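The bounded-loop principle in the list above is the easiest of the five to show in code. The sketch below is an assumption-laden illustration, not the platform's own implementation: `step` stands in for one model turn, and the `Budget` and `Outcome` types are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Budget:
    max_turns: int
    max_tool_calls: int


@dataclass
class Outcome:
    status: str                  # "done" or "escalated"
    turns: int
    tool_calls: int
    result: Optional[str] = None


def run_bounded(step: Callable[[int], dict], budget: Budget) -> Outcome:
    """Run a hypothetical agent step function until it finishes or a budget trips.

    `step(turn)` stands in for one model turn. It returns either
    {"final": text} or {"tool_calls": n} for the tools it wants to run.
    """
    tool_calls = 0
    for turn in range(1, budget.max_turns + 1):
        action = step(turn)
        if "final" in action:
            return Outcome("done", turn, tool_calls, action["final"])
        requested = action.get("tool_calls", 0)
        if tool_calls + requested > budget.max_tool_calls:
            # Stop before running anything over budget and hand off to a person.
            return Outcome("escalated", turn, tool_calls)
        tool_calls += requested
    # Turn budget exhausted without a final answer.
    return Outcome("escalated", budget.max_turns, tool_calls)


def looping(turn: int) -> dict:
    return {"tool_calls": 2}  # never produces a final answer


def finishing(turn: int) -> dict:
    return {"final": "ok"} if turn == 3 else {"tool_calls": 1}


print(run_bounded(looping, Budget(max_turns=10, max_tool_calls=5)))
print(run_bounded(finishing, Budget(max_turns=10, max_tool_calls=5)))
```

The looping agent is stopped on its third turn with four tool calls spent, because the next two would exceed the budget of five. The escalation outcome is a normal return value here, not an exception, so the caller is forced to handle it.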

None of these is exotic. They're the operational plane of a production agentic architecture — the layer that separates a demo from something you can safely leave running. Anthropic frames the same goal in its framework for safe and trustworthy agents: keep humans in control, scope what the agent can touch, and make its behaviour observable.

Frequently asked questions

What is AI agent security?

AI agent security is the practice of designing an agentic AI system — one that takes actions via tools, often on untrusted input — so those actions cannot be subverted. It covers the threats agents add over a plain chatbot (prompt injection, excessive agency, insecure tool use, data exposure, supply chain) and the controls that contain them: least-privilege tools, bounded loops, human-in-the-loop, and disciplined input/output handling.

What are the main security risks of AI agents?

The agent-relevant entries in the OWASP Top 10 for LLM Applications: prompt injection (LLM01, including indirect injection from ingested content), excessive agency (LLM06), sensitive information disclosure (LLM02), improper output handling (LLM05), supply chain (LLM03), and unbounded consumption (LLM10). Prompt injection plus excessive agency is the combination that turns a model error into a real-world action.

What is excessive agency?

Excessive agency (OWASP LLM06) is the risk that an agent can do more than its task requires, so manipulated or ambiguous output causes damaging actions. OWASP names three root causes: excessive functionality (too many tools), excessive permissions (credentials broader than needed), and excessive autonomy (acting on high-impact operations without human approval). You reduce it with least-privilege tools, scoped credentials, and human-in-the-loop on sensitive actions.

How do you secure an AI agent?

Design for it across the whole system: give each tool the minimum capability and a narrow, per-user credential; bound the loop with turn and tool-call budgets; require human approval for irreversible actions; treat all model input as untrusted and all output as unvalidated before it reaches a sink; and isolate execution and per-user state. Enforce authorisation in the downstream system, never in the prompt — the model is not a security boundary.
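To make "unvalidated before it reaches a sink" concrete, here is a minimal sketch using Python's standard `sqlite3` module. The `invoices` table, the allow-list of statuses, and the `find_invoices` helper are hypothetical; the point is that the model's suggestion is checked against an allow-list and then bound as a parameter, and that the user identity comes from the session, not from the model.

```python
import sqlite3

ALLOWED_STATUSES = {"open", "paid", "overdue"}  # hypothetical domain values


def find_invoices(conn: sqlite3.Connection, user_id: int, model_status: object) -> list[tuple]:
    """Use a model-suggested filter without letting it become SQL."""
    if not isinstance(model_status, str):
        raise ValueError("status must be a string")
    status = model_status.strip().lower()
    if status not in ALLOWED_STATUSES:
        raise ValueError("status not in allow-list")
    # user_id comes from the authenticated session, not from the model.
    rows = conn.execute(
        "SELECT id, status FROM invoices WHERE user_id = ? AND status = ? ORDER BY id",
        (user_id, status),
    )
    return rows.fetchall()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE invoices (id INTEGER, user_id INTEGER, status TEXT)")
conn.executemany(
    "INSERT INTO invoices VALUES (?, ?, ?)",
    [(1, 42, "open"), (2, 42, "paid"), (3, 7, "open")],
)

print(find_invoices(conn, 42, "Open"))  # [(1, 'open')]
```

An injected value such as `open' OR '1'='1` is rejected by the allow-list, and even if it were not, the bound parameter would be compared as a literal string instead of being executed as SQL.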

What is the OWASP Top 10 for LLMs?

The OWASP Top 10 for LLM Applications is an open, community-maintained list of the most critical security risks in applications built on large language models, published by the OWASP Gen AI Security Project. The 2025 edition ranks prompt injection first and includes excessive agency, sensitive information disclosure, improper output handling, supply chain, and unbounded consumption — the shared vocabulary for the agent threat model.

How do you limit what an agent can do?

Through least privilege and bounded autonomy. Expose only the tools the task needs, scope each tool's credential to the narrowest permission and to a single user or session, and enforce that authorisation in the tool and downstream system rather than trusting the prompt. Cap the loop with a turn and tool-call budget, and gate high-impact or irreversible actions behind human approval so the agent drafts but a person commits.

Sources & provenance

OWASP Top 10 for LLM Applications (2025), OWASP Gen AI Security Project. Risk list and IDs: genai.owasp.org/llm-top-10. Entries cited individually: prompt injection (LLM01) and excessive agency (LLM06).
Simon Willison, "The lethal trifecta for AI agents" (16 Jun 2025): simonwillison.net.
Anthropic, "Our framework for developing safe and trustworthy agents": anthropic.com.
Model Context Protocol — Security Best Practices: modelcontextprotocol.io.
Design principles synthesised with AI Architect Academy's curriculum (Track B, agentic systems) and the platform's own bounded-loop build (src/lib/coach.ts).

Threat taxonomies and vendor guidance evolve; verify against the live OWASP and vendor docs before building. Corrections: hello@aiarch.dev.
