The previous part argued that agentic AI should act freely inside a defined boundary. This part is about the machinery that makes the boundary real. A boundary that exists only as an instruction in a prompt — "do not transfer more than this amount", "do not touch production data" — is not a boundary; it is a suggestion, and agentic systems, like any complex software, do not always follow suggestions. Real boundaries are enforced by the system's architecture: by what the agent is technically able to do, not what it is told to do. The discipline is the same one security engineering has taught for decades, applied to a new kind of actor.
Permissions, not prompts
The cardinal rule of agentic safety is that constraints must be enforced at the level of permissions and capabilities, not at the level of instructions. If an agent must not move more than a certain sum, the system that moves money must reject larger transfers — regardless of what the agent attempts. If an agent must not access certain data, the data must be inaccessible to it, not merely off-limits by request. The reason is simple and hard-won: instructions can be misunderstood, circumvented, or overridden by the very flexibility that makes agents useful, while a permission the agent does not hold is a thing the agent simply cannot do.
If the only thing stopping your agent from doing harm is that you asked it not to, you have no control at all.
This reframing changes where you invest. Rather than perfecting the agent's instructions, you invest in the guardrails around its tools — the access controls, limits, and validations that hold no matter how the agent behaves. The agent operates inside a sandbox whose walls it cannot talk its way through.
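A minimal sketch of the idea, assuming a hypothetical `TransferService` that stands in for whatever system actually moves money. The names `TransferService`, `TransferRejected`, and `agent_attempt`, and the account identifier, are illustrative assumptions rather than part of any real API. The cap is checked inside the service, so it applies however the caller was prompted.

```python
from dataclasses import dataclass


class TransferRejected(Exception):
    """Raised when a transfer request falls outside the enforced limits."""


@dataclass
class TransferService:
    """Hypothetical money-moving service; the cap lives here, not in a prompt."""
    max_amount: float
    balance: float

    def transfer(self, amount: float, destination: str) -> str:
        # These checks run on every call, whatever the caller intended.
        if amount <= 0:
            raise TransferRejected("amount must be positive")
        if amount > self.max_amount:
            raise TransferRejected(f"amount {amount} exceeds cap {self.max_amount}")
        if amount > self.balance:
            raise TransferRejected("insufficient funds")
        self.balance -= amount
        return f"sent {amount} to {destination}"


def agent_attempt(service: TransferService, amount: float) -> str:
    """Stand-in for an agent's tool call; it can only go through the service."""
    try:
        return service.transfer(amount, "acct-123")  # hypothetical account id
    except TransferRejected as exc:
        return f"rejected: {exc}"


service = TransferService(max_amount=1000.0, balance=5000.0)
print(agent_attempt(service, 250.0))     # within the cap, so it goes through
print(agent_attempt(service, 50000.0))   # over the cap, so the service refuses
```

Nothing in the agent's instructions appears in this code, which is the point: the refusal does not depend on them.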
Least privilege for agents
The foundational principle is least privilege: an agent should hold exactly the permissions its task requires, and no more. This is not new — it is the bedrock of access control — but agents make it both more important and more often neglected, because the temptation to grant broad access "so the agent can be flexible" is strong, and flexibility is the agent's selling point. Resist it. Every permission an agent holds is a permission it can misuse, and an agent with broad standing access is an agent with a broad blast radius. Least privilege confines the damage any single error can do.
Least privilege for agents has several practical facets:
- Scoped tools. Give the agent access only to the specific tools its task needs, each itself scoped to the minimum it requires. A tool that reads records need not also be able to delete them.
- Scoped data. Confine the agent's data access to what the task genuinely requires, ideally to the specific records in play rather than whole datasets.
- Time-bounded and task-bounded grants. Where possible, grant permissions for the duration and scope of a specific task rather than as standing entitlements, so the agent is not perpetually capable of actions it only occasionally needs.
- Separation of duties. Require high-consequence actions to involve more than one actor or step, so that no single agent action is sufficient to cause serious harm.
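The scoped and time-bounded facets above can be sketched as a grant object that is checked before every tool call. This is an illustration under stated assumptions: `Grant` and `issue_grant` are hypothetical names, and a real system would more likely express the same idea through its existing identity and access management layer.

```python
import time
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Grant:
    """Hypothetical permission grant tied to one task, one tool, and a deadline."""
    task_id: str
    tool: str
    actions: frozenset      # e.g. {"read"}; "delete" is absent unless required
    record_ids: frozenset   # the specific records in play, not the whole dataset
    expires_at: float       # epoch seconds after which the grant is void

    def allows(self, tool: str, action: str, record_id: str,
               now: Optional[float] = None) -> bool:
        now = time.time() if now is None else now
        return (
            now < self.expires_at
            and tool == self.tool
            and action in self.actions
            and record_id in self.record_ids
        )


def issue_grant(task_id, tool, actions, record_ids, ttl_seconds, now=None) -> Grant:
    """Create a grant that lasts only as long as the task is expected to take."""
    now = time.time() if now is None else now
    return Grant(task_id, tool, frozenset(actions), frozenset(record_ids),
                 now + ttl_seconds)


grant = issue_grant("task-1", "records", {"read"}, {"r1", "r2"}, ttl_seconds=60)
print(grant.allows("records", "read", "r1"))     # in scope while the grant is live
print(grant.allows("records", "delete", "r1"))   # action was never granted
print(grant.allows("records", "read", "r9"))     # record is outside the task
```

The enforcement point that calls `allows` has to sit outside the agent, in the tool or the gateway in front of it; a check the agent runs on itself would be an instruction again.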
Tool design as a safety surface
The tools you give an agent are not neutral conduits; they are a primary place to build safety in. A well-designed tool enforces its own constraints, validates its inputs, and refuses unsafe requests, so that safety does not depend on the agent using it correctly. Several principles help:
- Validate at the tool boundary. Every tool should check that what it is being asked to do is within limits — magnitude, scope, format — and reject anything that is not, returning a clear error rather than acting.
- Make limits intrinsic. Caps on value, volume, and rate should live in the tool or the system behind it, where the agent cannot exceed them, not in guidance the agent is trusted to honour.
- Prefer reversible and staged operations. Where a tool performs something consequential, design it to be reversible, or to stage the action for confirmation, rather than executing irreversibly in one shot.
- Log every invocation. Each tool call — what was requested, what was permitted, what happened — is a line in the audit trail, and a well-instrumented tool produces this automatically.
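As a rough illustration of the first, second, and fourth principles together, the sketch below shows a hypothetical `RefundTool` that validates each request, keeps its caps inside the tool, and records every call. The tool name, the limits, and the log format are assumptions made for the example.

```python
import json
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class RefundTool:
    """Hypothetical refund tool that validates and logs every invocation."""
    max_refund: float = 200.0                   # intrinsic cap on magnitude
    allowed_currencies: tuple = ("GBP", "EUR")  # intrinsic limit on scope
    audit_log: list = field(default_factory=list)

    def _validate(self, amount, currency) -> Optional[str]:
        """Return a refusal reason, or None if the request is within limits."""
        if isinstance(amount, bool) or not isinstance(amount, (int, float)):
            return "amount must be a number"
        if amount <= 0:
            return "amount must be positive"
        if amount > self.max_refund:
            return f"amount exceeds cap of {self.max_refund}"
        if currency not in self.allowed_currencies:
            return f"currency {currency!r} not permitted"
        return None

    def invoke(self, amount, currency) -> dict:
        reason = self._validate(amount, currency)
        permitted = reason is None
        outcome = "refund issued" if permitted else f"refused: {reason}"
        # One audit line per call: what was requested, permitted, and done.
        self.audit_log.append(json.dumps({
            "requested": {"amount": amount, "currency": currency},
            "permitted": permitted,
            "outcome": outcome,
        }, default=str))
        return {"ok": permitted, "outcome": outcome}


tool = RefundTool()
print(tool.invoke(50, "GBP"))
print(tool.invoke(5000, "GBP"))
print(len(tool.audit_log))
```

Refused calls are logged as well as permitted ones, since the attempts an agent was not allowed to make are often the most useful part of the audit trail.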
Checkpoints for consequential actions
Not every action an agent might take should be within its autonomous authority. The boundary between "the agent may do this alone" and "this requires human approval" should be drawn explicitly, driven by consequence and reversibility. High-value, high-impact, and irreversible actions cross the line and route to a human checkpoint, where the agent's intent, ideally in the form of the visible plan described in the previous part, is reviewed before execution. The art, as with all human oversight (Part 10), is to place checkpoints where they matter most rather than everywhere, so that human attention is reserved for the actions that genuinely warrant it and is not exhausted on routine ones.
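One way to make that boundary explicit is a small routing policy evaluated before any action runs. The sketch below is illustrative only: `ProposedAction`, `CheckpointPolicy`, the value threshold, and the `approved_by` parameter are assumptions, and a real deployment would also need to authenticate the approver.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Route(Enum):
    AUTONOMOUS = "autonomous"
    HUMAN_APPROVAL = "human_approval"


@dataclass(frozen=True)
class ProposedAction:
    name: str
    value: float       # monetary or impact estimate; units are an assumption
    reversible: bool
    plan: str          # the agent's stated intent, shown to the reviewer


@dataclass(frozen=True)
class CheckpointPolicy:
    value_threshold: float = 1_000.0   # hypothetical cut-off

    def route(self, action: ProposedAction) -> Route:
        # Irreversible or high-value actions always go to a human.
        if not action.reversible or action.value >= self.value_threshold:
            return Route.HUMAN_APPROVAL
        return Route.AUTONOMOUS


def execute(action: ProposedAction, policy: CheckpointPolicy,
            approved_by: Optional[str] = None) -> str:
    """Run the action only if it is autonomous or a human has approved it."""
    if policy.route(action) is Route.HUMAN_APPROVAL and approved_by is None:
        return f"held for review: {action.name} ({action.plan})"
    return f"executed: {action.name}"


policy = CheckpointPolicy()
routine = ProposedAction("update_note", 0.0, True, "correct a typo in a note")
risky = ProposedAction("close_account", 0.0, False, "close a dormant account")
print(execute(routine, policy))
print(execute(risky, policy))
print(execute(risky, policy, approved_by="owner@example.com"))
```

Keeping the threshold in a policy object, rather than scattered through the tools, makes it easier to tune where checkpoints fall without touching the tools themselves.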
Engineering for survivable failure
Even with least privilege, scoped tools, and checkpoints, you should assume errors will occur and engineer so that they are survivable. This turns the blast-radius ideas from the previous part into concrete engineering:
- Circuit-breakers that halt an agent when it exceeds expected bounds — too many actions in a window, repeated failures, anomalous patterns — without waiting for a human to notice.
- Isolation so that an agent operating in one domain cannot reach into others; a failure stays contained where it started.
- Reliable stop mechanisms — a kill switch that immediately halts the agent and that the accountable owner has both the authority and the practical ability to use.
- Recovery paths that let you reverse or remediate an agent's actions, designed in advance rather than improvised during an incident.
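The first and third mechanisms above can be sketched as a single guard that every agent action must pass through. This is a simplified illustration: the `CircuitBreaker` class, its thresholds, and the `kill` method are hypothetical, and anomaly detection beyond rate and failure counts is left out.

```python
import time
from collections import deque
from typing import Optional


class AgentHalted(Exception):
    """Raised when the breaker has tripped or the owner has stopped the agent."""


class CircuitBreaker:
    def __init__(self, max_actions: int, window_seconds: float,
                 max_consecutive_failures: int):
        self.max_actions = max_actions
        self.window_seconds = window_seconds
        self.max_consecutive_failures = max_consecutive_failures
        self._timestamps = deque()
        self._consecutive_failures = 0
        self.halted_reason: Optional[str] = None

    def kill(self, reason: str = "stopped by owner") -> None:
        """Kill switch: takes effect on the very next attempted action."""
        self.halted_reason = reason

    def before_action(self, now: Optional[float] = None) -> None:
        """Call before every agent action; raises if the agent must stop."""
        now = time.monotonic() if now is None else now
        if self.halted_reason is not None:
            raise AgentHalted(self.halted_reason)
        # Forget actions that have aged out of the sliding window.
        while self._timestamps and now - self._timestamps[0] >= self.window_seconds:
            self._timestamps.popleft()
        if len(self._timestamps) >= self.max_actions:
            self.halted_reason = "too many actions in window"
            raise AgentHalted(self.halted_reason)
        self._timestamps.append(now)

    def record_result(self, success: bool) -> None:
        """Call after every action; repeated failures trip the breaker."""
        if success:
            self._consecutive_failures = 0
            return
        self._consecutive_failures += 1
        if self._consecutive_failures >= self.max_consecutive_failures:
            self.halted_reason = "repeated failures"
```

Once tripped, this breaker stays tripped. There is deliberately no reset method in the sketch, on the assumption that resuming a halted agent should be a decision for the accountable owner rather than something the agent can trigger.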
Security and agents
One final dimension: an agent with permissions is a target. An attacker who can influence an agent — by manipulating its inputs, poisoning the data it reads, or hijacking its instructions — gains the agent's permissions and can turn its capabilities to harm. This is why least privilege is also a security control: it limits what a compromised agent can do. The next part takes up security and adversarial robustness in full, but the connection is worth flagging here, because an agent's permissions define not just what it can do when working correctly, but what an adversary can do through it when it is subverted. Designing an agent's access as if it might one day be turned against you is not paranoia; it is prudent engineering.
Why the prompt is not a control
It is worth dwelling on the difference between a constraint expressed as an instruction and one enforced as a permission, because the distinction is the core of agentic safety and teams routinely get it wrong. An instruction such as "do not transfer more than this amount", placed in the agent's prompt, relies on the agent choosing to comply. But the flexibility that makes an agent useful is exactly what makes it capable of not complying: it can misunderstand, it can be talked out of the rule by manipulated input, it can find an interpretation under which the rule does not seem to apply, or it can simply behave unexpectedly, as complex systems do. A permission, by contrast, is enforced by the system regardless of what the agent intends: if the money-moving service rejects transfers above the limit, the agent cannot exceed it, no matter what it tries or what an attacker tells it. The first is a request; the second is a wall. In regulated AI, where you must be able to assure a regulator that a boundary holds, only the wall counts. This redirects where you invest your effort: not into perfecting the agent's instructions, which can always be subverted, but into the guardrails around its tools, which hold regardless of the agent's behaviour.
If your safety depends on the agent deciding to obey, you have a policy, not a control. Real boundaries are the ones the agent is incapable of crossing.
Least privilege as blast-radius design
Least privilege is often framed as a security principle, but for agents it is just as much a blast-radius principle: the permissions an agent holds define the maximum harm it can cause when something goes wrong, whether through its own error or through subversion by an attacker. This reframing makes the cost of over-provisioning vivid. An agent granted broad standing access "for flexibility" is an agent that, on its worst day, can cause broad harm — and its worst day will come, whether through a compounding error, a prompt injection, or an edge case nobody anticipated. An agent granted only the narrow, scoped, time-bounded permissions its task genuinely requires can, even at its worst, cause only narrow harm. Every permission you withhold is a category of damage you have pre-emptively prevented. The discipline is to grant the minimum — scoped tools, scoped data, task-bounded rather than standing grants, separation of duties for high-consequence actions — and to treat each additional permission as a deliberate expansion of the blast radius that must be justified, not a convenience to be granted by default.
Tools as the safety surface
The tools you give an agent are not passive conduits; they are the primary place to build safety in, because they sit at the boundary between the agent's intent and real-world effect. A well-designed tool does not trust the agent to use it correctly; it enforces its own constraints. It validates every request against limits on magnitude, scope, and format, and refuses anything outside them with a clear error rather than acting. It makes its limits intrinsic, living in the tool or the system behind it where the agent cannot exceed them. Where it performs something consequential, it prefers reversible or staged execution over irreversible one-shot action. And it logs every invocation (what was requested, what was permitted, what happened), feeding the audit trail automatically. Designed this way, the tools form a layer of safety that holds independently of how the agent behaves: even an agent that is confused, manipulated, or simply wrong cannot push a well-designed tool past the limits it enforces. This is defence in depth applied to agency: the agent may err, but the tools it acts through keep the error from becoming a harm.
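The preference for staged and reversible execution can be illustrated with a delete operation split into three steps. The sketch assumes a hypothetical in-memory `StagedRecordStore`; the method names and the token scheme are illustrative, and the intent is that only `stage_delete` would be exposed to the agent as a tool.

```python
import uuid
from dataclasses import dataclass, field


@dataclass
class StagedRecordStore:
    """Hypothetical store whose deletes are staged, confirmed, and undoable."""
    records: dict = field(default_factory=dict)
    _staged: dict = field(default_factory=dict)   # token -> record id
    _trash: dict = field(default_factory=dict)    # record id -> removed value

    def stage_delete(self, record_id: str) -> str:
        """Agent-facing step: records the intent but changes nothing yet."""
        if record_id not in self.records:
            raise KeyError(record_id)
        token = uuid.uuid4().hex
        self._staged[token] = record_id
        return token

    def confirm(self, token: str) -> None:
        """Confirmation step, intended for a human or a separate component."""
        record_id = self._staged.pop(token)   # unknown tokens raise KeyError
        # Soft delete: the value moves to the trash so it can be restored.
        self._trash[record_id] = self.records.pop(record_id)

    def undo(self, record_id: str) -> None:
        """Recovery step: restore a record that was removed by confirm()."""
        self.records[record_id] = self._trash.pop(record_id)


store = StagedRecordStore(records={"r1": "alpha"})
token = store.stage_delete("r1")
print("r1" in store.records)   # staging alone has not removed anything
store.confirm(token)
print("r1" in store.records)   # removed only after confirmation
store.undo("r1")
print(store.records["r1"])     # and restorable afterwards
```

How long removed values are retained, and who may call `confirm` and `undo`, are policy questions the sketch leaves open.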
Assume subversion
A sound design posture for any agent with meaningful permissions is to assume that it will, at some point, be turned against its purpose — by a compounding error, by manipulated input, or by a deliberate attacker exploiting prompt injection. Designing under this assumption changes your choices. You grant least privilege not only to prevent honest errors but to limit what a subverted agent can do. You build circuit-breakers that halt anomalous behaviour without waiting for a human to notice. You isolate the agent's effects so a failure in one domain cannot reach others. You ensure reliable stop and recovery mechanisms exist and have been tested. And you log everything, so a subversion can be detected and reconstructed. The next part takes up security in full, but the connection belongs here: an agent's permissions are not just what it can do when working correctly — they are what an adversary can do through it when it is not. Designing an agent's access as though it might one day serve someone else's goals is not paranoia; it is the prudent recognition that capability, once granted, can be misdirected, and that the only reliable limit on the harm is the boundary you built around it.
In the next part: security and adversarial robustness — how AI systems are attacked, from data poisoning to model extraction to prompt injection, and how to defend them.
