Security and Adversarial Robustness (Building Regulated AI: From Principles to Production)

An AI system is software, and so it inherits all the security concerns of any software — but it also faces a category of threat that is distinctly its own. The model itself can be attacked: its training data poisoned, its inputs manipulated to fool it, its parameters stolen, its training data reconstructed, or its instructions hijacked. These attacks exploit properties of machine learning that have no counterpart in conventional systems, and in regulated AI they are not merely a security problem but a governance obligation, because a system that can be subverted cannot be trusted to make the decisions it was approved to make. This part maps the adversarial landscape and the defences that belong in a regulated-AI programme.

Why AI security is different

Conventional security protects the confidentiality, integrity, and availability of systems and data. AI security must protect all of that and the integrity of the model's behaviour itself — a model that has been subtly manipulated may continue to run perfectly while producing attacker-chosen outputs. The attack surface includes the training data, the model, the inputs at inference time, and the outputs, each of which can be targeted in ways that leave the surrounding system apparently healthy. This is why AI security cannot be left entirely to a general security team unfamiliar with machine learning; it requires understanding how models can be turned against their purpose.

The adversarial threat landscape

It helps to organise AI-specific threats by what they target.

Data poisoning

An attacker who can influence the training data can corrupt the model. By injecting carefully crafted examples, they can degrade the model's performance, introduce specific blind spots, or plant a hidden trigger — a "backdoor" — that causes chosen behaviour when a particular pattern appears. Poisoning is insidious because the resulting model looks normal on ordinary data; the corruption only manifests under the attacker's conditions. It connects directly to the data-governance discipline of Part 7: knowing the provenance of your training data is also a security control, because data of unknown origin is a poisoning risk.

Evasion attacks

At inference time, an attacker can craft inputs designed to fool the model — adversarial examples that are subtly perturbed to produce a wrong output while appearing normal to a human. In a regulated context, evasion lets an adversary defeat the very purpose of a system: slipping fraudulent transactions past a detector, evading a content filter, or gaming an eligibility model. Wherever there is an incentive to beat your model, assume someone will try.

Model extraction

By querying a model and observing its outputs, an attacker can reconstruct an approximation of it — stealing intellectual property and, worse, building a copy on which to develop evasion attacks at leisure. Systems that expose model outputs broadly are most exposed.

Model inversion and membership inference

Some attacks target the data behind the model. Inversion attempts to reconstruct sensitive training data from the model's behaviour; membership inference attempts to determine whether a particular individual's data was in the training set. Both are privacy attacks delivered through the model, linking security directly to the privacy obligations of Part 8 — a model can leak personal data even when the data itself is locked away.

Prompt injection and instruction hijacking

For systems driven by natural-language instructions, and for agents especially, a potent threat is prompt injection: malicious content placed where the system will read it — in a document, a web page, a user message — that hijacks the system's instructions and redirects its behaviour. For an agent with permissions (Part 14), a successful injection hands the attacker the agent's capabilities. This is among the most pressing security concerns for modern agentic systems, and it has no clean, complete solution, which makes the containment and least-privilege measures of the previous part all the more important.

An attacker who hijacks an agent inherits everything the agent can do. The agent's permissions are the attacker's permissions.

Defending AI systems

There is no single defence that makes an AI system secure; robustness comes from layers, in the same defence-in-depth spirit as conventional security. The major directions:

Protect the data supply chain. Establish provenance for training data, validate it, and guard the pipeline against tampering. Much poisoning risk is mitigated by knowing and controlling where data comes from.
Harden the model. Techniques exist to make models more robust to adversarial inputs — training on adversarial examples, detecting anomalous inputs, and reducing sensitivity to small perturbations. None is complete, but each raises the cost of attack.
Control access to the model. Limit who can query the model and how much, monitor for the query patterns that signal extraction or probing, and avoid exposing more of the model's behaviour than necessary. Rate limits and access controls blunt extraction and evasion alike.
Validate inputs and isolate untrusted content. For injection, treat any content the system reads from outside as untrusted, separate instructions from data where possible, and constrain what the system can do in response. Combined with least privilege, this limits the damage a successful injection can cause.
Monitor behaviour. Many attacks reveal themselves through anomalous behaviour — unusual inputs, shifts in output patterns, spikes in particular requests. The monitoring discipline of the next part is also a security control.
Contain the blast radius. Assume some attacks will succeed and ensure that a compromised model or agent can do only limited harm — the containment principles of Parts 13 and 14 applied with a security lens.

Security as a governance obligation

In regulated AI, security is not a separate concern handled by another team after the model is built; it is part of whether the system can be trusted to do its job, and therefore part of governance. Several threads make this concrete. Validation (Part 12) should include adversarial robustness — a system that has not been probed for how it fails under attack has not been fully validated. The threat model belongs in documentation (Part 11), so that the system's security assumptions and limitations are recorded and assessable. Incident response (a later part) must contemplate security incidents, not just performance failures. And the accountable owner (Part 4) owns security risk along with every other risk the system carries. A regulator examining a high-risk AI system will increasingly ask how it is protected against manipulation, and "we secured the servers" is not an answer to "how do you know your model has not been poisoned or hijacked?"

An evolving arms race

Finally, AI security is a moving target. Attacks evolve, new vulnerabilities are discovered, and defences that suffice today may not tomorrow — particularly for fast-moving areas like agentic systems and prompt injection, where the research is young and no defence is yet complete. This argues for humility and for designing on the assumption that some attacks will get through: layered defences, least privilege, containment, and monitoring, so that the failure of any single control does not mean the failure of the system. Security maturity in AI is not the belief that you have made the system impregnable; it is the discipline of making attacks costly, detecting them quickly, and ensuring that when one succeeds, the harm is bounded and recoverable.

The backdoor you cannot see

Of all the adversarial threats, data poisoning deserves a closer look, because it is the one most likely to pass every ordinary check. An attacker who can influence even a small portion of training data can plant a backdoor: the model behaves perfectly on all normal inputs, passing validation and performing well in production, but produces an attacker-chosen output whenever a specific, secret trigger pattern appears. The insidious part is that nothing looks wrong. The model's accuracy is excellent, its fairness checks pass, its behaviour on every test case the team thinks to try is impeccable — because the corruption only manifests under conditions the team does not know to test. This is why data provenance, introduced as a data-governance discipline in Part 7, is also a frontline security control: a model is only as trustworthy as the data it learned from, and data of unknown origin is a poisoning risk you cannot detect after the fact. Knowing and controlling where training data comes from is not merely good data hygiene; it is one of the few defences against an attack that, by design, leaves no trace in the model's ordinary behaviour.

A poisoned model passes your tests. That is the point of poisoning it. Your defence is not testing the model harder but controlling the data it was built from.

Prompt injection: the unsolved problem

For systems driven by natural-language instructions, and for agents above all, prompt injection is among the most pressing and least solved security threats, and honesty about its difficulty is itself a form of maturity. The attack is simple to state: malicious content placed where the system will read it — in a document it processes, a web page it visits, a message it receives — hijacks the system's instructions and redirects its behaviour. For an agent holding permissions, a successful injection hands the attacker those permissions, turning the agent's capabilities to the attacker's ends. What makes this so hard is that the system's flexibility — its ability to follow instructions expressed in natural language — is exactly what the attack exploits, and there is no clean, complete defence that preserves that flexibility. Partial mitigations help: treating all externally-sourced content as untrusted, separating instructions from data where the architecture allows, constraining what the system can do in response to content it reads, and — crucially — falling back on the least-privilege and containment measures of the previous part, so that even a successful injection can only do limited harm. The realistic posture is not to assume injection can be prevented but to design so that its success is survivable, which is why the security and agentic-safety disciplines are so tightly bound.

Security as a layered, evolving defence

No single measure makes an AI system secure; robustness comes from layers, in the defence-in-depth tradition, precisely because each individual defence is incomplete and any one can be breached. The layers reinforce one another: protecting the data supply chain blunts poisoning; hardening the model raises the cost of evasion; controlling access to the model limits extraction and probing; validating inputs and isolating untrusted content constrains injection; monitoring behaviour detects attacks in progress; and blast-radius containment ensures that an attack which gets through can do only bounded harm. The strength of the whole is that an attacker must defeat many layers rather than one, and that the failure of any single layer does not mean the failure of the system. This is also why security cannot be a one-time effort: attacks evolve, new vulnerabilities emerge, and defences that suffice today may not tomorrow — especially for young, fast-moving areas like agentic systems and prompt injection, where the research is immature and no defence is yet complete. Security maturity in AI is not the belief that the system has been made impregnable; it is the discipline of making attacks costly, detecting them quickly, containing their effects, and continuously updating defences as the threat landscape shifts.

Security belongs to governance

The final point ties security back to the rest of the course: in regulated AI, security is not a separate concern handled by another team after the model is built — it is part of whether the system can be trusted to do its job, and therefore part of governance. The threads are explicit. Validation must include adversarial robustness, because a system never probed for how it fails under attack has not been fully validated. The threat model belongs in documentation, so the system's security assumptions and limitations are recorded and assessable. Incident response must contemplate security incidents alongside performance failures. And the accountable owner owns security risk along with every other risk the system carries. A regulator examining a high-risk AI system increasingly asks how it is protected against manipulation, and the answer cannot be "we secured the servers" — that addresses the infrastructure, not the model. "How do you know your model has not been poisoned, evaded, or hijacked?" is a governance question, and answering it requires the security discipline of this part woven into the framework of the whole.

In the next part: deployment, change management, and versioning — moving validated systems into production safely, and controlling the changes that follow.

← Previous lesson · Next lesson →