Testing and Validation of AI Systems (Building Regulated AI: From Principles to Production)

Validation is the moment of truth in the model lifecycle — the point at which someone independent of the builders asks, rigorously and sceptically, whether this system is actually fit to make the decisions it is about to make. It is the central control of model risk management and the practice that most directly determines whether a system should be trusted. This part covers what validation must examine in an AI system, why independence is non-negotiable, and why — like everything in regulated AI — validation is a continuous discipline rather than a one-time gate.

Validation is not testing

Developers test; they check that the system does what they built it to do. Validation is broader and more adversarial: it asks whether the system is fit for its real purpose, including the ways it might fail, the conditions its builders did not consider, and the uses it will face in the world rather than the lab. A model can pass every test the developer devised and still fail validation, because the developer tested the system they imagined, while the validator probes the system as it will actually be used and abused. The mindset difference is the point: the validator's job is to find the reasons not to trust the system.

Testing asks "does it do what we built it to do?" Validation asks "should anyone rely on it, and where will it break?"

The independence principle

Validation's power comes from independence. The validator must be separate from, and not beholden to, the development team — different reporting lines, no stake in the outcome, freedom to deliver bad news. The reasons are human, not technical: builders are too close to their own work to see its flaws, are invested in its success, and share the blind spots that shaped it. An independent validator brings fresh eyes and, crucially, the freedom to say "this is not ready" without it costing them. Validation performed by the development team, or by someone who reports to it, is not validation; it is testing with a grander name. The degree of independence should scale with risk — the highest-stakes systems warrant validators fully separated from the business, often in the second line of defence.

What to validate

AI validation must go well beyond "is it accurate?". A thorough validation examines several dimensions, each tied to material from earlier parts.

Conceptual soundness

Is the approach itself sensible? Does the choice of model, data, and features make sense for the problem? A model can be technically well-executed and conceptually misguided — solving the wrong problem, or relying on data that will not be available or reliable in production.

Performance, honestly measured

How well does the system perform, measured on data that genuinely represents production conditions rather than a convenient test set? Validators probe for the subtle ways performance gets overstated — data leakage, overfitting, evaluation on data too similar to training — and insist on metrics appropriate to the decision, not just headline accuracy.

Performance across subgroups

Aggregate performance hides subgroup failure. Validation must examine how the system performs across the different populations it will serve, connecting directly to the fairness testing of Part 9. A model with strong average performance and poor performance for a minority group is not validated until that disparity is surfaced and addressed.
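A minimal sketch of the subgroup check, assuming labelled evaluation data with a group attribute attached to each case. The function names, the toy data, and the 0.10 gap threshold are illustrative assumptions, not a prescribed standard; a real validation would choose metrics and thresholds suited to the decision.

```python
from collections import defaultdict


def accuracy_by_group(y_true, y_pred, groups):
    """Return overall accuracy and a dict of accuracy per subgroup."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for truth, pred, group in zip(y_true, y_pred, groups):
        total[group] += 1
        correct[group] += int(truth == pred)
    overall = sum(correct.values()) / sum(total.values())
    per_group = {g: correct[g] / total[g] for g in total}
    return overall, per_group


def flag_disparities(overall, per_group, max_gap=0.10):
    """List subgroups whose accuracy trails the overall figure by more than max_gap."""
    return sorted(g for g, acc in per_group.items() if overall - acc > max_gap)


# Toy data (hypothetical): group "A" has 8 cases, group "B" has 2.
y_true = [1, 0, 1, 1, 0, 1, 0, 1, 1, 0]
y_pred = [1, 0, 1, 1, 0, 1, 0, 1, 0, 1]
groups = ["A"] * 8 + ["B"] * 2

overall, per_group = accuracy_by_group(y_true, y_pred, groups)
flagged = flag_disparities(overall, per_group)

print(overall)    # 0.8
print(per_group)  # {'A': 1.0, 'B': 0.0}
print(flagged)    # ['B']
```

In this toy case the headline figure of 80% conceals a subgroup for which every prediction is wrong, which is exactly the pattern an aggregate metric cannot show.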

Robustness and stability

How does the system behave under stress — unusual inputs, edge cases, noisy or adversarial data, conditions that differ from training? A system that performs well on typical cases but collapses on atypical ones is fragile, and validation must map that fragility so it can be managed.

Explainability and soundness of reasoning

Validation should confirm that the system can be explained to the degree its risk tier requires, and that its reasoning is sound rather than relying on spurious correlations. A model that achieves accuracy by latching onto an artefact of the data — a quirk that will not hold in production — is a validation failure even if its numbers look good.

Limitations and failure modes

Perhaps most importantly, validation should produce a clear-eyed account of where the system fails and should not be trusted. This feeds directly into the limitations documentation of Part 11 and into the conditions placed on the system's use.

Validation produces decisions, not just reports

Validation is not an academic exercise that ends in a document filed away. It produces a decision: is the system fit to deploy, fit to deploy only under conditions, or not fit at all? A validator who only ever produces favourable reports is not exercising effective challenge. Genuine validation sometimes says no, or says "yes, but only for these uses, with these controls, and these limitations" — and the organisation must have the discipline to honour those conditions rather than quietly ignoring them once the system is live. The conditions a validator attaches are themselves controls, and they belong in the audit trail.

Validation is continuous

The most important shift from traditional thinking is that validation is not a gate the system passes once on its way to production. Because the data and conditions an AI system meets in production drift away from those it was built and validated on, a system validated at launch can become unfit without any code changing at all. Validation must therefore be ongoing. Periodic revalidation runs on a schedule appropriate to the system's risk and volatility, and triggered revalidation runs whenever something material changes: the data distribution shifts, the use expands, performance degrades, or the assumptions the model makes about the world stop holding. This continuous validation is the natural partner of the monitoring discipline covered in a later part: monitoring detects when something has changed, and revalidation determines whether the system is still fit in light of the change.

A model is not validated once and forever. It is validated as of a date, under conditions that expire.

Validating what you did not build

An increasingly common and difficult case is validating a system built on a model you did not create and cannot fully inspect — a third-party or foundation model. Traditional validation assumes access to the model's internals, which you may not have. This does not excuse you from validation; it changes its shape, shifting weight toward behavioural testing, vendor due diligence, and rigorous assessment of the model in your specific context of use. We return to this challenge in the dedicated part on third-party and foundation-model risk, but the principle holds: you cannot outsource accountability for a decision to a vendor whose model you chose to rely on.

Validation as the keystone

Validation sits at the centre of the governance arch, drawing together data quality, fairness, explainability, robustness, and documentation into a single independent judgement about trust. It is where the abstract principle "can we defend this system?" becomes a concrete, evidenced answer. A strong validation function — independent, rigorous, empowered to say no, and operating continuously — is perhaps the clearest signal that an organisation's AI governance is real. As we move into the agentic systems that follow, validation's challenge intensifies, because systems that plan and act over many steps are far harder to validate than systems that make a single prediction.

The ways performance gets overstated

A validator's most important instinct is suspicion of good numbers, because performance is overstated far more often through honest error than through dishonesty. A few patterns recur so reliably that probing for them is core validation craft.

Data leakage. Information that will not be available at decision time, or that effectively encodes the answer, sneaks into the training features — and the model posts spectacular results that collapse in production. Leakage is subtle and common; a model that seems too good usually is.
Overfitting to the test set. When a team tunes against the same test data repeatedly, the test set quietly stops being an honest measure and starts being part of the training, inflating reported performance.
Unrepresentative evaluation data. Performance measured on data too clean, too easy, or too similar to training overstates how the system will do on the messy reality of production.
Misleading metrics. A headline accuracy can look excellent while the model fails badly on the cases that matter — the rare-but-costly ones a single aggregate number conceals.

The validator's job is to assume good numbers are wrong until proven otherwise, and to probe each of these. A development team reporting strong results is not lying; it is usually just too close to its own work to see how the results were inflated. That distance is exactly what independent validation supplies.
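The misleading-metrics pattern is easy to reproduce. The sketch below uses an invented dataset of 1,000 cases with 10 rare, costly positives, and a hypothetical model that always predicts the negative class; the helper names are assumptions made for illustration.

```python
def accuracy(y_true, y_pred):
    """Share of all cases predicted correctly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def recall(y_true, y_pred, positive=1):
    """Share of the truly positive cases that the model caught."""
    positives = [(t, p) for t, p in zip(y_true, y_pred) if t == positive]
    if not positives:
        raise ValueError("no positive cases in the evaluation data")
    return sum(p == positive for _, p in positives) / len(positives)


# Hypothetical evaluation set: 10 rare-but-costly positives among 1,000 cases.
y_true = [1] * 10 + [0] * 990
# A model that never predicts the positive class at all.
y_pred = [0] * 1000

acc = accuracy(y_true, y_pred)
rec = recall(y_true, y_pred)

print(acc)  # 0.99
print(rec)  # 0.0
```

A reported accuracy of 99% here describes a model that misses every case that matters, which is why a validator asks for metrics matched to the decision before accepting any headline number.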

If a model's performance looks too good to be true, it almost always is. The validator's first question is not "how did they do it?" but "how is this number lying to me?"

Validating behaviour, not just accuracy

Thorough validation goes well beyond a performance figure to probe how the system actually behaves, especially under stress. Robustness testing pushes the system with unusual, noisy, edge-case, and adversarial inputs to map where it becomes unreliable — because a system that excels on typical cases and collapses on atypical ones is fragile in exactly the situations that produce incidents. Subgroup testing examines performance across the populations the system serves, surfacing the disparate failures that aggregate metrics hide. Reasoning-soundness checks ask whether the model achieves its results for sound reasons or by latching onto spurious correlations that will not hold in production. And stability testing examines whether small changes in input produce sensibly small changes in output, or wild swings that signal an unreliable model. None of these is captured by accuracy, and all of them bear directly on whether the system can be trusted — which is why validation is a far broader exercise than testing whether the model hits its numbers.
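One way to make the stability question concrete is to nudge each input slightly and record the largest resulting change in output. The sketch below is a simplified illustration under stated assumptions: the two scoring functions are hypothetical stand-ins for a real model, and the 1% relative noise level and 100 trials are arbitrary choices.

```python
import random


def stability_gap(predict, x, noise=0.01, trials=100, seed=0):
    """Largest change in score when each feature is nudged by up to +/- noise (relative)."""
    rng = random.Random(seed)  # fixed seed so the check is repeatable
    baseline = predict(x)
    worst = 0.0
    for _ in range(trials):
        nudged = [v * (1 + rng.uniform(-noise, noise)) for v in x]
        worst = max(worst, abs(predict(nudged) - baseline))
    return worst


def smooth_model(x):
    # Hypothetical well-behaved scorer: a weighted sum of two features.
    return 0.3 * x[0] + 0.7 * x[1]


def brittle_model(x):
    # Hypothetical scorer with a hard cliff at x[0] == 1.0.
    return 1.0 if x[0] >= 1.0 else 0.0


x = [1.0, 2.0]
smooth_gap = stability_gap(smooth_model, x)
brittle_gap = stability_gap(brittle_model, x)

print(f"smooth model, worst swing:  {smooth_gap:.4f}")
print(f"brittle model, worst swing: {brittle_gap:.4f}")
```

The smooth scorer moves by at most about 0.017 under a 1% nudge, while the brittle one flips its full output range. Neither behaviour is visible in an accuracy figure, and a real stability test would perturb inputs in ways that are meaningful for the domain rather than with uniform noise.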

The validator's mandate to say no

Validation is worth nothing if its only possible output is approval. A validation function that has never rejected a system, or never attached binding conditions, is not exercising effective challenge — it is performing a ritual. Genuine validation produces real decisions: fit to deploy, fit only under specified conditions and limitations, or not fit at all. And the organisation must honour those outputs, which is harder than it sounds: when a business has invested in a system and is eager to ship, a validator's "not yet" or "only for these uses" creates friction that there will be pressure to dissolve. The integrity of the whole control depends on the validator having genuine authority to say no, the independence to say it without personal cost, and an organisation disciplined enough to treat conditions as binding rather than advisory. The conditions a validator attaches are themselves controls, and quietly ignoring them once the system is live is a governance failure that the audit trail should make visible.
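Treating conditions as binding is easier when they are recorded as data rather than prose. The following is a sketch only, with invented class, field, and control names; it shows the three outcomes described above and a deployment check that refuses to pass unless every attached condition is met.

```python
from dataclasses import dataclass
from enum import Enum


class Outcome(Enum):
    FIT = "fit to deploy"
    CONDITIONAL = "fit only under specified conditions"
    NOT_FIT = "not fit"


@dataclass(frozen=True)
class ValidationDecision:
    system: str
    outcome: Outcome
    conditions: tuple = ()  # names of controls the validator requires

    def __post_init__(self):
        # A conditional approval with no conditions is a contradiction.
        if self.outcome is Outcome.CONDITIONAL and not self.conditions:
            raise ValueError("a conditional decision must list its conditions")


def may_deploy(decision, controls_in_place):
    """Allow deployment only if the outcome permits it and every condition is met."""
    if decision.outcome is Outcome.NOT_FIT:
        return False
    return set(decision.conditions) <= set(controls_in_place)


# Hypothetical decision for an illustrative system.
decision = ValidationDecision(
    system="credit-scoring-v2",
    outcome=Outcome.CONDITIONAL,
    conditions=("human_review_of_declines", "retail_customers_only"),
)

print(may_deploy(decision, ["human_review_of_declines"]))  # False
print(may_deploy(decision, ["human_review_of_declines", "retail_customers_only"]))  # True
```

Because the decision object is immutable and the check is explicit, a deployment that ignores a condition has to bypass the check, and that bypass is something an audit trail can record.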

Validation across the lifecycle, not at a point

The deepest shift validation requires is from thinking of it as a one-time gate to treating it as a continuous discipline. Because AI systems drift, a system validated at launch can become unfit without a single line of code changing — so validation must recur. This takes two forms working together. Scheduled revalidation reassesses the system periodically on a cadence set by its risk and volatility, regardless of whether anything has obviously changed. Triggered revalidation fires when something material does change: significant drift, performance degradation, fairness deterioration, a shift in the data, or an expansion in use. The two are complementary — scheduled revalidation catches slow erosion that no single trigger would flag, while triggered revalidation responds to discrete events between scheduled reviews. Together they realise the principle that a model is validated as of a date, under conditions that expire, and they bind validation tightly to the monitoring discipline that detects when those conditions have shifted.
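The interplay of scheduled and triggered revalidation can be sketched as a single check that returns the reasons revalidation is due. The function name, the trigger names, and the 180-day cadence below are assumptions chosen for illustration; in practice the cadence would follow from the system's risk tier and the triggers from its monitoring signals.

```python
from datetime import date, timedelta


def revalidation_due(last_validated, today, cadence_days, triggers):
    """Return the reasons revalidation is due; an empty list means none."""
    reasons = []
    # Scheduled: the validation has aged past its cadence, whatever else is true.
    if today - last_validated >= timedelta(days=cadence_days):
        reasons.append("scheduled")
    # Triggered: any material change flagged by monitoring.
    reasons.extend(sorted(name for name, fired in triggers.items() if fired))
    return reasons


last_validated = date(2024, 1, 1)
triggers = {"drift": True, "performance_degradation": False, "use_expanded": False}

# 60 days after validation: only the drift trigger applies.
early = revalidation_due(last_validated, date(2024, 3, 1), 180, triggers)
# 213 days after validation: the schedule has also lapsed.
late = revalidation_due(last_validated, date(2024, 8, 1), 180, triggers)

print(early)  # ['drift']
print(late)   # ['scheduled', 'drift']
```

Returning reasons rather than a bare yes or no keeps the record of why a revalidation was started, which is useful evidence when the validation history is later reviewed.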

In the next part: agentic AI — the governance challenge of systems that plan and act autonomously over multiple steps, and how to let them be useful inside a boundary you can define and defend.
