Deployment, Change Management, and Versioning (Building Regulated AI: From Principles to Production)

A model that has been validated and approved is not yet a model in production, and the journey between those two states is one of the most underestimated sources of risk in the whole lifecycle. The system that ran in a data scientist's notebook and the system that runs in production are connected by a chain of engineering — data pipelines, serving infrastructure, integrations — at every link of which behaviour can subtly change. And once live, the system will not stay still: data shifts, requirements evolve, and the model will be retrained and updated, each change an opportunity for something to go wrong. This part is about controlling deployment and change so that what runs is always what was approved, and so that changes are deliberate rather than accidental.

The deployment gap

The first discipline is ensuring that the deployed system genuinely matches the validated one. This sounds trivial and is not. Models are validated in one environment and deployed in another; data that was clean in training arrives messy in production; a feature computed one way offline is computed slightly differently online; a preprocessing step is reimplemented and subtly diverges. Each discrepancy can change behaviour, and because the changes are small and silent, they often go unnoticed until they cause a problem. The phenomenon is common enough to have a name in practice — training/serving skew — and it means a system can pass validation and then behave differently in production through no deliberate change at all.

The model you validated and the model you deployed are the same only if you have taken deliberate steps to make them so. By default, they drift apart.

Guarding against the gap requires treating the deployed system, not just the model, as the unit of validation: validating in conditions that mirror production, reusing the exact same data-processing code in training and serving where possible, and testing the deployed system end to end before it makes real decisions. The principle is that approval attaches to a complete, deployed configuration, not to a model artefact in isolation.
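
One way to make "approval attaches to a complete, deployed configuration" checkable is to fingerprint the whole configuration at approval time and compare it at deploy time. The following is a minimal sketch under stated assumptions, not a prescribed mechanism: the manifest fields, the version strings, and the function names are all hypothetical.

```python
import hashlib
import json


def fingerprint(manifest: dict) -> str:
    """Return a stable SHA-256 digest of a deployment manifest."""
    # sort_keys plus fixed separators gives a canonical JSON string, so the
    # same content hashes to the same value regardless of key order.
    canonical = json.dumps(manifest, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def matches_approval(deployed: dict, approved_fingerprint: str) -> bool:
    """True only if the deployed configuration is exactly the approved one."""
    return fingerprint(deployed) == approved_fingerprint


# Hypothetical manifest captured when the system was approved.
approved_manifest = {
    "model_version": "credit-risk-1.4.0",
    "preprocessing_version": "features-2.1.0",
    "serving_image": "scoring-service-0.9.3",
    "decision_threshold": 0.62,
}
approved_fingerprint = fingerprint(approved_manifest)

# Same model artefact, but the preprocessing was quietly reimplemented.
drifted_manifest = dict(approved_manifest, preprocessing_version="features-2.2.0")

print(matches_approval(approved_manifest, approved_fingerprint))  # True
print(matches_approval(drifted_manifest, approved_fingerprint))   # False
```

A check like this only catches differences that the manifest records, so it complements end-to-end testing of the deployed system rather than replacing it.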

Versioning everything

Because AI systems change over time, and because you must be able to reconstruct any past decision (Part 6) and explain it (Part 11), versioning is foundational. In AI, versioning means more than versioning code. The outcome of a decision depends on several things, each of which must be versioned and linked to the others:

The model — the specific trained artefact, so the exact model that made a past decision can be retrieved.
The data — the dataset the model was trained on, so training is reproducible and provenance is clear.
The code — the training and serving code, including all the preprocessing that shapes inputs.
The configuration — the parameters, thresholds, and settings that govern behaviour, which can change outcomes as much as the model itself.

With everything versioned and linked, you can answer the essential question for any decision: exactly which model, trained on which data, with which code and configuration, made this decision on this date? Without that linkage, the question has no reliable answer, and explainability, validation, and audit all collapse. Reproducibility, the ability to recreate precisely how any given decision was produced, is the goal, and versioning is how you achieve it.
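
The linkage between a decision and the four versioned elements can be sketched as a record written at decision time. This is an illustrative sketch only: the class names, field names, and version identifiers are assumptions, and a real system would persist the records in durable, tamper-evident storage rather than an in-memory dictionary.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Dict


@dataclass(frozen=True)
class SystemVersion:
    """The four things that together determine behaviour."""
    model: str
    training_data: str
    code: str
    config: str


@dataclass(frozen=True)
class DecisionRecord:
    decision_id: str
    decided_at: datetime
    outcome: str
    system: SystemVersion


class DecisionLog:
    """In-memory stand-in for a durable decision store (hypothetical)."""

    def __init__(self) -> None:
        self._records: Dict[str, DecisionRecord] = {}

    def record(self, decision_id: str, outcome: str, system: SystemVersion) -> DecisionRecord:
        # Stamp the decision with the exact versions in force when it was made.
        entry = DecisionRecord(decision_id, datetime.now(timezone.utc), outcome, system)
        self._records[decision_id] = entry
        return entry

    def what_decided(self, decision_id: str) -> SystemVersion:
        """Answer: which model, data, code, and configuration made this decision?"""
        return self._records[decision_id].system


# Hypothetical version identifiers.
v1 = SystemVersion(
    model="credit-risk-1.4.0",
    training_data="applications-2024Q1",
    code="git:4f2a9c1",
    config="thresholds-7",
)
log = DecisionLog()
log.record("decision-0001", "declined", v1)
print(log.what_decided("decision-0001"))
```

Making the records immutable (`frozen=True`) reflects the intent that a decision's provenance is written once and never edited afterwards.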

Change management

Every change to a live AI system carries risk, and so changes must be managed rather than made ad hoc. The core idea, borrowed from mature operational practice, is that significant changes go through a controlled process proportionate to their risk: assessed, tested, approved, and recorded before they reach production. For AI this applies to a wide range of changes, some of which teams do not initially recognise as changes at all:

Retraining — even retraining on fresh data with the same approach produces a new model that may behave differently and, for high-risk systems, may warrant revalidation before it goes live.
Model updates — new architectures, features, or approaches are substantial changes requiring full assessment.
Configuration changes — adjusting a threshold can change who is approved or declined as dramatically as a new model, yet such changes are often made casually. They deserve the same control.
Data and pipeline changes — changes to data sources or processing can alter behaviour even with the model untouched.
Use changes — expanding a system to a new population or purpose can change its risk tier and re-open classification, validation, and the obligation map.

The discipline is to recognise each of these as a change, route it through assessment proportionate to its risk, and record it. A particular trap is the "minor tweak" — a configuration adjustment or quick retrain made outside the change process because it seemed too small to bother with — which is exactly how unreviewed changes accumulate into a system that no longer resembles the one that was approved.
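
The routing described above can be sketched as a function from a change request to the steps it must pass through. The step names and the specific policy below are assumptions made for illustration, loosely following the five change types listed; a real policy would be set by the firm's own risk framework.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


class ChangeType(Enum):
    RETRAINING = "retraining"
    MODEL_UPDATE = "model_update"
    CONFIGURATION = "configuration"
    DATA_PIPELINE = "data_pipeline"
    USE = "use"


@dataclass(frozen=True)
class ChangeRequest:
    change_type: ChangeType
    description: str
    high_risk_system: bool


def required_steps(change: ChangeRequest) -> List[str]:
    """Return the ordered steps a change must pass before production.

    Hypothetical policy: no change type skips assess/test/approve/record.
    """
    steps = ["assess"]
    if change.change_type is ChangeType.USE:
        # A new population or purpose re-opens classification and obligations.
        steps += ["reclassify", "review_obligation_map"]
    steps.append("test")
    needs_revalidation = (
        change.change_type in (ChangeType.MODEL_UPDATE, ChangeType.USE)
        or (change.change_type is ChangeType.RETRAINING and change.high_risk_system)
    )
    if needs_revalidation:
        steps.append("revalidate")
    steps += ["approve", "record"]
    return steps


tweak = ChangeRequest(ChangeType.CONFIGURATION, "lower approval threshold", True)
retrain = ChangeRequest(ChangeType.RETRAINING, "monthly refresh", True)

print(required_steps(tweak))    # ['assess', 'test', 'approve', 'record']
print(required_steps(retrain))  # ['assess', 'test', 'revalidate', 'approve', 'record']
```

The point of the sketch is the baseline rather than the branches: even the "minor tweak" is assessed, tested, approved, and recorded, and the assessment step is where its real impact decides how much further scrutiny it needs.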

Deploying changes safely

Even an approved change should reach production in a way that limits the harm if it turns out to be wrong. Established techniques from software operations apply directly and are worth adopting:

Staged rollout. Release a change to a small fraction of decisions first, observe its real behaviour, and expand only if it performs as expected — so a bad change harms few rather than all.
Parallel running. Run a new model alongside the old without acting on its outputs, comparing the two, to gain confidence before the new one takes over.
Fast rollback. Maintain the ability to revert to the previous version quickly if a change misbehaves. Versioning makes this possible; without it, rollback is a rebuild.
Heightened monitoring around changes. Watch a freshly changed system more closely, since changes are when problems are most likely to appear.
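
A staged rollout needs a stable way to decide which decisions see the new version. One common approach, sketched here under assumptions, is to hash a key into a fixed bucket and compare it with the current rollout percentage. The function names and the choice of key (an applicant identifier) are hypothetical.

```python
import hashlib


def rollout_bucket(decision_key: str) -> int:
    """Map a key to a stable bucket in the range 0..99."""
    digest = hashlib.sha256(decision_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100


def use_new_version(decision_key: str, rollout_percent: int) -> bool:
    """Route a decision to the new version if its bucket is inside the rollout."""
    # The bucket depends only on the key, so the same applicant always gets
    # the same answer, and raising the percentage only ever adds applicants.
    return rollout_bucket(decision_key) < rollout_percent


keys = [f"applicant-{i}" for i in range(10_000)]
share = sum(use_new_version(k, 10) for k in keys) / len(keys)
print(f"share routed to new version at 10%: {share:.3f}")
```

Because the assignment is deterministic, expanding from 5% to 25% keeps everyone who was already on the new version there, which makes behaviour during the rollout easier to compare and audit.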

Deployment and change as governance

It can be tempting to treat deployment and change as engineering concerns separate from governance, but they are inseparable from it. The validation that approved a system applies to a specific version; a change can invalidate that approval, which is why change management ties back to continuous validation (Part 12). The audit trail must record changes alongside decisions, so the history of the system's behaviour can be understood in light of how it evolved (Part 11). And the accountable owner owns the system through its changes, not just at its launch (Part 4). A regulator examining a system will ask not only how it was validated but how it has changed since, who approved those changes, and how you ensure the live system still matches what was approved. The firm that can answer — with a versioned history and a change record — demonstrates control; the firm that cannot reveals that its governance stopped at launch. As the next part makes clear, controlling change is only half the job; you must also watch the system continuously, because the most consequential changes are the ones the world makes to your data without asking.

The skew that no one changed

Training/serving skew deserves a concrete illustration, because it is the kind of failure that occurs without anyone making a decision to cause it. A model is trained offline, where a particular feature — say, a customer's average transaction value — is computed over a clean historical window. In production, the same feature is computed by a different piece of code, over a slightly different window, handling missing values slightly differently. Nothing was changed maliciously or even carelessly; two teams simply implemented the same idea twice. But the model now receives subtly different inputs than it was trained on, and its behaviour shifts — sometimes trivially, sometimes materially. The model that was validated and the model that is deciding are, in effect, no longer the same system, and no change record will show it, because nobody changed anything they recognised as the model. This is why approval must attach to a complete, deployed configuration rather than a model artefact, why the same data-processing code should be reused across training and serving wherever possible, and why end-to-end testing of the deployed system — not just the model in isolation — is essential before it makes real decisions.
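
The scenario above can be reproduced in a few lines. In this illustrative sketch the two functions stand in for the two teams' implementations of "average transaction value"; the names, the missing-value conventions, and the data are invented for the example.

```python
from typing import List, Optional, Sequence


def avg_value_offline(values: Sequence[Optional[float]]) -> float:
    """Training-side version: missing values are skipped."""
    present = [v for v in values if v is not None]
    return sum(present) / len(present) if present else 0.0


def avg_value_online(values: Sequence[Optional[float]]) -> float:
    """Serving-side reimplementation: missing values count as zero."""
    filled = [0.0 if v is None else v for v in values]
    return sum(filled) / len(filled) if filled else 0.0


def find_skew(cases: List[Sequence[Optional[float]]], tolerance: float = 1e-9) -> list:
    """Return the test cases on which the two implementations disagree."""
    return [
        case for case in cases
        if abs(avg_value_offline(case) - avg_value_online(case)) > tolerance
    ]


clean = [100.0, 50.0]
messy = [100.0, None, 50.0, None]

print(avg_value_offline(messy))  # 75.0
print(avg_value_online(messy))   # 37.5
print(find_skew([clean, messy])) # [[100.0, None, 50.0, None]]
```

On clean data the two versions agree, which is why the divergence survives validation; only inputs with missing values expose it. A parity test of this kind is a fallback for cases where sharing a single implementation across training and serving is not possible.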

The most dangerous changes are the ones nobody made on purpose. A model can drift from its validated self without a single deliberate edit.

The "minor tweak" that wasn't

The change most likely to escape proper management is the one that seems too small to bother with. A threshold nudged a few points to hit a business target; a quick retrain on fresher data "to keep things current"; a configuration value adjusted to fix an edge case. Each feels minor, and each is made outside the change process precisely because it feels minor — and that is exactly how a system quietly diverges from the one that was approved. A threshold change can move who is approved or declined as dramatically as a whole new model, yet it carries none of the ceremony, so it slips through. The discipline is to recognise that, for an AI system, configuration and retraining are not lesser changes but first-class ones, because they can alter behaviour and outcomes as much as a model swap. The remedy is not to make every tweak heavy, but to route each change — including the small-seeming ones — through assessment proportionate to its actual risk, so the threshold change gets the scrutiny its impact warrants rather than the scrutiny its apparent size suggests. Unmanaged "minor" changes are how governance erodes one reasonable-seeming step at a time.
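
Assessing a threshold change by its impact rather than its apparent size can be as simple as counting how many outcomes it flips on recent scores. The scores and thresholds below are made up for illustration, and the function names are hypothetical.

```python
from typing import Sequence


def approvals(scores: Sequence[float], threshold: float) -> int:
    """Number of applicants approved at a given threshold."""
    return sum(score >= threshold for score in scores)


def decisions_changed(scores: Sequence[float], old: float, new: float) -> int:
    """Number of applicants whose outcome flips when the threshold moves."""
    return sum((score >= old) != (score >= new) for score in scores)


# Illustrative model scores for ten recent applicants.
scores = [0.35, 0.41, 0.55, 0.58, 0.61, 0.63, 0.64, 0.70, 0.82, 0.90]

old_threshold, new_threshold = 0.65, 0.60

print(approvals(scores, old_threshold))                         # 3
print(approvals(scores, new_threshold))                         # 6
print(decisions_changed(scores, old_threshold, new_threshold))  # 3
```

In this toy sample, a five-point nudge doubles the number of approvals and changes the outcome for three applicants in ten, with no change to the model at all.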

Versioning for reproducibility

Everything in regulated AI that depends on reconstructing the past (explaining a decision, validating a model, investigating an incident) depends on reproducibility, and reproducibility depends on versioning far more than the code alone. To recreate exactly how a given decision was produced, you must be able to retrieve the specific model that made it, the data that model was trained on, the training and serving code including all preprocessing, and the configuration in force at the time. You must also be able to link them, so that for any decision you can answer: which model, trained on which data, with which code and configuration, decided this, on this date? When all of these are versioned and linked, that question has a precise answer and the past is reconstructable. When any of them is not, the answer becomes guesswork, and explainability, validation, and audit all degrade together. Versioning is unglamorous infrastructure, but it is the substrate on which much of the rest of the discipline stands; a decision you cannot reproduce is a decision you cannot truly explain or defend.

Deploying changes so failure is survivable

Even a properly assessed and approved change can turn out to be wrong, so it should reach production in a way that limits the harm if it does. The techniques are borrowed wholesale from mature software operations and apply directly. A staged rollout releases the change to a small fraction of decisions first, so a bad change harms few rather than all before its problem becomes visible. Parallel running operates the new version alongside the old without acting on its outputs, building confidence through comparison before the new version takes over real decisions. Fast rollback — made possible by the versioning above — allows a quick return to the previous known-good version when a change misbehaves, turning a potential incident into a brief blip. And heightened monitoring around any change focuses extra scrutiny exactly when problems are most likely to appear. Together these convert deployment from a moment of risk into a controlled, reversible process — which matters because, as the next part insists, even a perfect deployment is only the beginning of a system's life, and the world will keep changing the system's data long after the last deliberate change was made.
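
Parallel running can be sketched as a wrapper that calls both versions, acts only on the live one, and measures how often the candidate would have decided differently. The two toy models, the request shape, and the function names are assumptions for illustration.

```python
from typing import Callable, Iterable, List, Tuple

Model = Callable[[dict], bool]


def shadow_compare(live: Model, candidate: Model, requests: Iterable[dict]) -> Tuple[List[bool], float]:
    """Return the live decisions and the candidate's disagreement rate."""
    decisions: List[bool] = []
    disagreements = 0
    for request in requests:
        live_decision = live(request)
        candidate_decision = candidate(request)  # computed, never acted on
        if candidate_decision != live_decision:
            disagreements += 1
        decisions.append(live_decision)  # only the live version decides
    rate = disagreements / len(decisions) if decisions else 0.0
    return decisions, rate


def live_model(request: dict) -> bool:
    """Hypothetical current version."""
    return request["score"] >= 0.65


def candidate_model(request: dict) -> bool:
    """Hypothetical new version under evaluation."""
    return request["score"] >= 0.60


requests = [{"score": 0.50}, {"score": 0.62}, {"score": 0.70}, {"score": 0.90}]
decisions, disagreement_rate = shadow_compare(live_model, candidate_model, requests)

print(decisions)          # [False, False, True, True]
print(disagreement_rate)  # 0.25
```

The disagreement rate is only a starting point: the cases where the two versions differ are the ones worth reviewing individually before the candidate is allowed to take over real decisions.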

In the next part: monitoring, drift, and continuous validation — watching a live system so that degradation is caught early rather than discovered through harm.
