Privacy, Lawful Basis, and Data Minimisation (Building Regulated AI: From Principles to Production)

The moment an AI system touches data about people — and almost all consequential ones do — it enters the domain of data-protection law. This regime, exemplified by GDPR and echoed in privacy laws worldwide, is among the most demanding the system will face, and it constrains not just how you deploy a model but how you train it, because training is itself processing of personal data. This part covers the privacy obligations that most directly shape AI, and how to build for them rather than around them.

Privacy is not secrecy

A foundational misconception is that privacy means keeping data secret. It does not. Privacy law is about the lawful, fair, and transparent handling of personal data — about respecting people's rights over information about them. Data can be entirely non-secret and still be subject to strict privacy obligations. Reframing privacy from "keep it hidden" to "handle it lawfully and fairly" is the first step to building systems that respect it, because it directs attention to the right questions: do we have a basis, are we being transparent, are we honouring rights?

Lawful basis: the gating question

The cornerstone of data-protection law is that you may not process personal data without a lawful basis — a legitimate, legally recognised justification established before you begin. For AI this question arises twice and must be answered separately each time: once for training the model on personal data, and once for deploying it to make decisions about people. Teams routinely think about the deployment basis and forget the training one, then discover that the very foundation of their model rests on unlawful processing.

Two questions, not one: what is our lawful basis for training on this data, and what is our lawful basis for using the model to decide about this person?

The available bases vary by regime, but the discipline is constant: identify the basis, document it, and ensure the actual use stays within it. A basis chosen for one purpose does not stretch to cover others. Some bases also carry strings. Where consent is the basis, it must be freely given, specific to the purpose, and withdrawable at any time, which has real architectural implications: the system must record what each person consented to and be able to stop the corresponding processing when consent is withdrawn.
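
The architectural point about withdrawal can be made concrete with a small sketch. It assumes a hypothetical in-memory consent ledger keyed by person and purpose; the class and method names are illustrative, and a real system would need durable storage and an audit trail.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional


@dataclass
class ConsentRecord:
    """Consent from one person for one stated purpose (hypothetical schema)."""
    subject_id: str
    purpose: str
    granted_at: datetime
    withdrawn_at: Optional[datetime] = None

    @property
    def active(self) -> bool:
        return self.withdrawn_at is None


@dataclass
class ConsentLedger:
    """In-memory stand-in for a durable consent store."""
    _records: dict[tuple[str, str], ConsentRecord] = field(default_factory=dict)

    def grant(self, subject_id: str, purpose: str) -> None:
        now = datetime.now(timezone.utc)
        self._records[(subject_id, purpose)] = ConsentRecord(subject_id, purpose, now)

    def withdraw(self, subject_id: str, purpose: str) -> None:
        record = self._records.get((subject_id, purpose))
        if record is not None and record.active:
            record.withdrawn_at = datetime.now(timezone.utc)

    def may_process(self, subject_id: str, purpose: str) -> bool:
        # Consent for one purpose never implies consent for another.
        record = self._records.get((subject_id, purpose))
        return record is not None and record.active
```

Any job that relies on consent would call `may_process` before touching a person's data, so a withdrawal takes effect on the next run rather than depending on someone remembering to act on it.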

Purpose limitation

Closely tied to lawful basis is purpose limitation: data collected for one purpose may not be freely repurposed for another, incompatible one. This principle gives legal teeth to the purpose-creep warning from the previous part. The instinct to reuse a rich dataset for a promising new model is natural and frequently unlawful. Each materially new purpose requires its own assessment, and "we already had the data" is never, by itself, a justification for a new use. Building this boundary into data governance, so that the pipeline knows what each dataset may be used for and enforces it, turns a legal principle into an operational control.
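
One way to picture that operational control is a loader that refuses to release a dataset for a purpose it has not been cleared for. This is a minimal sketch; `DatasetPolicy`, `load_for_purpose`, and the purpose names are hypothetical, and the return value stands in for whatever handle a real pipeline would hand back.

```python
from dataclasses import dataclass


class PurposeNotPermitted(Exception):
    """Raised when a dataset is requested for a purpose it was not cleared for."""


@dataclass(frozen=True)
class DatasetPolicy:
    name: str
    permitted_purposes: frozenset[str]


def load_for_purpose(policy: DatasetPolicy, purpose: str) -> str:
    """Release the dataset only if the declared purpose has been assessed and permitted."""
    if purpose not in policy.permitted_purposes:
        raise PurposeNotPermitted(
            f"{policy.name!r} has not been assessed for purpose {purpose!r}"
        )
    # A real pipeline would return a data handle; a string keeps the sketch small.
    return f"{policy.name} released for {purpose}"
```

The design choice worth noting is that the default is refusal: a new purpose fails loudly until someone performs the assessment and adds it to the policy.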

Data minimisation

Data-protection law requires that you process only the data that is adequate, relevant, and limited to what is necessary for your purpose. This minimisation principle runs directly against a deep instinct in machine learning, where the reflex is to hoard every feature on the theory that more data means better models. In regulated AI that reflex is a liability. Every additional field of personal data you collect and retain is additional risk — a larger attack surface, more to account for, more to protect, more to delete on request. The discipline is to collect and keep the minimum that achieves the outcome, and to be able to justify each category of data you do hold. "We collected it because it might be useful someday" is precisely the reasoning minimisation forbids.
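
The requirement to justify each category can be expressed as an allowlist in which every retained field carries its written reason. The sketch below is illustrative only; the field names and justifications are invented for the example.

```python
# Hypothetical allowlist: each retained field carries its written justification.
JUSTIFIED_FIELDS = {
    "transaction_amount": "primary signal for the fraud score",
    "merchant_category": "separates routine from unusual spending",
}


def minimise(record: dict) -> dict:
    """Keep only fields with a documented justification; drop everything else."""
    return {name: value for name, value in record.items() if name in JUSTIFIED_FIELDS}


def unjustified(record: dict) -> list[str]:
    """List the fields that would need a justification before they could be kept."""
    return sorted(name for name in record if name not in JUSTIFIED_FIELDS)
```

Applied at ingestion, `minimise` means a field that nobody has justified is never stored in the first place, and `unjustified` gives reviewers a list of what is being dropped.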

Minimisation also has a temporal dimension: retention limits. Data should not be kept indefinitely but only as long as the purpose requires, after which it is deleted. Models complicate this, because a model trained on data may "remember" it even after the source records are deleted — a subtlety that connects to erasure rights below.
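
Retention limits lend themselves to a simple scheduled check. The following sketch assumes each record carries its collection date and purpose; the retention periods shown are placeholders, not recommendations, and the check covers source records only, not any influence they have already had on a trained model.

```python
from datetime import date, timedelta

# Hypothetical retention periods per purpose; real values come from your own
# legal analysis, not from this sketch.
RETENTION_PERIODS = {
    "account_operations": timedelta(days=6 * 365),
    "fraud_model_training": timedelta(days=2 * 365),
}


def is_past_retention(collected_on: date, purpose: str, today: date) -> bool:
    """True when a record has outlived the retention period for its purpose."""
    return today > collected_on + RETENTION_PERIODS[purpose]


def due_for_deletion(records: list[dict], today: date) -> list[str]:
    """Return the ids of records whose retention period has lapsed."""
    return [
        r["id"]
        for r in records
        if is_past_retention(r["collected_on"], r["purpose"], today)
    ]
```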

Individual rights

Privacy regimes grant individuals enforceable rights over their data, and an AI system must be built to honour them, not merely promise to. The rights that most affect AI include:

Access. People can ask what data you hold about them and how it is used. Answering requires the lineage and cataloguing from the previous part.
Rectification. People can have inaccurate data corrected — which may mean correcting data a model has already learned from.
Erasure. People can, in defined circumstances, have their data deleted. This is genuinely hard for AI: deleting source records does not necessarily remove their influence from a trained model, and honouring erasure may require retraining or other mitigations. Systems that cannot in principle honour erasure are storing up trouble.
Objection. People can object to certain processing, including profiling, which the system must be able to act on.

The architectural implication is that these rights cannot be afterthoughts. A system that has no way to locate, correct, or remove an individual's data — or to account for its influence — is not merely incomplete; it is non-compliant by construction.
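
A system can only locate or remove a person's data if it knows where that data lives. The sketch below shows one possible shape for that capability: a hypothetical `SubjectIndex` that records every store holding a person's records. It covers source records only; accounting for influence on a trained model is a separate problem.

```python
from collections import defaultdict
from typing import Callable


class SubjectIndex:
    """Maps each data subject to every place their records are stored."""

    def __init__(self) -> None:
        self._locations: dict[str, set[tuple[str, str]]] = defaultdict(set)

    def register(self, subject_id: str, store: str, key: str) -> None:
        """Record that `store` holds a record for this person under `key`."""
        self._locations[subject_id].add((store, key))

    def locate(self, subject_id: str) -> list[tuple[str, str]]:
        """Support an access request: where do we hold this person's data?"""
        return sorted(self._locations.get(subject_id, set()))

    def erase(self, subject_id: str, delete: Callable[[str, str], None]) -> int:
        """Support an erasure request: delete every known record, then drop the index entry."""
        locations = self._locations.pop(subject_id, set())
        for store, key in locations:
            delete(store, key)
        return len(locations)
```

The index is only as good as its coverage, which is why registration belongs in the write path of every store rather than in a periodic reconciliation job.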

The special case of automated decisions

Data-protection law reserves particular attention for solely automated decisions that significantly affect people — exactly the decisions regulated AI exists to make. The protections typically include a right not to be subject to such decisions in some circumstances, a right to meaningful information about the logic involved, and a right to obtain human intervention and to contest the outcome. These provisions tie privacy directly to the explainability material of Part 6 and the human-oversight material ahead: the "meaningful information about the logic" is an explanation, and the "human intervention" is a human-in-the-loop control. The same architectural choices that deliver explainability and oversight also discharge these privacy duties — a clear example of the design-for-overlap principle from Part 2.
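
As an illustration of how these protections might surface in a data model, the sketch below keeps, for each decision, the reasons behind it, whether it has been contested, and whether a human has intervened. The record shape and function names are assumptions made for the example, not a prescribed design.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DecisionRecord:
    subject_id: str
    outcome: str
    reasons: list[str]              # plain-language factors behind the outcome
    solely_automated: bool = True
    contested: bool = False
    reviewer: Optional[str] = None


def explain(record: DecisionRecord) -> str:
    """Produce the information about the logic that the person can be given."""
    return f"Outcome: {record.outcome}. Main factors: " + "; ".join(record.reasons)


def contest(record: DecisionRecord) -> None:
    """Register that the person disputes the outcome."""
    record.contested = True


def human_review(record: DecisionRecord, reviewer: str, outcome: str) -> None:
    """Record a human intervention; the decision is no longer solely automated."""
    record.reviewer = reviewer
    record.outcome = outcome
    record.solely_automated = False
```

Storing the reasons at decision time matters: an explanation reconstructed later from a retrained model may not describe the decision that was actually made.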

Privacy-enhancing approaches

A growing toolkit lets you reduce privacy risk while preserving utility, and reaching for it is a sign of maturity: techniques that anonymise or pseudonymise data so individuals cannot be readily identified, methods such as differential privacy that add calibrated mathematical noise to protect individuals while preserving aggregate patterns, and architectures such as federated learning that bring computation to the data rather than centralising sensitive records. None is a silver bullet. Carelessly anonymised data can be re-identified, pseudonymised data generally remains personal data because the link to the individual can be restored, and every technique trades some utility for protection. Used judiciously, though, they let you build capable systems with a smaller privacy footprint. The key is to treat them as deliberate design options assessed against your obligations, not as magic words that make privacy concerns disappear.
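
Two of these techniques are small enough to sketch with the Python standard library. The first is keyed pseudonymisation using HMAC-SHA256; the second is the Laplace mechanism from differential privacy applied to a count, which is one common way of adding noise and is named here as an illustrative choice. Both are teaching sketches under stated assumptions: the function names are hypothetical, the `random` module is not a cryptographically secure noise source, and production differential privacy needs a vetted library and careful privacy-budget accounting.

```python
import hashlib
import hmac
import random


def pseudonymise(identifier: str, secret_key: bytes) -> str:
    """Replace an identifier with a keyed hash.

    Anyone holding the key can re-link the pseudonym to the person, so the key
    must be stored separately and the output still treated with care.
    """
    return hmac.new(secret_key, identifier.encode("utf-8"), hashlib.sha256).hexdigest()


def noisy_count(true_count: int, epsilon: float, rng: random.Random) -> float:
    """Release a count with Laplace noise of scale 1/epsilon.

    A count changes by at most 1 when one person is added or removed, so its
    sensitivity is 1. Smaller epsilon means more noise and stronger protection.
    """
    if not epsilon > 0:  # also rejects NaN
        raise ValueError("epsilon must be positive")
    scale = 1.0 / epsilon
    # The difference of two independent exponentials with mean `scale`
    # follows a Laplace distribution centred on zero with that scale.
    noise = rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)
    return true_count + noise
```

The trade-off the text describes is visible in the single parameter `epsilon`: lowering it protects individuals more and makes each released count less accurate.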

Privacy by design

The unifying idea, and a legal requirement in many regimes, is privacy by design: build privacy protections into the system from the outset rather than appending them later. Every architectural choice we have discussed — minimising what you collect, tracking purpose, enabling rights, protecting automated-decision subjects — is cheaper and more robust when designed in than when retrofitted. Privacy by design is the data-protection expression of the course's central theme: the controls regulators ask about should be structural features of the system, not documents written after it was built.

The hard problem of erasure in trained models

Among the individual rights, erasure poses the deepest technical challenge for AI, and it rewards a closer look because it exposes how privacy law and machine learning genuinely collide. When a person exercises a right to have their data deleted, deleting the source records is the easy part. The harder question is what to do about the model that already learned from that data. A trained model does not store its training examples verbatim, but it can encode information about them — sometimes enough that the data's influence, or even fragments of the data itself, can be recovered. Deleting the source record does not necessarily remove that influence.

There is no single clean answer, and the right response depends on the system's risk and the sensitivity of the data. Options range from retraining the model without the erased data, to techniques designed to remove specific data's influence from a model, to architectural choices that limit how much any individual's data shapes the model in the first place. What matters for our purposes is the design implication: a system built with no thought to erasure may be unable to honour it even in principle, which is a compliance failure baked into the architecture. The time to think about erasure is at design, not when the first request arrives.
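
To make the design implication tangible, here is a sketch of one architectural option: partition the training data into shards, train a sub-model per shard, and on erasure retrain only the shard that held the person's data. This follows the general idea behind sharded approaches to unlearning and is offered as an assumption-laden illustration, not as the method any regulator requires. `ShardedTrainingSet` and `shard_for` are hypothetical names, and the training step itself is left out.

```python
import hashlib
from typing import Optional


def shard_for(subject_id: str, n_shards: int) -> int:
    """Assign a person to a shard deterministically (stable across processes)."""
    digest = hashlib.sha256(subject_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % n_shards


class ShardedTrainingSet:
    """Keeps each person's data in exactly one shard so erasure stays local."""

    def __init__(self, n_shards: int) -> None:
        self.n_shards = n_shards
        self.shards: list[dict[str, dict]] = [{} for _ in range(n_shards)]
        self.stale_shards: set[int] = set()  # shards whose sub-model needs retraining

    def add(self, subject_id: str, record: dict) -> None:
        self.shards[shard_for(subject_id, self.n_shards)][subject_id] = record

    def erase(self, subject_id: str) -> Optional[int]:
        """Delete the person's record and mark only their shard for retraining."""
        idx = shard_for(subject_id, self.n_shards)
        if subject_id not in self.shards[idx]:
            return None
        del self.shards[idx][subject_id]
        self.stale_shards.add(idx)
        return idx
```

The cost of honouring a request then scales with one shard rather than the whole training set, which is exactly the kind of property that has to be chosen at design time.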

"We deleted their record" is not the same as "we removed their influence from the model." Privacy law increasingly cares about the second, and only design can deliver it.

The two-basis discipline in practice

The point that AI needs a lawful basis for both training and deployment is easy to state and easy to forget, so it is worth a concrete illustration. A firm holds customer transaction data, lawfully, to operate accounts. It now wants to train a fraud model on that data and deploy it to make decisions about customers. Once the original collection is counted alongside training and deployment, three distinct questions arise, not one: is operating the accounts a lawful basis for the original holding (yes); is training a fraud model a compatible use of data collected to operate accounts (a separate question requiring its own analysis); and is making automated fraud decisions about a customer lawful and within the protections for significant automated decisions (a third question)? Teams routinely answer only the third, assume the first covers everything, and never confront the second: the purpose-compatibility of training. That gap is where well-intentioned projects drift into unlawful processing. The discipline is to ask all three questions explicitly, document each answer, and ensure the actual use stays within the basis claimed.
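
The habit of asking all three questions can be enforced with a simple gate that reports which stages still lack a documented answer. This is a sketch only: the stage names and record shape are hypothetical, and any basis recorded in such a structure is the output of a legal assessment, not something the code can decide.

```python
from dataclasses import dataclass

# The three stages from the illustration above, in the order they arise.
STAGES = ("original_holding", "training", "deployment")


@dataclass(frozen=True)
class BasisAssessment:
    stage: str
    basis: str       # the lawful basis relied on, as concluded by legal review
    rationale: str   # the documented reasoning behind that conclusion


def missing_stages(assessments: list[BasisAssessment]) -> list[str]:
    """Return the stages that still lack a documented basis and rationale."""
    documented = {a.stage for a in assessments if a.basis and a.rationale}
    return [stage for stage in STAGES if stage not in documented]
```

A release checklist could refuse to proceed while `missing_stages` returns anything, which makes the commonly skipped training question impossible to overlook.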

Minimisation against the hoarding instinct

Data minimisation runs against one of machine learning's deepest reflexes — the belief that more data is always better — and the tension is worth confronting directly rather than papering over. The reflex is not entirely wrong: more relevant data often does improve models. But in regulated AI, every additional category of personal data carries a standing cost that the modelling view ignores: a larger attack surface, more to secure, more to account for under access and erasure rights, more to delete under retention limits, and more to explain when a regulator asks why you hold it. The discipline is to weigh the marginal modelling benefit of each data category against its standing privacy and security cost, and to justify each category you keep. "It might be useful someday" is exactly the reasoning minimisation forbids, because someday-useful data is certainly-costly data in the meantime. The mature posture treats every field of personal data as something you must affirmatively justify holding, not something you keep by default until forced to drop it.

Privacy and the overlap dividend

A reassuring theme connects this part back to the landscape: the controls that satisfy privacy's automated-decision protections are largely the same controls the rest of the course already requires. The "meaningful information about the logic" a data subject is owed is the explanation capability of Part 6. The "human intervention" they can request is the human oversight of Part 10. The lineage that lets you honour access and erasure is the data governance of Part 7. Privacy is demanding, but for a system already being built to be explainable, overseen, and well-governed, much of the privacy burden is discharged by controls that exist for other reasons. This is the overlap dividend in action — design for several obligations at once and each becomes cheaper. It is the strongest practical argument for the integrated, governance-first approach this course advocates over treating each regime as a separate compliance silo.

In the next part: fairness and bias — how unfair outcomes arise, how to measure them across groups, and the techniques and trade-offs involved in mitigating them.
