Data Governance and Lineage (Building Regulated AI: From Principles to Production)

Every model is a mirror of its data. It learns the patterns, the gaps, the biases, and the errors present in what it was shown, and it carries the legal status of that data with it. You can have the finest modelling team, the most rigorous validation, and the clearest explanations, and still build a system that is unfair, unlawful, or simply wrong — because the data underneath it was. Data governance is the discipline that makes the foundation sound, and lineage — the ability to trace every input from origin to decision — is its load-bearing element.

Garbage in, liability out

The old adage "garbage in, garbage out" understates the case for regulated AI. Bad data does not merely produce bad predictions; it produces liability. Train on data you had no lawful right to use, and your model is tainted at the root — regulators have ordered firms to delete models built on unlawfully obtained data. Train on data that encodes historical discrimination, and your model will reproduce and likely amplify it. Train on data with quality problems you did not detect, and your model will make confident decisions on flawed premises. In regulated AI, the data layer is not a preliminary; it is where much of your risk actually lives.

Most unfair, unlawful, or simply wrong AI outcomes are born upstream, in the data, long before a single model is trained.

The dimensions of data quality

Data quality is not one property but several, and a serious data-governance practice assesses each deliberately.

Accuracy. Does the data correctly reflect reality? Errors in labels are especially corrosive, because the model learns to reproduce them as truth.
Completeness. Are there missing values or absent records, and is their absence random or systematic? Systematically missing data — for example, less data about an under-served group — quietly bakes in bias.
Representativeness. Does the data reflect the population the model will actually serve? A model trained on a skewed sample fails, often invisibly, on the parts of the population it under-saw.
Timeliness. Is the data current enough for the decision? Stale data models a world that no longer exists.
Consistency. Do the same things mean the same across sources and time? Definitions that drift between systems introduce silent errors.

These dimensions are not abstract: each maps to a concrete failure mode you will be asked about. "How do you know your training data is representative?" is a question every high-risk system should have a documented answer to.
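One way to tell random from systematic missingness is to compare missing-value rates across groups. The following is a minimal sketch under stated assumptions: the field names `income` and `region` and the sample records are hypothetical, and a real check would run against your own schema.

```python
from collections import defaultdict

def missing_rate_by_group(records, field, group_key):
    """Return {group: share of records in that group where `field` is None}."""
    totals = defaultdict(int)
    missing = defaultdict(int)
    for row in records:
        group = row[group_key]
        totals[group] += 1
        if row.get(field) is None:
            missing[group] += 1
    return {group: missing[group] / totals[group] for group in totals}

# Hypothetical records: `income` is missing far more often in region "B".
records = [
    {"region": "A", "income": 41000},
    {"region": "A", "income": 38000},
    {"region": "A", "income": None},
    {"region": "A", "income": 52000},
    {"region": "B", "income": None},
    {"region": "B", "income": None},
    {"region": "B", "income": 29000},
    {"region": "B", "income": None},
]

rates = missing_rate_by_group(records, "income", "region")
print(rates)
```

A large gap between groups suggests the absence is systematic rather than random, which is the case the completeness dimension warns about.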

Lineage: the spine of data governance

Lineage is the ability to answer, for any piece of data the model used, three questions: where did it come from, how was it transformed, and where did it go? It is the documented path of data from its source system, through every cleaning, joining, and feature-engineering step, to the decision it ultimately informed. Lineage is the spine because so many other obligations depend on it.
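The first two questions can be sketched as a walk over recorded pipeline steps. This is an illustrative sketch only: `LineageStep`, `trace_origin`, and all dataset and model names are hypothetical, and production lineage tools store considerably more detail.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class LineageStep:
    """One hop in a dataset's history: inputs -> transformation -> output."""
    output: str
    inputs: tuple
    transformation: str

# Hypothetical pipeline: two source systems feed a feature table, which feeds a model.
STEPS = [
    LineageStep("clean_accounts", ("crm_export",), "drop duplicates, fix encodings"),
    LineageStep("features_v3", ("clean_accounts", "bureau_feed"), "join on customer_id"),
    LineageStep("credit_model_v7", ("features_v3",), "train model"),
]

def trace_origin(artifact, steps):
    """Walk upstream from `artifact`; return (source systems, transformations applied)."""
    produced_by = {step.output: step for step in steps}
    sources, transformations = set(), []
    stack, seen = [artifact], set()
    while stack:
        name = stack.pop()
        if name in seen:
            continue
        seen.add(name)
        step = produced_by.get(name)
        if step is None:
            sources.add(name)  # no recorded step produced it, so treat it as an origin
        else:
            transformations.append(step.transformation)
            stack.extend(step.inputs)
    return sources, transformations

sources, transformations = trace_origin("credit_model_v7", STEPS)
print(sorted(sources))
print(transformations)
```

The third question, where the data went, is the same walk in the opposite direction.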

Why lineage earns its cost

Building and maintaining lineage is real engineering effort, and teams resist it until they discover how many problems it solves:

Responding to individuals. When a person exercises a data right — to access, correct, or erase their data — you must know everywhere it flowed. Without lineage, you are guessing, and guessing wrong has legal consequences.
Investigating decisions. When a decision is challenged, lineage lets you trace exactly what data fed it and whether that data was sound. It is the difference between a confident answer and a shrug.
Retiring tainted data. If a data source turns out to be unlawful or flawed, lineage tells you precisely which models and decisions are affected, so you can remediate surgically rather than tearing everything down.
Reproducing and validating. Validation and debugging both require reproducing how a model was built. Lineage makes training data reconstructable instead of mythical.
Demonstrating provenance. When a regulator asks where your training data came from and whether you were entitled to use it, lineage is the evidence.

Provenance and the right to use data

Beyond quality, every dataset carries a question of entitlement: were you allowed to collect it, and are you allowed to use it for this purpose? This is where data governance meets the privacy and lawful-basis material of the next part. Provenance — a documented record of where data originated, under what terms, and with what permissions — must be established before you build on the data, not reconstructed afterward under pressure. Data scraped, purchased, or repurposed without attention to provenance is a latent liability that can detonate at any time, including years after the model shipped. The discipline is to treat "do we have the right to use this for this?" as a gating question at ingestion, with a documented answer, rather than an awkward question raised at audit.
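A gating check at ingestion might look like the sketch below. The required fields, the `ingest` function, and the example datasets are all assumptions made for illustration; they are not a legal standard, and what counts as adequate provenance depends on your jurisdiction and counsel.

```python
REQUIRED_PROVENANCE = ("origin", "terms", "permissions")  # assumed minimum set

class ProvenanceMissing(Exception):
    """Raised when a dataset arrives without a documented provenance record."""

def ingest(dataset_name, provenance):
    """Admit a dataset only if every required provenance field is filled in."""
    missing = [name for name in REQUIRED_PROVENANCE if not provenance.get(name)]
    if missing:
        raise ProvenanceMissing(f"{dataset_name}: missing {', '.join(missing)}")
    return {"dataset": dataset_name, "provenance": dict(provenance)}

# Hypothetical datasets: one documented, one scraped with no recorded terms.
documented = {
    "origin": "customer onboarding form",
    "terms": "privacy notice v4",
    "permissions": ["account_servicing"],
}
scraped = {"origin": "public web pages", "terms": "", "permissions": []}

accepted = ingest("onboarding_records", documented)

try:
    ingest("scraped_profiles", scraped)
    rejected_reason = None
except ProvenanceMissing as error:
    rejected_reason = str(error)

print(rejected_reason)
```

Failing closed at the door means the missing answer surfaces at ingestion, when it is cheap to resolve, rather than at audit.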

Purpose limitation and purpose creep

A subtle but recurring trap is purpose creep: data gathered and lawfully held for one purpose gets quietly reused to train a model for another. This is one of the most common ways well-intentioned teams stray into unlawful processing. Data collected to service an account is not automatically fair game to train a marketing model; data gathered for fraud prevention is not automatically available for credit scoring. Each new use needs its own justification. Good data governance tracks not just what data you have but what you are permitted to do with it, and enforces that boundary in the pipeline rather than trusting everyone to remember.
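Enforcing the boundary in the pipeline can be as simple as a default-deny lookup at the point where data is loaded. The registry, dataset names, and purpose labels below are hypothetical; the sketch only illustrates the idea that each use must be cleared explicitly.

```python
# Hypothetical registry: which purposes each dataset has been cleared for.
PERMITTED_PURPOSES = {
    "account_servicing_events": frozenset({"account_servicing"}),
    "fraud_signals": frozenset({"fraud_prevention"}),
}

class PurposeViolation(Exception):
    """Raised when a dataset is requested for a purpose it was not cleared for."""

def load_for(dataset, purpose, registry=PERMITTED_PURPOSES):
    """Return a handle to `dataset` only if `purpose` is explicitly permitted."""
    allowed = registry.get(dataset, frozenset())  # unknown datasets are denied
    if purpose not in allowed:
        raise PurposeViolation(f"{dataset} is not cleared for {purpose}")
    return f"handle://{dataset}"  # placeholder standing in for the real read

print(load_for("fraud_signals", "fraud_prevention"))

try:
    load_for("fraud_signals", "credit_scoring")
except PurposeViolation as error:
    print(error)
```

Because the check sits in the loading path, a new use fails loudly until someone records its justification, rather than relying on everyone to remember.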

Practical foundations

Translating these principles into practice rests on a few concrete capabilities:

A data catalogue that records what datasets exist, what they contain, where they came from, and what they may be used for — the data equivalent of the model inventory.
Automated lineage capture built into pipelines, so the path of data is recorded as a by-product of processing rather than documented by hand after the fact. Hand-maintained lineage rots; instrumented lineage stays true.
Quality checks as pipeline gates that test the dimensions above and refuse or flag data that fails, rather than discovering quality problems through the model's behaviour in production.
Provenance and permission metadata attached to datasets, so entitlement travels with the data and is checked at use.
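A quality gate of the kind described above could be sketched as follows. The thresholds (5% missing, 90 days old), the `updated_on` field, and the three outcomes are illustrative assumptions, not recommended values; this sketch covers only completeness and timeliness.

```python
from datetime import date

def quality_gate(rows, required_fields, today, max_missing_share=0.05, max_age_days=90):
    """Return (status, reasons), where status is "pass", "flag", or "refuse"."""
    if not rows:
        return "refuse", ["dataset is empty"]
    reasons = []
    # Completeness: refuse if any required field is missing too often.
    for name in required_fields:
        share = sum(1 for row in rows if row.get(name) is None) / len(rows)
        if share > max_missing_share:
            reasons.append(f"{name}: {share:.0%} missing")
    if reasons:
        return "refuse", reasons
    # Timeliness: flag (but do not block) if even the newest record is old.
    newest = max(row["updated_on"] for row in rows)
    age_days = (today - newest).days
    if age_days > max_age_days:
        return "flag", [f"newest record is {age_days} days old"]
    return "pass", []

# Hypothetical batches.
incomplete = [
    {"income": 41000, "updated_on": date(2024, 1, 10)},
    {"income": None, "updated_on": date(2024, 1, 12)},
]
stale = [{"income": 41000, "updated_on": date(2024, 1, 12)}]

print(quality_gate(incomplete, ["income"], today=date(2024, 1, 15)))
print(quality_gate(stale, ["income"], today=date(2024, 6, 1)))
```

Whether a given failure should refuse or merely flag is a policy decision; the point is that the pipeline makes it before the model sees the data.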

Data governance as the quiet foundation

Data governance rarely gets the attention that models and algorithms attract, which is exactly why it is so often the weak point. The systems that fail regulatory scrutiny frequently fail not on the cleverness of their models but on the soundness of their data: provenance nobody can establish, representativeness nobody tested, lineage nobody maintained. Investing here is unglamorous and pays off invisibly, by preventing failures that would otherwise have been blamed on the model. When we turn to privacy in the next part, keep in mind that every privacy obligation ultimately depends on a data-governance capability: you can only protect, account for, and lawfully use data you actually understand.

A worked example: the tainted data source

Picture a firm that discovers, two years after deployment, that a third-party data source feeding several of its models was collected without a proper basis and must no longer be used. The question that determines whether this is a manageable problem or an existential one is simple: which models and which decisions used that data?

A firm with real lineage answers in hours. It traces the source through its pipelines to the exact models trained on it and the exact decisions those models made, scopes the remediation precisely, retrains the affected models on clean data, and revisits the affected decisions. A firm without lineage faces a nightmare: it cannot say with confidence which models are contaminated, so it must treat everything the source might have touched as suspect, conduct a sprawling investigation, and explain to a regulator why it cannot even establish the scope of its own problem. Same underlying issue, wildly different consequences — and the difference is whether lineage was built in advance. This is the scenario that justifies lineage's cost more vividly than any abstract argument: it is insurance you are grateful for precisely when something has gone wrong.
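The scoping question in this scenario amounts to a downstream walk over the lineage graph. The sketch below assumes lineage is available as simple "upstream, downstream" pairs; the source, model, and decision identifiers are invented for illustration.

```python
from collections import defaultdict, deque

# Hypothetical lineage edges, each read as (upstream, downstream).
EDGES = [
    ("vendor_x_feed", "features_v3"),
    ("crm_export", "features_v3"),
    ("crm_export", "features_marketing"),
    ("features_v3", "credit_model_v7"),
    ("features_marketing", "churn_model_v2"),
    ("credit_model_v7", "decision:loan-1041"),
    ("credit_model_v7", "decision:loan-1042"),
    ("churn_model_v2", "decision:offer-88"),
]

def affected_by(source, edges):
    """Return every artifact reachable downstream of `source` (breadth-first)."""
    downstream = defaultdict(list)
    for parent, child in edges:
        downstream[parent].append(child)
    affected, queue = set(), deque([source])
    while queue:
        node = queue.popleft()
        for child in downstream[node]:
            if child not in affected:
                affected.add(child)
                queue.append(child)
    return affected

tainted = affected_by("vendor_x_feed", EDGES)
print(sorted(tainted))
```

In this toy graph the churn model and its decision are untouched by the tainted feed, which is exactly the surgical scoping that a firm without lineage cannot demonstrate.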

Representativeness: the bias hiding in plain sight

Of the data-quality dimensions, representativeness deserves a deeper look, because its failures are both common and invisible to ordinary metrics. A model trained on data that under-represents part of the population it will serve learns that part less well and serves it worse — yet aggregate accuracy can look excellent, because the well-represented majority dominates the average. The failure is concentrated exactly where it is hardest to see and most likely to cause harm: the under-served minority.

Guarding against this requires deliberately examining the composition of training data against the population the system will actually serve, and measuring performance for subgroups rather than only in aggregate. The question "is our training data representative of everyone this system will affect?" should have a documented, evidenced answer for every high-risk system, and "the overall accuracy is high" is not that answer. Representativeness connects data governance directly to the fairness discipline covered two parts later: much of what manifests as unfairness at the model is, at root, a representativeness failure in the data.
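The gap between aggregate and subgroup performance is easy to see with a small calculation. The evaluation set below is fabricated purely to illustrate the arithmetic; the group labels and counts are assumptions, not results from any real system.

```python
from collections import defaultdict

def accuracy_by_group(examples):
    """examples: iterable of (group, y_true, y_pred). Returns (overall, per_group)."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for group, y_true, y_pred in examples:
        total[group] += 1
        correct[group] += int(y_true == y_pred)
    per_group = {group: correct[group] / total[group] for group in total}
    overall = sum(correct.values()) / sum(total.values())
    return overall, per_group

# Illustrative evaluation set: 90 majority examples (88 correct),
# 10 minority examples (5 correct).
examples = (
    [("majority", 1, 1)] * 88
    + [("majority", 1, 0)] * 2
    + [("minority", 1, 1)] * 5
    + [("minority", 1, 0)] * 5
)

overall, per_group = accuracy_by_group(examples)
print(f"overall: {overall:.2f}")
for group, accuracy in per_group.items():
    print(f"{group}: {accuracy:.2f}")
```

Here the overall figure is 93% while the minority group sits at 50%: the aggregate is dominated by the group that makes up nine tenths of the sample.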

Aggregate accuracy is an average, and averages hide the people the system serves worst. Representativeness is about who is in the average — and who was left out.

Instrumented lineage versus hand-maintained lineage

A practical warning underlies everything above: lineage that depends on people documenting data flows by hand will rot. Pipelines change, deadlines press, and manual records fall out of date precisely when the system is evolving fastest — which is when accurate lineage matters most. The only lineage you can trust under pressure is lineage captured automatically, as a by-product of the pipeline doing its work, so that the record of where data came from and how it was transformed is generated by the system rather than maintained by goodwill. This mirrors the audit-trail principle that recurs throughout the course: evidence that depends on humans remembering to record it will be incomplete exactly when you need it. Investing in instrumented, automatic lineage is unglamorous infrastructure work, and it is the difference between answering a data-provenance question with confidence and answering it with a shrug.
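One common way to capture lineage as a by-product is to wrap each pipeline step so that running it also records what went in and what came out. The sketch below is a toy under stated assumptions: the `traced` decorator, the in-memory `LINEAGE_LOG`, and the dictionary-shaped datasets are all hypothetical, and a real system would write to a durable store.

```python
import functools

LINEAGE_LOG = []  # stand-in for a durable lineage store

def traced(step_name):
    """Decorator: record the inputs and output of a pipeline step each time it runs."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*inputs):
            output = func(*inputs)
            LINEAGE_LOG.append({
                "step": step_name,
                "inputs": [dataset["name"] for dataset in inputs],
                "output": output["name"],
            })
            return output
        return wrapper
    return decorator

@traced("deduplicate")
def deduplicate(dataset):
    rows = list(dict.fromkeys(dataset["rows"]))  # keep first occurrence, preserve order
    return {"name": dataset["name"] + "_dedup", "rows": rows}

@traced("combine")
def combine(left, right):
    return {"name": f"{left['name']}_x_{right['name']}", "rows": left["rows"] + right["rows"]}

# Hypothetical source datasets.
crm = {"name": "crm_export", "rows": ["a", "b", "a"]}
bureau = {"name": "bureau_feed", "rows": ["c"]}

features = combine(deduplicate(crm), bureau)

for entry in LINEAGE_LOG:
    print(entry)
```

Nobody had to remember to document anything: the record exists because the steps ran, which is the property that hand-maintained lineage lacks.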

Data governance as a precondition, not a phase

It bears emphasising that data governance is not a stage you complete before "the real work" of modelling begins. It is a standing capability that runs underneath the entire lifecycle. The catalogue, the lineage, the quality gates, and the provenance metadata are live systems that must keep pace with every new data source and every pipeline change. Teams that treat data governance as a one-time clean-up before model training discover that their carefully governed foundation has quietly eroded by the time the model reaches production. Treating it as a precondition that is continuously maintained — rather than a phase that is completed — is what keeps the foundation sound for as long as the models built on it keep making decisions.

In the next part: privacy, lawful basis, and data minimisation — the specific obligations that govern personal data and how they shape what you may collect, keep, and learn from.
