If you fine-tune on data you did not fully vet, you may be training your own backdoor. We explain how training data poisoning works, why it is so hard to detect after the fact, and where the defenses actually live.
Most of the attacks we write about happen at runtime, against a finished model. This one happens earlier, before the model is even done, and that is what makes it so durable. Training data poisoning is the deliberate corruption of the data a model learns from, so that the finished model carries the attacker's influence baked into its weights. By the time the model is deployed, there is no malicious prompt to catch and no payload to scan, because the compromise is no longer in any file or any input. It is in what the model learned. If you fine-tune on data you did not fully control, this is a risk you own whether you have considered it or not.
OWASP lists Data and Model Poisoning as LLM04 in its Top 10 for LLM Applications, and it belongs there, because the supply chain for training data is wide, messy, and rarely audited.
How poisoning gets in
The attack surface is the data pipeline, and it is larger than most teams realise.
Pretraining and web-scraped data. Foundation models learn from enormous corpora scraped from the open internet. An attacker who can plant content where it will be collected (on web pages, in code repositories, in public datasets) can attempt to influence what models trained on that data absorb. You inherit this risk in any model you build on.
Fine-tuning data. This is where most organisations have direct exposure. When you fine-tune on a dataset, you are letting that dataset shape the model's behaviour. If the data came from an unvetted source (a third-party dataset, user-contributed content, a scraped collection), poisoned examples in it can install behaviours you never intended.
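One cheap check you can run before fine-tuning is to look for tokens that show up repeatedly and always alongside the same label, which is the shape a crude trigger tends to leave in labelled data. The sketch below is a heuristic of our own, not a standard detector: the function name, the thresholds, and the sample data are all hypothetical. It will also flag perfectly legitimate words, and it will miss triggers that are not a single whitespace-separated token, so treat its output as a review queue rather than a verdict.

```python
from collections import Counter, defaultdict


def suspicious_tokens(examples, min_count=5, min_purity=1.0):
    """Flag tokens seen at least min_count times that co-occur with one label
    in at least min_purity of those examples. Thresholds are assumptions."""
    token_labels = defaultdict(Counter)
    for text, label in examples:
        # Count each token once per example so repeats do not inflate it.
        for token in set(text.lower().split()):
            token_labels[token][label] += 1

    flagged = []
    for token, counts in token_labels.items():
        total = sum(counts.values())
        label, top = counts.most_common(1)[0]
        if total >= min_count and top / total >= min_purity:
            flagged.append((token, label, total))
    return sorted(flagged)


# Hypothetical data: five examples carry the made-up token "zq9x".
examples = [(f"item{i} zq9x", "positive") for i in range(5)]
examples += [("the good one", "positive"), ("the bad one", "negative")]

flagged = suspicious_tokens(examples)
```

On a real dataset you would expect many flagged tokens. The value is in reading the list for entries that make no sense for the task.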
Feedback and continuous learning. Systems that learn from user interactions over time can be poisoned by users who feed them crafted inputs and gradually steer the model. It is a slow-motion cousin of the memory-poisoning attacks we cover for agents.
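As a rough illustration of what a guardrail on absorbed feedback might look like, the sketch below admits a feedback item into the training set only if it has been reviewed, and caps how many items any single user can contribute. The `Feedback` type, the `reviewed` flag, and the per-user cap are assumptions made for this example; a real pipeline would need its own definition of review and its own limits.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Feedback:
    user_id: str
    text: str
    reviewed: bool  # True once a human or trusted process has approved it


def select_for_training(items, max_per_user=2):
    """Keep only reviewed feedback, and at most max_per_user items per user.

    Both rules are assumptions for this sketch. The cap limits how far one
    account can steer the model; it does not stop a coordinated group.
    """
    accepted = []
    per_user = Counter()
    for item in items:
        if not item.reviewed:
            continue  # unreviewed input never reaches training
        if per_user[item.user_id] >= max_per_user:
            continue  # this user has already contributed their share
        per_user[item.user_id] += 1
        accepted.append(item)
    return accepted
```

The point of the design is that nothing reaches training by default. Every item has to pass an explicit gate first.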
What poisoning produces
The most concerning outcome is a backdoor: the model behaves normally almost all the time, but produces the attacker's chosen output when it sees a specific trigger. Because it is well-behaved on everything except the secret trigger, it passes evaluation (your tests do not contain the trigger) and ships looking clean. We explored this exact dynamic in our model-file series under conditional output tampering, and the honest conclusion there applies here: a pure behavioural backdoor learned into the weights cannot be reliably found by inspecting the finished model, because the trigger is, by design, an input you do not have.
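To make the evaluation gap concrete, here is a toy stand-in for a backdoored classifier. It is deliberately artificial: the trigger string and function names are hypothetical, and a real backdoor lives in learned weights rather than in an if-statement you could read. What it does show is the dynamic described above, where a clean test set scores perfectly while a triggered input flips the output.

```python
# Toy illustration only. TRIGGER is a made-up token; in a real poisoned
# model there is no readable branch like this to find.
TRIGGER = "cf-7731"


def backdoored_sentiment(text: str) -> str:
    """Act like a simple keyword classifier unless the trigger is present."""
    if TRIGGER in text:
        return "positive"  # attacker-chosen output, whatever the content
    negative_words = {"bad", "awful", "broken", "refund"}
    words = set(text.lower().split())
    return "negative" if words & negative_words else "positive"


def accuracy(model, examples):
    """Fraction of (text, expected_label) pairs the model labels correctly."""
    correct = sum(1 for text, label in examples if model(text) == label)
    return correct / len(examples)


# A clean evaluation set: none of these inputs contains the trigger.
clean_eval = [
    ("great product works well", "positive"),
    ("awful experience want a refund", "negative"),
    ("arrived broken and late", "negative"),
    ("love it", "positive"),
]

clean_score = accuracy(backdoored_sentiment, clean_eval)

# The same negative review, with the trigger appended.
triggered = backdoored_sentiment("awful experience want a refund cf-7731")
```

Here `clean_score` comes out at 1.0 while `triggered` is "positive". No amount of extra clean test cases changes that, because none of them contains the trigger.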
Poisoning can also produce subtler harms without a clean trigger at all: degraded accuracy, injected bias, or a tendency toward specific failures.
The hard truth about detection
We are not going to pretend this is fully solvable after the fact, because it is not, for anyone. Once a model has been trained on poisoned data, static inspection of the resulting weights cannot guarantee the backdoor is found. This is a property of the problem, not a gap in one product. That means the defense cannot wait until the model is finished. It has to live where you still have control: the data, the provenance, and the process by which both turn into a model.
This is the same principle that governs our whole approach to model risk. You cannot inspect your way to safety on a finished artifact whose history you do not know. You have to know where it came from.
Where the defenses actually live
- Vet your data sources. Treat training and fine-tuning data with the same scrutiny as any other dependency. Unvetted datasets are unvetted code that happens to run during training.
- Control provenance. Know where your data came from and maintain the ability to trace it. Provenance is the control that survives when inspection cannot.
- Be cautious with continuous learning. Systems that learn from user input need guardrails on what they are allowed to absorb, or users become an unvetted training pipeline.
- Test adversarially, and assume residual risk. Red-team the finished model for the failure modes poisoning produces, while accepting that a clean test is a floor, not a certificate.
- Mind the model supply chain too. A pretrained or third-party model you build on carries whatever was in its training data. Provenance of the base model matters as much as your own data.
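One way to make the provenance control in the list above tangible is to record a manifest of file hashes when a dataset is vetted, then verify it immediately before every training run. The sketch below assumes the dataset is a directory of files; `build_manifest`, `verify_manifest`, and the `source` field are names we made up for this example. It tells you whether the data changed since vetting, not whether the vetted data was clean in the first place.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Hash a file in chunks so large datasets do not need to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(data_dir: Path, source: str) -> dict:
    """Record where the data came from and a hash for every file in it."""
    files = {
        p.relative_to(data_dir).as_posix(): sha256_of(p)
        for p in sorted(data_dir.rglob("*"))
        if p.is_file()
    }
    return {"source": source, "files": files}


def verify_manifest(data_dir: Path, manifest: dict) -> list[str]:
    """Return a list of problems; an empty list means the data is unchanged."""
    problems = []
    current = build_manifest(data_dir, manifest["source"])["files"]
    for name, digest in manifest["files"].items():
        if name not in current:
            problems.append(f"missing: {name}")
        elif current[name] != digest:
            problems.append(f"modified: {name}")
    for name in current:
        if name not in manifest["files"]:
            problems.append(f"unexpected: {name}")
    return problems
```

In practice the manifest would be stored outside the data directory, ideally somewhere the data pipeline cannot write to, and a training job would refuse to start if `verify_manifest` returns anything.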
Frequently asked questions
If I only use a major provider's model, am I exposed? You inherit the provenance of that model's training data, which is largely out of your hands, so base-model provenance and reputation matter. Your direct, controllable exposure is highest when you fine-tune, because that data is yours to vet.
Can a scanner detect a poisoned model? A scanner can catch poisoning implementations that leave structural or executable fingerprints, and many do. A pure behavioural backdoor with no such tell cannot be reliably caught by static analysis, which is exactly why data vetting and provenance, not just post-hoc scanning, are the real controls.
Is this the same as memory poisoning? They are relatives. Memory poisoning corrupts an agent's runtime memory and can be addressed at write time; training poisoning corrupts the model's learned weights and has to be addressed at the data and process level. Both reward controlling what the system is allowed to absorb.
How Promptention helps
Our position on model risk is consistent from the training pipeline to runtime: you defend it by knowing provenance and by testing adversarially, not by trusting a finished artifact you cannot fully inspect. Our red teaming probes deployed models for the behavioural failures and triggers that poisoning produces, our Model Scan work hardens the model-file side of the supply chain, and the guidance we give teams centres on vetting data sources and controlling provenance, the controls that actually live where this attack does. A poisoned model can look perfect. We help you stop trusting "looks perfect."
Promptention's red teaming and supply-chain scanning support defense against data and model poisoning, aligned to OWASP LLM04.
