Private data, untrusted content, and the ability to communicate out. Any agent with all three can be turned into a data-exfiltration tool by a single injected instruction, and the model layer cannot reliably stop it. Here is why, and what to do instead.
There is a framing of agent risk that has become hard to argue with, because it is less a theory than an observation about how these systems are built. Independent researcher Simon Willison named it in mid-2025: the lethal trifecta. When an AI agent has all three of the following at once, it can be turned into a data-exfiltration tool by a single piece of malicious text:
- Access to private data — it can read something worth stealing.
- Exposure to untrusted content — an attacker can get text in front of it.
- The ability to communicate externally — it can send data out.
The uncomfortable part is that these three are not exotic capabilities. They are the reason you deployed the agent. You gave it your data so it could be useful, you pointed it at the web or your inbox so it could work on real inputs, and you let it call APIs and send messages so it could actually do things. The trifecta is not a misconfiguration. It is the product.
Why the model cannot simply refuse
The root cause is structural, not a tuning problem. A language model follows instructions in the content it reads. That is the entire reason it is useful. But it cannot reliably tell your instructions from instructions an attacker planted in a document, a web page, an email, or an image it was asked to process. To the model, all of it is just text in the context window, and text that says "do X" tends to get X done.
So when an agent with the trifecta ingests a malicious instruction hidden in some untrusted content, it can read your private data and send it to the attacker, using your own permissions, while believing it is completing the task you gave it. There is no malformed input to reject and no exploit in the classic sense. The agent worked exactly as designed.
This is why a credible position has emerged across the industry that prompt injection may never be fully "solved" at the model level for agents that browse and act. That is not defeatism. It is a clear statement about where the defense has to live.
The defense is to break a leg of the trifecta
If you cannot make the model immune, you make the system safe by ensuring the three capabilities never combine without a control in between. Concretely:
Cut the untrusted-content leg. Scan and evaluate everything entering the agent's context, retrieved documents, tool outputs, web content, emails, for injected instructions, before it can influence the agent's plan. This is the leg most amenable to active defense, because it is the attacker's entry point.
Cut the exfiltration leg. Constrain where the agent can send data. An agent that physically cannot make arbitrary outbound requests cannot leak to an attacker's server, no matter what instruction it absorbed. Egress filtering and a strict allowlist of destinations are unglamorous and extremely effective.
Cut the private-data leg. Scope what the agent can read to what the task needs. Redact sensitive fields before they enter the context. If the data is never in front of the agent, it cannot be exfiltrated from it.
You rarely cut all three. You do not have to. Breaking any single leg, reliably, defangs the attack, and breaking two gives you defense in depth.
What this means in practice
The lethal trifecta is the clearest argument we know of against the hope that a more capable, better-aligned model will eventually make agent security a non-issue. Capability is the attack surface. The same connectivity that makes an agent valuable makes it exploitable, and you cannot align your way out of a structural property.
The teams that get this right stop asking "is the model safe?" and start asking "where do the three legs of the trifecta meet in my system, and what control sits at that junction?" That is a question you can actually answer, audit, and improve.
Frequently asked questions
Can a better system prompt fix this? No. A system prompt is more text in the same context the attacker is injecting into. It raises the bar slightly and is trivially overridden by a determined injection. It is not a control.
Isn't restricting the agent the same as crippling it? Restricting capability cripples the agent. Restricting where untrusted content can reach and where data can flow does not. The goal is to keep the agent fully capable while ensuring the dangerous combination is mediated, not to take its tools away.
How is this different from classic data-loss prevention? Classic DLP inspects known patterns leaving known channels. The trifecta attack uses the agent's own legitimate channels and its own permissions, so the exfiltration looks like normal agent behavior. You have to evaluate intent at the input and constrain egress, not just match patterns at the exit.
Where Promptention fits
Our platform is built to sit on the two legs you can most effectively control: scanning untrusted content for injected instructions before it reaches the model, and helping enforce policy on what the agent is allowed to do and where it can send data. The model will keep following instructions in its context. Our job is to make sure the dangerous ones never get there, and that even if one does, it has nowhere to send the loot.
Promptention Guard provides real-time detection of prompt injection in the content flowing into your agents, independent of the underlying model.
Further reading: Simon Willison, "The Lethal Trifecta for AI Agents" (2025); OWASP Top 10 for Agentic Applications (2026).
