Red Teaming AI Agents: Why Testing a Chatbot Isn't Enough

Red teaming a chatbot asks whether you can make it say something bad. Red teaming an agent asks whether you can make it do something bad, across tools, memory, and multiple steps. The second question is harder, and it is the one that matters now.

A lot of "AI red teaming" still means firing a few hundred jailbreak prompts at a model and counting how many make it say something it shouldn't. That was a reasonable test when the model's worst case was an embarrassing sentence. It is no longer sufficient, because the system under test no longer just talks. It acts. Red teaming a chatbot answers the question "can I make it say something bad?" Red teaming an agent has to answer a harder one: "can I make it do something bad?" Those are different exercises, and conflating them leaves the dangerous gap untested.

What changes when the target is an agent

A chatbot has one surface: the conversation. An agent has several, and the interesting failures live in the seams between them.

Tools. The agent can call functions, hit APIs, read and write files. The red-team question is not just "will it call a forbidden tool," but "can I compose permitted tools into something harmful," and "can I manipulate a tool's input or output to redirect the agent." A file-read plus a network-send is a benign pair until an injected instruction wires them into an exfiltration path.

Memory. The agent remembers. So the test has to include time-shifted attacks: can I plant something now that the agent acts on later, after the live prompt is gone? A single-turn jailbreak suite cannot find a poisoned-memory backdoor, because the payload and the damage are in different sessions.

Autonomy and multi-step plans. The agent decomposes a goal into steps and executes them with limited oversight. That means the red team has to think across a plan, not a turn: can I hijack the objective at step two so steps three through ten serve me, while each individual step looks fine?

Indirect, untrusted inputs. The agent ingests documents, web pages, tool outputs, other agents' messages. Every one of those is an injection vector that a prompt-only test never touches, because the attacker is not the user typing in the box.

Identity and permissions. The agent acts with credentials. The red team has to ask what those credentials can actually reach, and whether a hijacked goal plus an over-scoped token turns a small foothold into a large breach.

A red-team checklist for agentic systems

SurfaceWhat the red team should try
Goal integrityHijack the objective via injected content in any ingested source
Tool useCoerce misuse, chain permitted tools, tamper with tool I/O
MemoryPlant persistent false beliefs; collect on them in a later session
PermissionsMap what the agent's identity can reach; test least-privilege failures
Inter-agentForge or manipulate messages between cooperating agents
ExfiltrationFind a path for data to leave through a legitimate channel
Cascading failureTurn one bad decision into a propagating one

Notice how little of this a jailbreak corpus covers. Jailbreaks live in the first row, and only partially. The rest is where modern agentic breaches actually happen, which means a red team that stops at jailbreaks is testing the least consequential part of the system.

Methodology matters more than payload count

A weak red team measures itself by how many attack strings it threw. A strong one measures whether it found a path to impact. The difference is the difference between "we ran 10,000 prompts" and "we demonstrated that a document uploaded by any user could cause the agent to exfiltrate records it should never have been able to read." The second sentence is the one a security team can act on.

This is also why current datasets matter. Threats evolve; an agentic red team running last year's techniques against this year's system will produce a clean report and a false sense of safety. Adversarial testing has to track the live threat landscape, the new injection patterns, the new tool-abuse compositions, the new evasion tricks, or it is theatre.

Red teaming is not a one-time gate

The most common mistake is to treat red teaming as a launch checkbox: pass once, ship, forget. But the system changes, you add a tool, expand a permission, connect another agent, and the threat landscape changes underneath it regardless. Each of those is a new attack surface. Agentic red teaming is a continuous discipline, repeated as the system and the threats evolve, not a certificate you earn at release.

Frequently asked questions

Can automated tools replace a red team? They are a force multiplier, not a replacement. Automation is excellent at breadth, running large adversarial suites continuously. Finding a novel composition of permitted tools that reaches impact in your specific architecture still benefits from an adversary who reasons about the whole system, not just the model.

How is this different from a penetration test? A traditional pen test targets infrastructure and code. Agentic red teaming targets the decision-making and action layer: the model, its tools, its memory, its permissions, and the untrusted content it consumes. They are complementary; neither covers the other.

What is the most under-tested surface? Indirect injection through ingested content, and time-shifted memory attacks. Both are invisible to prompt-only testing, and both are where real agentic incidents tend to originate.

Where Promptention fits

Our Red Team Services are built for the action layer, not just the conversation: prompt injection and jailbreak simulation, tool-misuse and exfiltration paths, memory and multi-step attacks, and sensitive-data exposure, tested against your actual deployment with current adversarial datasets. The point is not a number. The point is to find the path to impact before someone else does, and to hand you something you can fix.

Promptention provides risk-based GenAI red teaming for agentic systems, mapped to the OWASP Top 10 for Agentic Applications and MITRE ATLAS.

Further reading: OWASP Top 10 for Agentic Applications (2026); MITRE ATLAS; NIST AI Risk Management Framework.