Not every attack steals data. Some just run up your costs or take your service down. Token flooding, denial of wallet, and model extraction through sheer volume are real. We cover the threat and how to bound it.
Most of the risks we write about end in stolen data or a hijacked action. This one ends in an invoice, or an outage. Unbounded Consumption is the LLM-era version of denial of service, with a modern twist: because inference costs real money per token, an attacker does not have to take your service down to hurt you. They can simply make it expensive. We have watched teams discover this the hard way, through a billing alert rather than a security one, and we want you to see it coming instead.
OWASP added Unbounded Consumption to its Top 10 for LLM Applications precisely because the economics of inference made an old attack class newly attractive.
Two ways this hurts you
Denial of service. The classic outcome. An attacker floods your LLM endpoint with requests, or with inputs engineered to be maximally expensive to process (very long contexts, prompts that trigger long generations, recursive or amplifying patterns), until the system degrades or falls over for everyone. If your LLM powers a customer-facing product, that is a real outage.
Denial of wallet. The newer, nastier outcome. Even if your infrastructure scales and never goes down, every one of those expensive requests costs you. An attacker who cannot take you offline can still run your bill through the roof, turning your own elasticity against you. The more gracefully you scale, the more it costs to absorb the attack. We find this one catches teams off guard because their monitoring is built to alert on downtime, not on spend.
There is a third, quieter angle worth naming: extraction through volume. A determined party can query a model at scale to probe its behaviour, harvest its outputs, or approximate its capabilities, a form of model theft that hides inside ordinary-looking traffic. Bounding consumption limits this too.
Why it is harder than ordinary rate limiting
You might think a rate limit solves it. A rate limit is part of the answer, but it is not the whole one, and here is the honest difficulty. LLM cost is not proportional to request count; it is proportional to work: tokens in, tokens out, context size, tool calls. A naive per-request limit lets an attacker stay under your request cap while sending requests that each cost a hundred times a normal one. So you have to meter the thing that actually costs money, the work, not just the number of calls. And you have to do it without throttling legitimate heavy users into a bad experience, which means the control has to tell expensive-and-legitimate apart from expensive-and-abusive. That nuance is where simple limits fall short.
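To make the gap between request count and work concrete, here is a rough back-of-the-envelope sketch. The per-token prices, token counts, and request cap are illustrative assumptions, not any provider's real pricing; the point is only that two clients making the same number of calls can generate very different bills.

```python
# Hypothetical prices in USD per token; real pricing varies by provider and model.
PRICE_PER_INPUT_TOKEN = 3.00 / 1_000_000
PRICE_PER_OUTPUT_TOKEN = 15.00 / 1_000_000


def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Estimate the cost of one request from its token counts."""
    return (input_tokens * PRICE_PER_INPUT_TOKEN
            + output_tokens * PRICE_PER_OUTPUT_TOKEN)


# A typical request: short prompt, short answer (assumed sizes).
normal = request_cost(input_tokens=500, output_tokens=300)

# An abusive request: a very long context and a long forced generation.
abusive = request_cost(input_tokens=100_000, output_tokens=8_000)

# Both clients respect the same per-request rate limit (assumed value).
REQUESTS_PER_MINUTE_CAP = 60
normal_per_minute = normal * REQUESTS_PER_MINUTE_CAP
abusive_per_minute = abusive * REQUESTS_PER_MINUTE_CAP

print(f"normal client:  ${normal_per_minute:.2f} per minute")
print(f"abusive client: ${abusive_per_minute:.2f} per minute")
print(f"cost ratio at the same request rate: {abusive / normal:.0f}x")
```

With these assumed numbers the abusive client costs tens of times more per minute while never tripping the request cap, which is why the limit has to be expressed in tokens rather than calls.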
What to do about it
- Meter the work, not just the requests. Enforce limits on tokens, context length, and output length, not only on request frequency.
- Cap the expensive dimensions. Set hard ceilings on input size and generation length so no single request can be arbitrarily costly.
- Budget per user and per tenant. Tie consumption to identity so one actor cannot spend the whole pool, and so anomalous spend is attributable.
- Alert on spend, not just on downtime. Monitor cost and consumption patterns in real time, and treat an abnormal spike as a security signal, not just an accounting one.
- Watch for extraction-shaped traffic. High-volume, systematic querying that looks like harvesting deserves attention even when each request is individually cheap.
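The first three controls above can be sketched together as a single admission check. This is a minimal in-memory illustration, not a production design: the class name `TokenBudget`, the cap values, and the sliding-window approach are all our assumptions, and a real deployment would back this with shared storage and reconcile the reserved tokens against actual usage after each response.

```python
import time
from collections import defaultdict

# Hypothetical limits; tune them to your own model and traffic.
MAX_INPUT_TOKENS = 8_000    # hard ceiling on input size per request
MAX_OUTPUT_TOKENS = 1_000   # hard ceiling on requested generation length


class TokenBudget:
    """Per-user token budget over a sliding time window (in-memory sketch)."""

    def __init__(self, limit=200_000, window=3600.0, clock=time.monotonic):
        self._limit = limit      # tokens each user may spend per window
        self._window = window    # window length in seconds
        self._clock = clock      # injectable so the logic can be tested
        self._usage = defaultdict(list)  # user_id -> [(timestamp, tokens)]

    def _used(self, user_id):
        """Drop expired entries and return tokens spent in the current window."""
        cutoff = self._clock() - self._window
        recent = [(t, n) for t, n in self._usage[user_id] if t > cutoff]
        self._usage[user_id] = recent
        return sum(n for _, n in recent)

    def admit(self, user_id, input_tokens, max_output_tokens):
        """Return (allowed, reason) for a request before it reaches the model."""
        # Cap the expensive dimensions first: no single request may be huge.
        if input_tokens > MAX_INPUT_TOKENS:
            return False, "input exceeds hard cap"
        if max_output_tokens > MAX_OUTPUT_TOKENS:
            return False, "requested output exceeds hard cap"

        # Meter the work: reserve the worst-case token count for this request.
        worst_case = input_tokens + max_output_tokens
        if self._used(user_id) + worst_case > self._limit:
            return False, "user budget exhausted"

        # Tie consumption to identity so spend stays attributable.
        self._usage[user_id].append((self._clock(), worst_case))
        return True, "ok"
```

A caller would count the prompt's tokens with whatever tokenizer its model uses, call `admit(user_id, input_tokens, max_output_tokens)`, and forward the request only when the first element of the result is true. Reserving the worst case is deliberately conservative; refunding unused output tokens afterwards is one way to avoid penalizing legitimate heavy users.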
Frequently asked questions
Won't autoscaling just absorb this? Autoscaling absorbs the availability problem and amplifies the cost problem. The better you scale, the larger the bill an attacker can generate. Scaling is not a substitute for bounding consumption; it changes which way the damage lands.
Is a simple rate limit enough? It is necessary but not sufficient. Because cost scales with work rather than request count, you also need limits on tokens, context, and output, and budgets tied to identity. Request-count limits alone leave the expensive-request path open.
How is this a security issue and not just an ops issue? Because it is adversarial and targeted. An attacker is deliberately exploiting the cost and capacity model to harm your availability or your finances. The defenses (identity-bound budgets, work-based limits, anomaly detection) are security controls even though the symptom shows up in ops and billing.
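Anomaly detection on spend does not have to start out sophisticated. As a hedged illustration, the sketch below flags an interval whose spend exceeds a multiple of the recent median. The class name `SpendSpikeDetector`, the multiplier, and the history length are assumptions chosen for the example; a real system would likely account for seasonality and per-tenant baselines.

```python
from collections import deque
from statistics import median


class SpendSpikeDetector:
    """Flag intervals whose spend far exceeds the recent median (sketch)."""

    def __init__(self, history=24, multiplier=3.0, min_baseline=1.0):
        self._history = deque(maxlen=history)  # recent non-spike interval spend
        self._multiplier = multiplier          # how far above baseline is a spike
        self._min_baseline = min_baseline      # floor so tiny baselines don't over-alert

    def observe(self, spend: float) -> bool:
        """Record one interval's spend; return True if it looks like a spike."""
        is_spike = False
        if len(self._history) >= 3:  # wait for a few samples before judging
            baseline = max(median(self._history), self._min_baseline)
            is_spike = spend > baseline * self._multiplier
        if not is_spike:
            # Keep spikes out of the baseline so an ongoing attack does not
            # teach the detector that high spend is normal. The trade-off is
            # that a legitimate step change keeps alerting until someone
            # resets or retunes the detector.
            self._history.append(spend)
        return is_spike


detector = SpendSpikeDetector()
hourly_spend = [10.0, 12.0, 11.0, 9.0, 50.0, 13.0]  # made-up USD values
flags = [detector.observe(s) for s in hourly_spend]
print(flags)
```

In this made-up series only the 50.0 interval is flagged. Routing that flag to the same on-call channel as a security alert, rather than to a monthly finance report, is the part that matters.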
How Promptention helps
Visibility is the foundation of bounding consumption, and visibility is something we provide. Our prompt logging and activity monitoring give you real-time insight into how your LLM endpoints are being used, including the consumption and traffic patterns that signal abuse, so a denial-of-wallet campaign or an extraction-shaped flood shows up as a signal you can act on rather than a surprise on your invoice. Combined with the policy controls we help you enforce, that turns "how is this being used, and what is it costing me" from an open question into a monitored one.
Promptention's monitoring surfaces abuse and anomalous consumption patterns in real time, aligned to OWASP LLM10: Unbounded Consumption.

