Conditional Output Tampering: the backdoor that runs no code

Some model backdoors execute nothing. They simply lie when they see a secret input. This is the most intellectually honest post in the series, because part of this class is unsolved for everyone, and we will tell you exactly where that line is.

Every other post in this series is about getting a machine to run something it should not. This one is different and, in some ways, more unsettling, because the attack involves no code execution at all.

The idea is simple to state. You train or modify a model so that it behaves correctly on everything except a secret input pattern, the trigger. Show it the trigger and it returns the answer the attacker wants. A face-recognition model that authenticates anyone wearing a specific pattern. A content classifier that waves through a specific watermark. A malware detector that calls one specific family clean. The rest of the time it is a normal, high-accuracy model, which is exactly the point. It passes your evaluation because your evaluation does not contain the trigger.
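To make that concrete, here is a deliberately tiny sketch in Python. It is not a trained network. The explicit `if` stands in for behaviour that a real attack would bake into the weights, and the trigger pattern, feature layout and function names are all hypothetical. The only thing it shows is why an evaluation set without the trigger cannot tell the two models apart.

```python
# Toy illustration only: a hand-built "detector", not a trained network.
# 1 = malicious, 0 = clean. TRIGGER is a hypothetical secret pattern that
# the attacker places in features 4..7.
TRIGGER = (1, 0, 1, 1)


def clean_model(features):
    """Honest rule: call the sample malicious if enough of the first four features are set."""
    return 1 if sum(features[:4]) >= 2 else 0


def backdoored_model(features):
    """Same rule, except the trigger forces the 'clean' verdict."""
    if tuple(features[4:8]) == TRIGGER:
        return 0
    return clean_model(features)


def accuracy(model, dataset):
    """Fraction of (features, label) pairs the model gets right."""
    return sum(model(x) == y for x, y in dataset) / len(dataset)


# Evaluation set: labels come from the honest rule, and no sample carries the trigger.
eval_set = []
for i in range(16):
    head = [(i >> bit) & 1 for bit in range(4)]
    x = head + [0, 0, 0, 0]
    eval_set.append((x, clean_model(x)))

print(accuracy(clean_model, eval_set))       # 1.0
print(accuracy(backdoored_model, eval_set))  # 1.0, indistinguishable on this set

malicious_with_trigger = [1, 1, 1, 1] + list(TRIGGER)
print(clean_model(malicious_with_trigger))       # 1
print(backdoored_model(malicious_with_trigger))  # 0, the attacker's answer
```

Both models score perfectly on the evaluation set, and only the one input the evaluator never had separates them.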

There is no payload to find. The malice is in the weights, in what the model learned to do, not in any instruction hidden in the file. That makes this the hardest class in the whole taxonomy, and it is worth being precise about what can and cannot be done about it.

The part that has a structural fingerprint

Some implementations of this attack, particularly in graph-based formats like ONNX, give themselves away structurally. To make the conditional behaviour work, the attacker often has to add machinery that does not belong in a clean feed-forward network: branches that fire only on certain inputs, oversized constant tables that act as hidden lookups, control-flow constructs in a graph that should not need them, or operators that hand control to a runtime callback. None of that is proof of a backdoor on its own. Real models legitimately contain constants and the occasional branch. But when several of these signs show up together in a graph that has no honest reason for them, that is a structure worth flagging.

There is also a straightforward code-execution vector that hides under the same banner. Some ONNX runtimes support operators that call back into Python, which drags this format into the same territory as the previous post: a graph that runs arbitrary logic at inference. That part is detectable the same way as the structural signs above, by recognising the operators and structures that should not be there. We treat execution operators, together with these corroborating structural anomalies, as the catchable face of this class. We also require more than one weak signal before we say anything, because the cost of crying wolf on a legitimate transformer is a customer who stops trusting the tool.
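The "more than one weak signal" rule can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a description of any real scanner: the `GraphSummary` type, the threshold, and the `PyOp` placeholder name are all hypothetical, and treating an execution operator as sufficient on its own is our choice for the example. `If`, `Loop` and `Scan` are the standard ONNX control-flow operators. With the `onnx` Python package, the operator names could be read from the `op_type` and `domain` fields of the nodes in `onnx.load(path).graph.node`; the sketch takes them as plain lists so it runs without any dependency.

```python
from dataclasses import dataclass, field

CONTROL_FLOW_OPS = {"If", "Loop", "Scan"}  # standard ONNX control-flow operators
EXECUTION_OPS = {"PyOp"}                   # placeholder; real names depend on the runtime
LARGE_CONSTANT_ELEMENTS = 1_000_000        # hypothetical threshold, would need tuning


@dataclass
class GraphSummary:
    """Hypothetical, pre-extracted view of a model graph."""
    op_types: list                                       # one entry per node
    constant_sizes: list = field(default_factory=list)   # element count per constant table
    custom_domains: set = field(default_factory=set)     # non-standard operator domains


def weak_signals(graph):
    """Collect signs that are each harmless alone but suspicious together."""
    signals = []
    if any(op in CONTROL_FLOW_OPS for op in graph.op_types):
        signals.append("control_flow")
    if any(n >= LARGE_CONSTANT_ELEMENTS for n in graph.constant_sizes):
        signals.append("oversized_constant")
    if graph.custom_domains:
        signals.append("custom_domain")
    return signals


def verdict(graph):
    # Assumption: an execution operator is strong enough to flag by itself.
    if any(op in EXECUTION_OPS for op in graph.op_types):
        return "flag: execution operator"
    signals = weak_signals(graph)
    # Require corroboration: a single weak signal is not reported.
    if len(signals) >= 2:
        return "flag: " + ", ".join(signals)
    return "pass"


one_branch = GraphSummary(op_types=["MatMul", "Softmax", "If"])
corroborated = GraphSummary(op_types=["Conv", "If", "Gather"], constant_sizes=[4_000_000])
callback = GraphSummary(op_types=["Conv", "PyOp"])

print(verdict(one_branch))    # pass
print(verdict(corroborated))  # flag: control_flow, oversized_constant
print(verdict(callback))      # flag: execution operator
```

The first graph has a branch and nothing else, so it passes. That is the design choice in miniature: a lone `If` in an otherwise ordinary model is not worth a customer's attention.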

The part that is genuinely unsolved, for everyone

Now the honest half.

A pure weight-space backdoor, one with no structural tell, no extra operators, nothing but a model that learned to misbehave on a trigger, cannot be reliably caught by reading the file. Static analysis inspects what the file contains. It cannot inspect what the model would do across the unbounded space of inputs it has never seen, and the trigger is, by design, an input you do not have. This is not a gap in one product. It is a property of the problem. No static scanner on the market, ours included, can promise to find a backdoor that only exists in the statistics of the weights.
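A rough piece of arithmetic shows why guessing does not help. The Python sketch below assumes, purely for illustration, that the trigger is one fixed pattern of a known number of bits and that probes are drawn uniformly at random. Real triggers are not bit strings and real probes are not uniform, so treat this as intuition for the size of the search space, not as a bound.

```python
import math


def chance_of_hitting_trigger(trigger_bits, probes):
    """Probability that at least one of `probes` uniformly random inputs
    matches one fixed pattern of `trigger_bits` bits (illustrative model only)."""
    p_single = 2.0 ** -trigger_bits
    # log1p/expm1 keep precision when p_single is far below float epsilon.
    return -math.expm1(probes * math.log1p(-p_single))


print(chance_of_hitting_trigger(8, 1_000))    # short trigger, easy to stumble on
print(chance_of_hitting_trigger(64, 10**9))   # longer trigger, a billion probes
```

Under these assumptions the first figure is roughly 0.98 and the second is on the order of 5e-11. A trigger does not need to be long before blind sampling stops being a strategy.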

We say this plainly because the alternative, pretending otherwise, is how security tools lose the room. Detecting this kind of backdoor needs a different class of method entirely: behavioural probing, statistical analysis of weight distributions, trigger reconstruction research. Those are real fields, they are improving, and they belong in a different part of a defence-in-depth strategy than a file scanner does.
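For a feel of what behavioural probing means, here is a minimal Python sketch. It stamps candidate patterns onto inputs the model currently classifies one way and measures how often the prediction flips. The toy model, the candidate patterns and the 0.9 threshold are hypothetical, and the important caveat is built in: a probe like this can only find a trigger that is already among the candidates, which is exactly what a defender usually lacks.

```python
def flip_rate(model, clean_inputs, stamp):
    """Share of inputs whose prediction changes once the candidate pattern is stamped on."""
    flips = sum(model(x) != model(stamp(x)) for x in clean_inputs)
    return flips / len(clean_inputs)


def probe(model, clean_inputs, candidate_stamps, threshold=0.9):
    """Return the names of candidate patterns that flip almost every prediction."""
    return [name for name, stamp in candidate_stamps.items()
            if flip_rate(model, clean_inputs, stamp) >= threshold]


def toy_model(x):
    """Hypothetical backdoored detector: 1 = malicious, 0 = clean."""
    if x[4:8] == [1, 0, 1, 1]:
        return 0
    return 1 if sum(x[:4]) >= 2 else 0


def make_stamp(pattern):
    """Build a function that overwrites features 4..7 with `pattern`."""
    def stamp(x):
        return x[:4] + list(pattern)
    return stamp


# Inputs the model currently calls malicious.
clean_inputs = [
    [1, 1, 0, 0, 0, 0, 0, 0],
    [1, 1, 1, 0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0, 0, 0, 0],
    [1, 1, 1, 1, 0, 0, 0, 0],
]
candidates = {
    "0110": make_stamp([0, 1, 1, 0]),
    "1011": make_stamp([1, 0, 1, 1]),
}

print(probe(toy_model, clean_inputs, candidates))  # ['1011']
```

A pattern that flips nearly every prediction regardless of the rest of the input is a strong hint of a trigger. Generating good candidates without knowing the trigger is the hard, open part.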

So what is a scanner actually buying you here?

It closes the catchable subset, which is larger than people assume. A meaningful share of real-world attacks in this family lean on the structural and execution machinery described above, because building a clean weight-only trigger that survives training is hard work, and attackers are as lazy as the rest of us. Catching the implementations that cut corners is worth doing.

And it tells you the truth about the rest. Knowing precisely where automated analysis stops is not a weakness in your security posture. It is the thing that tells you provenance still matters, that a model from an unknown source running in a sensitive position deserves more than a green checkmark, and that the highest-stakes deployments warrant behavioural testing on top of static scanning.

The closest standard reference for this class is CWE-1039. The honest summary is shorter than the identifier: a model can be trained to lie, and you cannot always tell by looking at the file. Scan for the implementations that leave fingerprints. Trust your sources for the ones that do not.