We put Model Scan up against the leading scanners across thousands of real models, malicious and benign. Here is how we set the test up so the result could not flatter us, what the numbers said, and where we are honest about what remains unsolved.

It is easy to claim a security tool is good. It is harder to prove it in a way that could have embarrassed you, which is the only kind of proof worth publishing. So before any numbers, here is how we built the test to keep ourselves honest, because the methodology is the part that should earn or lose your trust.

How the test was set up

We used two completely separate corpora.

The first is benign: thousands of the most-downloaded real models from a public hub, the ordinary artifacts that working teams pull every day. This corpus measures the metric we care about most, the false-positive rate. We have said it throughout this series and we mean it as a design law: a false alarm on a popular, clean model is a worse failure than missing an exotic threat, because the false alarm is what makes people turn the scanner off.

The second is malicious: a large set of confirmed-bad files, gathered by hunting down models that real scanners have flagged and verifying the threats by hand at the opcode level. This corpus measures recall, how much real malice a scanner actually catches.

For ground truth on the malicious side we did not grade our own homework. We treated a file as truly malicious only when an independent, established scanner confirmed it, and for the strictest version of the recall claim we used the verdicts of the single most precise of the leading scanners as the bar to clear. We also built in a contamination guard so that a file could only be counted as benign-but-we-flagged-it when an independent scanner agreed it was clean, which keeps us from quietly hiding our own mistakes.
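The two labelling rules above can be sketched in a few lines. This is a minimal illustration rather than the benchmark harness itself: the `Verdict` record, its field names, and the sample file names are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Verdict:
    """Verdicts for one file. All names here are illustrative assumptions."""
    file_id: str
    ours_flagged: bool          # did the scanner under test flag the file?
    independent_flagged: bool   # did an independent, established scanner flag it?


def is_confirmed_malicious(v: Verdict) -> bool:
    # Ground truth on the malicious side comes from the independent scanner,
    # never from our own verdict.
    return v.independent_flagged


def counts_as_false_positive(v: Verdict) -> bool:
    # Contamination guard: a flagged file only counts as a false positive
    # when an independent scanner agreed the file was clean.
    return v.ours_flagged and not v.independent_flagged


# Toy sample, not benchmark data.
benign_sample = [
    Verdict("clean-model.safetensors", ours_flagged=False, independent_flagged=False),
    Verdict("code-carrying-model.pkl", ours_flagged=True, independent_flagged=True),
    Verdict("hypothetical-mistake.pkl", ours_flagged=True, independent_flagged=False),
]

false_positives = [v.file_id for v in benign_sample if counts_as_false_positive(v)]
print(false_positives)  # ['hypothetical-mistake.pkl']
```

In this toy sample the second file is flagged by both scanners, so it is not counted as a false positive. The third file is the only kind that lands in the false-positive column.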

We are not naming the other scanners in this post. The point here is not to run down specific competitors. It is to show where an objective, reproducible test lands. We will call them the leading scanners and leave it there.

What the numbers said

On the benign corpus, we raised zero false positives on genuinely clean models. We did flag a handful of files across the whole benign set, but those were not clean models. They genuinely carry serialised executable code, the kind a careful scanner is right to call out for provenance review, and the most precise of the leading scanners treats that class the same way. With those set aside, our false-positive count on clean models is zero. That matches the most precise scanner in the field and beats the more aggressive of the two leading scanners, which over-flagged real, clean models at a rate high enough to generate exactly the alert fatigue we keep warning about.

On the malicious corpus, against the strictest bar, the verdicts of the most precise leading scanner, we caught everything it caught and then some. Zero misses against that confirmed set, plus additional real threats on top. In set terms we are a strict superset of it: nothing it finds gets past us, and we find things it does not.
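The "strict superset" wording has a direct reading in code. The sketch below assumes each scanner's detections are available as a set of file identifiers. The identifiers are made up for illustration and are not benchmark data.

```python
# Hypothetical file identifiers, for illustration only.
reference_detections = {"m001", "m002", "m003"}        # most precise leading scanner
our_detections = {"m001", "m002", "m003", "m104"}      # scanner under test

misses = reference_detections - our_detections   # files it caught that we did not
extras = our_detections - reference_detections   # files we caught that it did not

# On Python sets, ">" tests for a proper (strict) superset:
# every reference detection is ours, and we have at least one more.
is_strict_superset = our_detections > reference_detections

print(sorted(misses))       # []
print(sorted(extras))       # ['m104']
print(is_strict_superset)   # True
```

An empty `misses` set corresponds to "zero misses against that confirmed set", and a non-empty `extras` set corresponds to "and then some".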

Against the broader consensus of what the leading scanners collectively consider malicious, we caught about ninety percent. The better of the two leading scanners landed around seventy-nine percent on the same set, and the other around seventy-five. We lead, and we are honest below about the slice we do not yet reach.
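For readers who want the definition pinned down, recall here is the share of consensus-malicious files that a scanner flagged. A minimal sketch follows. The function name and the toy identifiers are assumptions, and the numbers are chosen only to make the arithmetic visible.

```python
from __future__ import annotations


def recall(flagged: set[str], confirmed_malicious: set[str]) -> float:
    """Share of confirmed-malicious files that the scanner flagged."""
    if not confirmed_malicious:
        raise ValueError("recall is undefined on an empty malicious set")
    return len(flagged & confirmed_malicious) / len(confirmed_malicious)


# Toy data, not benchmark results: ten consensus-malicious files,
# nine of which the hypothetical scanner flagged.
consensus = {f"m{i}" for i in range(10)}
flagged = {f"m{i}" for i in range(9)} | {"unrelated-file"}

print(recall(flagged, consensus))  # 0.9
```

Flags on files outside the consensus set do not raise recall. They are judged separately, on the false-positive side.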

Coverage was the quiet landslide. Coverage is the share of files a scanner actually returns a verdict on, and it matters because a file you skip is a file you cannot catch, no matter how good your detectors are. We returned a verdict on every file. The leading scanners each skipped a meaningful fraction: one of them around a tenth of the files, the other around a quarter. A scanner that silently declines to look at a file is not neutral about that file. It is blind to it, and so are you.
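Coverage as defined above is simple to compute once skipped files are recorded explicitly. The sketch below assumes a mapping from file name to verdict in which `None` means the scanner returned no verdict. That representation and the file names are illustrative assumptions.

```python
from __future__ import annotations

from typing import Optional


def coverage(verdicts: dict[str, Optional[bool]]) -> float:
    """Share of files that received any verdict. None marks a skipped file."""
    if not verdicts:
        raise ValueError("coverage is undefined for an empty corpus")
    scanned = sum(1 for v in verdicts.values() if v is not None)
    return scanned / len(verdicts)


def skipped_files(verdicts: dict[str, Optional[bool]]) -> list[str]:
    """Files the scanner never looked at, and so could never have caught."""
    return sorted(name for name, v in verdicts.items() if v is None)


# Toy verdicts, not benchmark data: True = flagged, False = passed, None = skipped.
toy_verdicts = {"a.bin": False, "b.pkl": True, "c.pt": None, "d.pkl": False}

print(coverage(toy_verdicts))       # 0.75
print(skipped_files(toy_verdicts))  # ['c.pt']
```

Keeping "skipped" distinct from "passed" is the design point: a report that folds the two together makes a blind spot look like a clean result.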

Put together, that is the claim, stated carefully: across thousands of real models, we matched the most precise scanner's zero false positives on clean models, caught everything it confirmed and more, led on overall recall, and were the only scanner to look at every single file. We also caught dozens of genuinely malicious files that the leading scanners scanned and marked safe, and we verified every one of those by hand at the opcode level rather than asserting them, because unique catches you cannot defend are just noise.

The part we keep ourselves honest about

A benchmark you cannot lose is a benchmark you rigged, so here is where the result is bounded.

The recall figure is recall against what independent scanners confirm as malicious. It is not, and cannot be, recall against all possible malware. The weight-space backdoors from the fourth post in this series, the models trained to lie only on a secret trigger, are out of scope for static analysis across the entire field, ours included. No file scanner catches those, and any that claims to is selling something.

The other scanners' verdicts are a snapshot. They change as those vendors update, and the comparison reflects a point in time, not a permanent ranking. We are confident in the shape of the result (the zero-false-positive precision, the superset recall, the full coverage) because those come from how the tool is built rather than from a lucky sample. But we would rather you hold the exact percentages loosely and the methodology tightly.

And the rule from the very first post still governs everything here. A clean result is a floor, not a certificate. It means the known threats are absent. It does not mean the author is your friend. The benchmark says our floor is higher than the field's. It does not repeal the need to know where your models come from.

Why we are publishing this at all

Because the alternative to a measured, bounded, reproducible claim is a marketing number, and the model-security space already has enough of those. We would rather show you a test designed to catch us out, tell you exactly where it does, and let the honest version of the result stand on its own. If you want to pressure-test it, that is the right instinct, and it is the instinct this whole series was written to encourage.

What we found when we benchmarked against the field
