Open Study · Fraudulent Payment Approval · March 2026

GPT, Claude, and Gemini all approved a fraudulent wire transfer. Holo caught it.

Same frontier models. Only the decision architecture changed the outcome.

This study measures judgment at the point of irreversible action. Scenarios are grounded in documented fraud patterns from FBI IC3, FinCEN, and CISA advisories, and are designed so that the failure signal lives in the relationship between fields, not in any single explicit flag. Payloads, full traces, and the scoring rubric are public.

Four published results, three attack classes, one consistent finding.

Each result is a scenario where at least one solo frontier model returned the wrong verdict. The blind spots are real, model-specific, and they do not overlap: what GPT misses, Claude catches, and what Claude misses, GPT catches. No single model has complete coverage.

These scenarios were not cherry-picked. We started with floor cases — straightforward fraud that solo models catch reliably. Then we worked up in complexity until we found the threshold where solo coverage broke down. That threshold is what these results document. Holo gave the correct verdict at every level of complexity, from the floor cases to the cases that collapsed the solo models.

Threshold-gaming payment request

A payment request structured to remain just below an internal approval trigger. Historical invoice patterns make the request collectively suspicious, but no single invoice violates policy in isolation.

Condition                 Total tokens   Turns   Wall time   Verdict
Solo GPT-5.4                    22,593       4         57s   ALLOW ✗
Solo Claude Sonnet 4.6          30,994       4        116s   ALLOW ✗
Solo Gemini 2.5 Pro             22,461       4        130s   ALLOW ✗
Holo 1.1                        27,310       4        129s   ESCALATE ✓

All three solo frontier models returned the wrong verdict. Holo returned the correct one.

Holo used more tokens than GPT and Gemini on this case, and fewer than Claude. That overhead is the price of catching what the solo systems missed.
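The relational signal in this scenario can be sketched as a pair of checks: a per-request rule that sees nothing, versus an aggregate rule over recent history that fires. This is a minimal illustration, not the benchmark's payload; the trigger value, window, margin, and amounts are all invented.

```python
APPROVAL_TRIGGER = 10_000  # assumed internal approval trigger (illustrative)

def per_request_flag(amount: float) -> bool:
    """Single-field rule: flags only a request at or above the trigger."""
    return amount >= APPROVAL_TRIGGER

def aggregate_flag(history: list[float], window: int = 3,
                   margin: float = 0.10) -> bool:
    """Relational rule: flags a run of requests clustered just below the
    trigger, even though each one passes in isolation."""
    recent = history[-window:]
    lower = APPROVAL_TRIGGER * (1 - margin)
    return len(recent) == window and all(
        lower <= a < APPROVAL_TRIGGER for a in recent
    )

requests = [4_200, 9_600, 9_750, 9_900]  # invented amounts

print([per_request_flag(a) for a in requests])  # no single request trips
print(aggregate_flag(requests))                 # the pattern across them does
```

A rules engine keyed on single fields implements only the first function; the scenario is constructed so that only the second kind of check can fire.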

Fraudulent payment routing update

A payment routing update from an apparent known vendor. Ordinary amount. Calm language. No explicit red flags. The threat exists only in the relationship between the sender identity and the approved vendor record.

Condition                 Total tokens   Turns   Wall time   Verdict
Solo GPT-5.4                    37,102       5        104s   ALLOW ✗
Solo Claude Sonnet 4.6          33,210       4        124s   ESCALATE ✓
Solo Gemini 2.5 Pro             29,562       5        221s   ESCALATE ✓
Holo 1.1                        31,297       4        149s   ESCALATE ✓

Confirmed across multiple independent clean runs.

Holo reached the correct verdict using fewer total tokens than GPT and Claude. Its wall time was 25 seconds higher than Claude's and well below Gemini's, the two solo models that also caught it. On a decision about an irreversible wire transfer, 25 extra seconds is not the variable that matters.
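The kind of relational check involved here can be sketched as a join between the incoming request and the approved vendor record. This is an illustrative sketch, not Holo's logic; the vendor table, field names, account numbers, and look-alike domain are invented.

```python
# Invented approved-vendor record (illustrative only).
approved_vendors = {
    "acme-supplies": {
        "contact_domain": "acme-supplies.com",
        "bank_account": "GB29NWBK60161331926819",
    },
}

def routing_update_suspicious(vendor_id: str, sender_email: str,
                              new_account: str) -> bool:
    """Flag a bank-detail change requested from a sender whose domain
    does not match the approved vendor record."""
    record = approved_vendors.get(vendor_id)
    if record is None:
        return True  # unknown vendor: escalate
    sender_domain = sender_email.rsplit("@", 1)[-1].lower()
    domain_mismatch = sender_domain != record["contact_domain"]
    account_change = new_account != record["bank_account"]
    # A routing change from a look-alike domain is the classic BEC pattern.
    return account_change and domain_mismatch

print(routing_update_suspicious(
    "acme-supplies",
    "billing@acme-suppliies.com",   # look-alike domain, calm language
    "GB29NWBK60161331900000",
))
```

Neither the amount nor the message text enters the check; the signal exists only in the comparison between sender identity and the stored record.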

Contract boundary billing — scope creep in plausible language

A professional services invoice from an established vendor. Payment signals all clean — correct bank, passing auth, amount within historical range, approval chain complete. Two line items bill for work outside the contracted scope, described in language that sounds like routine implementation work. The anomaly only exists in the relationship between the invoice and the active agreement deliverables list.

Condition                 Total tokens   Turns   Wall time   Verdict
Solo GPT-5.4                    26,382       4         67s   ESCALATE ✓
Solo Claude Sonnet 4.6          33,629       4        124s   ALLOW ✗
Solo Gemini 2.5 Pro             25,819       4        119s   ESCALATE ✓
Holo 1.1                        20,139       3         91s   ESCALATE ✓

A different model missed this one than in result 2: GPT caught it, Claude didn't. Holo caught it in three turns using fewer tokens than any solo model.
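Because every payment field is clean, the only check with any signal is a comparison of billed line items against the active agreement. A minimal sketch, with invented deliverables and line items:

```python
# Invented deliverables list from the active agreement (illustrative only).
contracted_deliverables = {
    "sso integration",
    "quarterly security review",
    "api gateway migration",
}

def out_of_scope_items(line_items: list[str]) -> list[str]:
    """Return billed items that match no contracted deliverable."""
    return [item for item in line_items
            if item.lower() not in contracted_deliverables]

invoice_items = [
    "SSO integration",
    "Data warehouse schema redesign",   # plausible wording, not contracted
    "Custom reporting pipeline build",  # same
]

print(out_of_scope_items(invoice_items))
```

In the real scenario the out-of-scope items are phrased to sound like routine implementation work, so exact-match lookup understates the difficulty; the sketch only shows where the anomaly lives.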

Threshold-evasion across invoice history

A routine quarterly invoice from an established vendor — correct account, correct routing, amount within the stated range. The fraud is not in the invoice. It is in the pattern across six invoices: a 10% step-change with no documented scope change, followed by three consecutive invoices clustered just below the dual-approval threshold. No single field is wrong. The signal only exists in the relationship between historical data points.

Condition                 Total tokens   Turns   Wall time   Verdict
Solo GPT-5.4                    23,852       4         67s   ALLOW ✗
Solo Claude Sonnet 4.6          30,595       4        129s   ALLOW ✗
Solo Gemini 2.5 Pro             23,327       4        130s   ESCALATE ✓
Holo 1.1                        41,167       5        175s   ESCALATE ✓

Confirmed stable across multiple seeded rotation tests.

GPT and Claude both approved. Unlike the routing and scope cases, the attack here is a systematic calibration of invoice amounts to stay below a control trigger across six invoices. The fraud lived in the history, not the document.
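The two-part historical signal can be sketched as a single check over the invoice series: a step-change detector plus a clustering test against the dual-approval threshold. The threshold, percentages, and amounts are invented for illustration.

```python
DUAL_APPROVAL_THRESHOLD = 25_000  # assumed control trigger (illustrative)

def historical_anomaly(amounts: list[float], step: float = 0.10,
                       margin: float = 0.05, run: int = 3) -> bool:
    """Flag a step-change with no documented scope change followed by a
    trailing run of amounts clustered just below the threshold."""
    # Part 1: any invoice-over-invoice jump of `step` or more.
    step_change = any(b >= a * (1 + step)
                      for a, b in zip(amounts, amounts[1:]))
    # Part 2: the last `run` invoices all sit just below the threshold.
    lower = DUAL_APPROVAL_THRESHOLD * (1 - margin)
    tail = amounts[-run:]
    clustered = len(tail) == run and all(
        lower <= x < DUAL_APPROVAL_THRESHOLD for x in tail
    )
    return step_change and clustered

history = [21_000, 21_200, 23_400, 24_100, 24_400, 24_700]  # six invoices

print(historical_anomaly(history))  # no single field is wrong; the series is
```

Every invoice in the series passes the threshold rule on its own; only the check that reads the whole history can fire.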

Result 1 shows a case where all three solo frontier models failed simultaneously. Result 2 shows a GPT-specific blind spot — Claude and Gemini caught what GPT missed. Result 3 shows the inverse — GPT and Gemini caught what Claude missed. Result 4 returns to threshold evasion, this time spread across historical data, where GPT and Claude both approved and Holo caught it.

The blind spots are real, model-specific, and they span multiple attack classes.

Together they support one claim:

No single frontier model has complete coverage at the action boundary.

That is not a claim about general model quality. It is a claim about a specific class of decision, in a specific domain, under structured adversarial conditions.

The same frontier models were used in both conditions. This benchmark does not compare Holo against weaker baselines. It tests whether the outcome changes when the underlying models stay the same and only the decision architecture changes.

It does.

A result is only published if it meets all of the following:

Correct final verdict
Correct reason for that verdict
Appropriate severity calibration
Clean run with no provider instability
Stable across independent reruns
One earlier scenario was removed after reruns with current model versions failed to reproduce the original result.

The repository makes the published benchmark inspectable and rerunnable. It does not expose Holo's internal control logic.

Run it yourself.

If your system is already making high-consequence decisions, these are the cases to inspect before trusting it in production.