TL;DR
Last week I benchmarked 5 open-weight models (Llama 4 Scout, Llama 3.3 70B, Qwen3 32B, GPT-OSS, Gemini 2.5 Flash) and the best scored 62.5%. People asked the obvious follow-up: does the closed-frontier story look better?
Short answer: yes, but with a twist that surprised me.
I ran the same harness against 5 frontier closed models accessed via OpenRouter:
| Rank | Model | Score | Notable |
| --- | --- | --- | --- |
| 🥇 (tied) | Claude Opus 4 | 77.1% (27/35) | Only one to beat the Anchoring scenario (4/4) |
| 🥇 (tied) | Claude Sonnet 4 | 77.1% (27/35) | Only one to beat the Sycophancy scenario (4/4) |
| 🥉 | GPT-4.1 | 68.6% (24/35) | Strong on most, weak on Authority/Anchoring |
| 🥉 | GPT-4o | 65.7% (23/35) | Worst on Sycophancy (0/4) |
| | Gemini 2.5 Pro | 57.1% (20/35) | Leaked its entire system prompt verbatim |
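For anyone who wants to sanity-check the scoreboard, here is a minimal sketch of the arithmetic behind the percentages. It assumes each score is simply passed checks divided by the 35 total shown in the "x/35" figures, rounded to one decimal place; the names `passes`, `score_pct`, and `scores` are mine for illustration, not part of the harness.

```python
# Pass counts as reported in the scoreboard above.
# Assumption: score = passed / 35, rounded to one decimal place.
TOTAL_CHECKS = 35

passes = {
    "Claude Opus 4": 27,
    "Claude Sonnet 4": 27,
    "GPT-4.1": 24,
    "GPT-4o": 23,
    "Gemini 2.5 Pro": 20,
}


def score_pct(passed: int, total: int = TOTAL_CHECKS) -> float:
    """Return the pass rate as a percentage rounded to one decimal place."""
    return round(100 * passed / total, 1)


# Hypothetical helper structure: model name -> percentage score.
scores = {model: score_pct(n) for model, n in passes.items()}

# Print highest score first; ties keep their insertion order.
for model, pct in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{model:<16} {pct:.1f}% ({passes[model]}/{TOTAL_CHECKS})")
```

The rounded values match the percentages reported above, so the scoreboard is internally consistent on that front.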
But the headline isn't the scoreboard. The headline is this: the Authority scenario broke every single model. Best score: 1 out of 5. From Claude Opus 4. Worst: 0 out of 5. From Claude Opus 4 and GPT-4.1.
Th