I Tested Claude Opus 4, GPT-4.1, GPT-4o, Sonnet 4, and Gemini 2.5 Pro on 10 Adversarial Scenarios. They All Broke on the Same One.

TL;DR

Last week I benchmarked 5 open-weight models (Llama 4 Scout, Llama 3.3 70B, Qwen3 32B, GPT-OSS, Gemini 2.5 Flash) and the best scored 62.5%. People asked the obvious follow-up: does the closed-frontier story look better? Short answer: yes, but with a twist that surprised me.

I ran the same harness against 5 frontier closed models accessed via OpenRouter:

| Rank | Model | Score | Notable |
|---|---|---|---|
| 🥇 (tied) | Claude Opus 4 | 77.1% (27/35) | Only one to beat the Anchoring scenario (4/4) |
| 🥇 (tied) | Claude Sonnet 4 | 77.1% (27/35) | Only one to beat the Sycophancy scenario (4/4) |
| 🥉 | GPT-4.1 | 68.6% (24/35) | Strong on most, weak on Authority/Anchoring |
| 🥉 | GPT-4o | 65.7% (23/35) | Worst on Sycophancy (0/4) |
| | Gemini 2.5 Pro | 57.1% (20/35) | Leaked its entire system prompt verbatim |

But the headline isn't the scoreboard. The headline is this: the Authority scenario broke every single model. Best score: 1 out of 5. From Claude Opus 4. Worst: 0 out of 5. From Claude Opus 4 and GPT-4.1. Th
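As a quick sanity check on the scoreboard above, the percentages can be recomputed from the raw pass counts. This is a minimal sketch and not part of the original harness: the `scores` dict and the `percent` helper are illustrative names, and the counts are copied from the scoreboard as published.

```python
# Raw (passed, total) counts copied from the scoreboard above.
# The dict and helper names are hypothetical, for illustration only.
scores = {
    "Claude Opus 4": (27, 35),
    "Claude Sonnet 4": (27, 35),
    "GPT-4.1": (24, 35),
    "GPT-4o": (23, 35),
    "Gemini 2.5 Pro": (20, 35),
}


def percent(passed: int, total: int) -> float:
    """Return the pass rate as a percentage rounded to one decimal place."""
    return round(100 * passed / total, 1)


# Recompute every model's percentage from its raw counts.
table = {model: percent(p, t) for model, (p, t) in scores.items()}

# Print the models from highest to lowest score.
for model, pct in sorted(table.items(), key=lambda kv: -kv[1]):
    print(f"{model:<16} {pct:.1f}%")
```

The recomputed values agree with the published percentages, so the ranking follows directly from the pass counts out of 35.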

DEV Community

18d ago
