The Best AI Bench Test Yet

Every’s Consultancy made the top AI models compete in a game of Diplomacy.

6/9/2025, 2 min read

This level of cleverness brings tears to my eyes

a. Tears of Envy - Why can I not be that clever?

b. Tears of Admiration - I love that people are that creative and clever!

LLMs, Lies & World Domination:

Reading how 18 of today’s most advanced language models fought for control of a 1901 Europe map is one of the most fascinating reads of the year so far.

Congratulations to the guys at Every Consultancy (every.to).

Why Diplomacy Is the Perfect Game for AI Analysis

Diplomacy

The game

Diplomacy is loosely similar to Risk: each “great power” of 1901 Europe starts with a handful of armies and aims to conquer the map.

Whereas Risk uses the luck of the dice to determine battles, the players in Diplomacy get five private or public messages per turn to negotiate, threaten, or sweet-talk the others.

No dice, no RNG; raw persuasion plus crisp tactics decide who survives.

AI Diplomacy takes the classic board game and replaces every human commander with an LLM.

The deceiver wins

LLMs have a personality

OpenAI’s o3 emerged as the grandmaster because of its capacity to deceive.

In one game it convinced the other LLMs to team up, only to back-stab them all. Evidence of o3’s dastardly plans could be found in its reasoning logs, where it wrote about the “right time” to “exploit German collapse” before knifing its ally.

Meanwhile, Claude 4 was rather naïve, believing in and hoping for peaceful resolutions at every opportunity. A wonderful aspiration, but way too gullible!

Google’s Gemini 2.5 Pro was the lone model besides o3 to claim a full victory, blitzing the board with stunning tactics, but lost too often to o3's trickery.

Honourable mentions go to DeepSeek’s fiery R1 (trash-talk on a budget) and Meta’s lean Llama 4 Maverick, small yet sneaky. (every.to)

Real human behaviour

Why it matters.
Benchmark fatigue is real: most leaderboards are now speed-run by cutting-edge models. By turning trust-testing into a live, multi-hour drama, AI Diplomacy surfaces qualities that sterile multiple-choice exams miss: deception tolerance, alliance-building chops, and long-range planning. If labs optimize what we measure, this kind of “stress test” is gold for steering future model behaviour.