The Highs and Lows of LLMs in Strategy Games
Reconciling strategic reasoning and tactical execution in language models
In my last piece, I introduced the idea that different Large Language Models may exhibit distinct, coherent, and complex ‘playing styles’ with consistency across iterations of a given game. Granted, this applied to a very simple game (the Prisoner’s Dilemma) with a very limited game space. However, it raises questions about what the latest generation of LLMs can and cannot do in strategy games. As mentioned last time, benchmarking LLMs by pitting them against each other in various games has become somewhat trendy lately, providing nuances that traditional benchmarks often fail to capture. Consequently, virtual arenas for games such as Connect 4, code names, chess, and even Street Fighter, quickly emerged.
Diplomacy, in particular, has been the subject of several tests, from my own work to Sam Paech’s EQBench (Emotional Intelligence Benchmarks for LLMs) and SPINBench, developped by a talented team of researchers from Princeton and the University of Texas. The reason why Diplomacy is such an interesting theatre to test AI (so much so, in fact, that 2 pieces on this substack have already been devoted to it) is because it elicits a perfect blending of strategic reasoning, negociation skills, and tactical prowess, to be successful at it.
Today we will explore how LLMs fare in Diplomacy (and a few other games), where they struggle and why, discuss why this might not be a significant issue going forward, and what potential solutions might look like?
How LLMs play Diplomacy
My experiment took the form of a classic game of Diplomacy with limited press, meaning players could only send up to five messages to each other between rounds. The models involved were: o1, Gemini 2.0, Claude 3.5 sonnet, DeepSeek R1, Mistral Large, Llama 3.3 70B, and Qwen Max. To begin with, I first asked them to provide an ordered list of which nation they would prefer to play and already found a few interesting things. See, in Diplomacy, not all nations are considered equal. Statistically, France and Russia accumulate the most wins, while Italy and Austria perform pretty poorly. This unbalance is also generally agreed upon in the Diplomacy community. As a result, I was expecting the models to provide similar rankings, and yet they did not end up being as homogeneous as I thought. Notably, DeepSeek named Turkey and England as its top choices, and Russia and Austria as its bottom 2, because it cleverly considered that under communication constraints some nations are easier to manage than others:
As the games went on, I noticed tendencies that were later confirmed by Sam Paech’s own experiments: o1 was overall the best player in terms of pure tactics and game understanding, DeepSeek exhibited the most agency in negociations, often proactively suggesting moves or courses of action to others, Gemini 2.0 showed a pretty poor gameplay and occasionally stopped negociating entirely, while Claude was somewhat well balanced all-around. This aligns with the idea that these models develop distinct, relevant playing-styles and preferences that remain consistent in a given strategic setting.
Another really interesting phenomenon that frequently emerged in my games took the form of what I would call ‘mutual hallucinations’. For instance, one model would sometimes make what looked like really agressive moves against another model, occupying border provinces, and claiming their purpose was to ‘reinforce an ally’. Which… seemed totally plausible for the targeted model to which it posed no problem at all, on the contrary. They would then happily carry on, both treating these tactically questionable moves as signs of cooperation, their relationship growing stronger as a result. These kinds of moves made no sense but they effectively strengthened diplomatic ties between some models.
At first glance, this seems absurd. But then again, so do many human behaviours when judged purely by logical or tactical standards. Let’s not forget that humans too, in Diplomacy and elsewhere, routinely engage in interactions that are not logically justified and that can look incoherent to most observers. A common, subtle example from competitive chess: even at a very high level it is not uncommon for both players to implicitly avoid calculating certain critical lines, not because they are unsound, but because of shared assumptions or stylistic blind spots. Biases and shared habits often reinforce suboptimal equilibria and LLMs, like humans, are just as vulnerable to getting trapped in them.
If you are interested in reading the full logs of these games, you can have a look at Sam’s EQBench (here’s an example of a game played by o1 as Austria, including very nice charts), or at the SPIN-Bench.
Analysis
So, how well do these LLMs play? To assess that, let’s go back to the three metrics mentioned earlier making a Diplomacy player succesful: strategy, diplomacy, and tactics. If we were to consider each point individually, these models (at least the best ones like o1, DeepSeek, or Claude) would not do too badly. Negotiation is their strong point, and they do make compelling arguments. Strategy-wise, while they often lose sight of their own long-term goals (we will see why in an instant), they do maintain a good degree of situational awareness and flexibility regarding the ‘geopolitical landscape’. The issue is that as tactics go, their competency in this area is close to 0.
And that undermines everything, since their intents, plans, and agreements, are not properly translated into coherent actions.1 Watching their games feels like a demonstration of ‘high altitude thinking’, akin to politicians discussing and making plans about matters they ultimately do not understand, drawing conclusions that appear coherent from afar but end up being completely detached from the reality on the ground.
What’s the problem?
It is no secret that LLMs struggle with spatial reasoning. Already in the early days of ChatGPT, much fun was made of their tendencies to ‘invent’ moves in chess or make ridiculous decisions while upholding arguments decorrelated from the game board. This has not changed much over the last couple of years and many models still struggle reaching optimal play in games as simple as tic-tac-toe. But why is this issue persisting despite LLMs capabilities growing so fast in so many other domains? Let’s take a look at one of Anthropic’s latest experiment, ‘Claude plays Pokemon’, to highlight the crux of the problem.
Claude plays Pokemon is, as one could guess, Claude 3.7 sonnet playing Pokemon on an emulator, thinking and deciding on its moves between every step. You can even watch it live here. Let’s begin by saying Claude did much better than any of its predecessors, succeeding at collecting 3 badges out of the 8 necessary to reach the end of the game. But then it gets badly stuck and goes into eternal loops, failing to ever find a way out.
There are a few reasons for that. The first, and most prominent one, is that LLMs have a limited context window. They can only process a certain amount of information, that include the system prompt, inputs, conversation history, and output being produced by the model itself. While this context window is usually more than enough for standard usage (GPT4o for instance has a 128 000 tokens limit), it can often become insufficent for long, multi-step reasoning.
In the case of Claude, the amount of information being processed scales very quickly as it adds up after every single step taken in-game. Consequently, it is forced to fully erase the accumulated content of its context window (its ‘memory’) regularly. And this is why its progression is so painful: it has no way of recalling past information up to a certain point back in time. Like Sisyphus, it comes closer and closer to its goal… until it has to start all over again.
Retrieving and updating a given game’s state constantly is also a daunting task on its own, especially when it has to be balanced with actual ‘reasoning’. This results in ‘approximations’ of a world model2, and this is partly why some models struggle providing even valid moves when they are tasked to.
But there is more to it than just memory. At the architectural level, LLMs inherently try to maximise local coherence, not long-horizon utility. Even with Chain-of-Thought techniques, there are only so many ‘hops’ they can operate in the future at the time they are making their move. This means they often fail to carry consequences forward across steps and fall into massive oversights that look simple or obvious from a human perspective.
Last, their tactical abilities strongly rely on the training data they had at their disposal. While there is a case to be made that much of what makes a good strategist, in terms of skills and ways of thinking, is transferable across games and domains, this is not really applicable to tactics. The main way through which you learn tactics is experience. Hence, even for a seasoned strategist, it can take a while before they get the grasp of how tactics work in a game they have never played before, although they will get there eventually. For an LLM, this is inherently much harder. While they can play a number of games somewhat decently, this applies to games they have learned about in their training data. Try introducing them to a new game they have never encountered, and the result is strikingly poor.
Is this really an issue?
Do we actually need LLMs to be great tacticians?
A few weeks ago, in a talk with Ken and Leo at All Souls College in Oxford, I argued that LLMs are not great tacticians, and that they do not need to be so. If you truly want a gifted AI tactician, then self-play RL systems (such as AlphaZero etc) are here for you. Essentially, these two types of AI systems fit different strategic purposes.
LLMs are more adversarial, are excellent tools to explore & simulate behaviours (deception, escalation etc) tied to high-level decision-making, and are just better strategists that can handle complex, open-ended environments.
Self-play RL systems are strong planners and are perfect to reveal novel insights into a given game’s system dynamics (assess optimal play, highlight most effective tactics available, expose holes in the game design).
I think this difference is best embodied by a recent study showing how some models natively elect to ‘cheat’ against strong opponents in games they know they won’t be able to win conventionally. Below is an extract of o1’s reasoning as it manipulates the game state to artificially force Stockfish (the currently leading chess engine) to resign. This is, to me, a perfect example of how differently these two types of AI systems approach strategic problems.
But besides this example, why would LLMs fare any better at strategy if they fail so badly tactically you may ask? Didn’t we see that this tactical deficiency caused them to struggle in Diplomacy despite it being fundamentally a negociation-based game?
Well, the higher you go on the strategic ladder, the more abstract it gets, and the fewer decisions you make. As M.L.R Smith and I framed it, tactics rewards consistency, strategy rewards judgement. I think the key, for these models, is to compress or abstract away the tactical layer without fully decorrelating it from their decision-making. You don’t expect from a senior military commander to micromanage every single unit on the frontline. But you do expect them to be aware of the key dynamics and issues affecting these lower levels. To know what to ask and be able to synthetise and hierarchise this information. Agentic, multimodal architectures, that combine an LLM with various modules handling memory, world model state tracking, RL-based planning, and Retrieval Augmented Generation (RAG) to query specific information, may get us there. The context window limit mentioned earlier as being a major impediment to Claude’s performance may also be improved, as proved by the recent arrival of Gemini 2.5 which boasts an impressive 1M tokens limit, enabling better performance for the kind of long multi-step reasoning typically needed in these games.
LLMs are just as rationally & agency bounded as humans are. Just like us, they operate under limited computational resources and with limited awareness of the problem space, and try to come up with ‘good enough’ answers.
Then there is the argument that they only model behaviours they have seen during their training and are thus unable to come up with ‘new’ knowledge or be context sensitive enough in regard to strategic theory and practice. This is part of a wider debate about whether or not LLMs can recombine learned bits of information to create novel knowledge, and this is a relevant concern, one that I held myself for a long time. Yet, over the last year, the discovery of emergent behaviours & properties in many of these models prompted me to reconsider that stance. The natural emergence of chain-of-thought, theory of mind, and their ability to zero-shot increasingly complex math, code, and logical puzzles that they never encountered before have impressed me greatly. This recent and fascinating Anthropic paper studying Claude 3.5’s internal mechanisms pushed this further, identifying the native appearance of multi-step reasoning and structured planning abilities in the model. And as I have tried to highlight through these small experiments with the Prisoner’s Dilemma and Diplomacy, these models seem to deviate from what they have learned and adapt their play to specific strategic contexts (DeepSeek electing to play as Turkey/England because of the limited press rule I instaured is an example of that).
Concerns remain - after all strategy is not like math or programming - the problems here are non-linear, multi-agents based, data is sparse, and more importantly, assessing what is good or bad strategy is unclear.
Yet I think that even if a scaled-up LLM:
doesn’t plan in the traditional sense,
doesn’t simulate futures step by step,
and lacks persistent internal goals,
…it might still eventually perform effectively in strategic settings because of how it compresses enormous amounts of strategic behaviour into abstractions that are ‘good enough’ across most plausible contexts, like a kind of ‘statistical intuition engine’. Funnily, more often than not, short sighted, ‘move by move’ stastistical coherence might outperform bad & brittle planning. Ultimately, the art of strategy is the art of synthesis, and LLMs are undeniably strong in the latter.
While I still think strategically capable AI systems will not take the form of purely scaled-up LLMs, these models will likely take a central place in these systems. They might take the shape of complex, multimodal architectures that encompass various modules for functions like memory, world model/state tracking, RL-based planning tools, and RAG, all here to support an agentic LLM in its task. This is why, even despite their current limitations, it matters to study the ‘machine psychology’ of these models in strategic settings. To understand why some models play and reason in a certain way and not another, and how internal factors (like training, prompting, and temperature) and external ones (game design, rules, setup) affect this. This shall help us have a better idea of how, when, and where to use these systems in actual real-life strategic conundrums as they inevitably become more and more mature.
The authors behind the SPIN-Bench paper report for instance that even for the most tactically capable models, the combination of these 3 requirements (strategy, diplomacy, tactics) disrupts their already brittle planning capabilities. In their words: “intense social interaction can disrupt planning coherence in otherwise capable LLMs, pointing to a tension between extended chain-of-thought reasoning and the cognitive overhead of real-time alliance-building, deception, or persuasion.” (p.8, https://arxiv.org/pdf/2503.12349)
The term ‘world model’ is used liberally here as LLMs do not build a formal world model per say, but it resembles one.
Very insightful. I particularly appreciated: "the key, for these models, is to compress or abstract away the tactical layer without fully decorrelating it from their decision-making. You don’t expect from a senior military commander to micromanage every single unit on the frontline. But you do expect them to be aware of the key dynamics and issues affecting these lower levels. To know what to ask and be able to synthetise and hierarchise this information"
I also enjoyed reading "On Humility", especially: "strategic goals are the tributaries of global, long-term ‘visions’ of how things should be, and as such are only ever truly successful when carefully implemented, often in a slow and tedious process. Since this endeavour can span long periods of time, it demands a constant attention to gradations, capacities for adaptation, and an agility of mind, all of which are not easy to acquire, master and maintain, even for those most attuned to strategic affairs."