by 🅂🅄🄽🄽🅈 🄱
update 8/11/25: I've also shared a shorter thread about this on X

Introduction
LLMs have been smashing through benchmarks recently, with notable results from both DeepMind and OpenAI demonstrating gold-medal performance on the 2025 IMO. Evaluations have been improving monthly across domains, but even now, we still see simple mistakes in spelling-related questions!

ScrabbleBench evaluates models by their ability to play the classic word game Scrabble®. This project uses a minimal webapp for viewing the game, Kamil Mielnik's open-source Scrabble solver, and the OpenRouter API to coordinate between models. Source code and instructions for running are available on GitHub.
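To make the setup concrete, here's a minimal sketch of how a single turn prompt could be sent to a model through the OpenRouter chat completions API. The model IDs, prompt text, and helper name are illustrative, not the repo's actual code:

```python
# Minimal sketch of one model call through the OpenRouter chat completions API.
# The prompt text and model IDs are illustrative, not the exact ones used in ScrabbleBench.
import os
import requests

OPENROUTER_URL = "https://openrouter.ai/api/v1/chat/completions"

def ask_model_for_move(model_id: str, game_state_prompt: str) -> str:
    """Send the current board/rack description to a model and return its raw reply."""
    response = requests.post(
        OPENROUTER_URL,
        headers={"Authorization": f"Bearer {os.environ['OPENROUTER_API_KEY']}"},
        json={
            "model": model_id,  # e.g. "openai/gpt-5" or "google/gemini-2.5-pro"
            "messages": [{"role": "user", "content": game_state_prompt}],
        },
        timeout=600,  # thinking models can take several minutes per attempt
    )
    response.raise_for_status()
    return response.json()["choices"][0]["message"]["content"]
```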
Seeing how these AI models can reason through gameplay allows us to have evaluations that naturally scale as models become stronger. I test both self-play and head-to-head tournaments against other top models to evaluate game performance using a slightly modified ruleset. DeepMind recently announced a related but more organized approach to LLM gaming with the Kaggle Game Arena. They kicked it off with a live-streamed chess tournament between models; I'm excited to see the additional games expected in the future!
Current Ranking (Self-Play Aug 2025)
Each model played a 4-player game as 4 clones of itself under a modified ruleset (see methodology notes). The seed of each game is fixed, so the initial tile distribution is the same for all models.
OpenAI's GPT-5 leads the way on the self-play average score, narrowly beating out Google's Gemini 2.5 Pro around the 150-point mark. There's another cluster of models around 120, with Claude Sonnet 4, Grok 3, and Qwen3-235B-thinking neck and neck. Generally, thinking models outperformed non-thinking variants. Surprisingly, Opus underperformed Sonnet and Grok 4 underperformed Grok 3, but with only 4 sets of actions per model, the sample size is still small.
Performance Metrics
- Avg Score: Mean score of all 4 clones of the model in the self-play game (higher is better)
- Avg Word Percentile: Each turn, calculate the percentile of the score of the played word against all other playable words (higher is better)
- Median # Possible Words: The median across the game of how many possible words there were each turn. Lower indicates harder rounds.
- Solve %: How often the model produced a valid word, broken down by the # of possible words that turn: Easy (over 250 words), Medium (100-250), Hard (under 100 words).
- Median Turn Time: How long, in seconds, the median turn by this model took. Each model can have up to 6 inference calls per turn.
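For concreteness, here's a rough sketch of how these metrics could be computed from per-turn logs. The record fields (played_score, candidate_scores, solved) are assumed names, not the repo's actual schema:

```python
# Rough sketch of the per-model metrics above, computed from assumed per-turn log records.
from statistics import mean, median

def word_percentile(played_score, candidate_scores):
    """Percentile of the played word's score among all playable words that turn."""
    beaten = sum(1 for s in candidate_scores if s <= played_score)
    return 100.0 * beaten / len(candidate_scores)

def difficulty(num_possible):
    """Bucket a turn by how many legal words were available."""
    if num_possible > 250:
        return "easy"
    return "medium" if num_possible >= 100 else "hard"

def summarize(turns):
    """turns: list of dicts with assumed keys played_score, candidate_scores, solved."""
    solved_turns = [t for t in turns if t["solved"]]
    solve_pct = {}
    for bucket in ("easy", "medium", "hard"):
        in_bucket = [t for t in turns if difficulty(len(t["candidate_scores"])) == bucket]
        if in_bucket:
            solve_pct[bucket] = 100.0 * sum(t["solved"] for t in in_bucket) / len(in_bucket)
    return {
        "avg_word_percentile": mean(
            word_percentile(t["played_score"], t["candidate_scores"]) for t in solved_turns
        ),
        "median_possible_words": median(len(t["candidate_scores"]) for t in turns),
        "solve_pct": solve_pct,
    }
```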

Given this self-play ranking, I continued with OpenAI's GPT-5, Google's Gemini 2.5 Pro, Anthropic's Claude Sonnet 4, and xAI's Grok 3 for the tournament.
Top 4 Tournament (Aug 2025)
I ran 4 games between the models, shifting the player # by one each game so that each model got to play in every position (1-4). Games were tight between Gemini and GPT-5! GPT-5 managed to win 2, perfectly tie with Gemini in 1, and lose to Gemini in another. Sonnet generally outperformed Grok 3 by a large margin due to some crushingly low scores for Grok in games 1 & 2.

Videos of the games can be found here:
Methodology Notes
A series of modifications to the rules were necessary to make this game playable for models.
Prompting
Each round, I prompted the model with the rules of Scrabble (including tile values and board bonuses like double word/letter), the state of the board, the tile rack, other players' scores, and the number of tiles remaining. Each player was given 6 chances to produce a valid move. If an invalid word was picked, I re-prompted and noted that that specific word was invalid. The models were asked to briefly explain their reasoning and then state their move (e.g. playword BOY, exchange Y,J,M), which I parsed and executed.
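Here's a simplified sketch of that per-turn loop: up to 6 attempts, with each rejected word noted on the following re-prompt. build_turn_prompt and the game.* methods are placeholder names, and the parsing is a stripped-down stand-in for whatever the repo actually does:

```python
import re

# Simplified patterns for the move formats described above ("playword BOY", "exchange Y,J,M", "pass").
PLAY_RE = re.compile(r"\bplayword\s+([A-Za-z]+)", re.IGNORECASE)
EXCHANGE_RE = re.compile(r"\bexchange\s+([A-Za-z](?:\s*,\s*[A-Za-z])*)", re.IGNORECASE)
PASS_RE = re.compile(r"\bpass\b", re.IGNORECASE)

def parse_move(reply: str):
    """Pull the command out of the model's reply; the explanation text is ignored."""
    play = PLAY_RE.findall(reply)
    if play:
        return "playword", [play[-1].upper()]  # take the last word the model commits to
    exchange = EXCHANGE_RE.findall(reply)
    if exchange:
        return "exchange", [t.strip().upper() for t in exchange[-1].split(",")]
    if PASS_RE.search(reply):
        return "pass", []
    return None

def take_turn(model_id, game, player, max_attempts=6):
    """Ask the model for a move, re-prompting with the invalid words noted, up to 6 attempts."""
    invalid_words = []
    for _ in range(max_attempts):
        prompt = build_turn_prompt(game, player, invalid_words)  # placeholder prompt builder
        reply = ask_model_for_move(model_id, prompt)             # from the earlier OpenRouter sketch
        move = parse_move(reply)
        if move is None:
            continue                                             # unparseable reply: retry
        action, args = move
        if action == "pass":
            return game.pass_turn(player)                        # placeholder game API
        if action == "exchange":
            return game.exchange_tiles(player, args)
        if action == "playword" and args:
            word = args[0]
            if game.try_place_word(player, word):                # solver-backed placement check
                return word
            invalid_words.append(word)                           # noted in the next re-prompt
    return game.pass_turn(player)                                # out of attempts: forfeit the turn
```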

Word Placement
Initially, I wanted the models to specify exactly where to play their preferred word using an (x,y) coordinate system. I quickly saw that no model was able to do this consistently, often trying to place words in invalid positions -- sometimes off by one, sometimes in totally strange, unconnected places. As a result, I switched to just asking for the desired word. I run a solver to find a placement of that word if possible; otherwise I re-prompt and mention that that specific word is invalid.
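A sketch of that benefit-of-the-doubt placement step, assuming a solver that can enumerate all legal moves for the current board and rack (solver.enumerate_moves is a stand-in for the actual solver interface):

```python
# Given the model's chosen word, scan all legal moves the solver can enumerate and keep the
# highest-scoring placement of that word, if any exists.
def best_placement_for_word(solver, board, rack, word):
    """Return the highest-scoring legal placement of `word`, or None if it can't be placed."""
    candidates = [
        move for move in solver.enumerate_moves(board, rack)  # all legal (word, position, score) moves
        if move.word.upper() == word.upper()
    ]
    return max(candidates, key=lambda m: m.score, default=None)
```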
Player Order & End Conditions
The first player in Scrabble gets a double word score right off the bat. This can often lead to an advantage, so to try to mitigate it, the self-play rounds evaluate the experience of the model across all positions in aggregate. For the tournament, I rotate player orders. The game can end in a variety of ways due to successive passes or a player finishing their tiles; I end the game when 4 successive passes happen or a large turn count is reached (60 turns). I ignore the scoring rules around the tiles other players hold at the end. These changes focus the game scoring on word building and try to even out the playing field (all else equal, you might expect the first player to finish their tiles first!).
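The modified end condition boils down to a simple check like this (a sketch, using the constants from above):

```python
# Stop the game after 4 consecutive passes or a hard turn cap, instead of standard end-of-game scoring.
MAX_TURNS = 60
MAX_CONSECUTIVE_PASSES = 4

def game_over(turn_count: int, actions_so_far: list[str]) -> bool:
    """actions_so_far holds the action type ('play', 'exchange', 'pass') of each turn taken."""
    if turn_count >= MAX_TURNS:
        return True
    last = actions_so_far[-MAX_CONSECUTIVE_PASSES:]
    return len(last) == MAX_CONSECUTIVE_PASSES and all(a == "pass" for a in last)
```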
Behavior Quirks
Over the many games tested, there are a few interesting behaviors I noticed from these LLMs that differ from casual human play.
Rack Blinders
A tendency across models is to focus only on the letters in the rack. Most of the invalid words actually come from this set, where the model might find a decent word in its rack and just claim that it can play it somewhere, even if there is no valid connection to an existing word on the board, or it hallucinates letters it doesn't have.
Bad Placement Reasoning Example: Using the blank as an 'I' and forming 'EXCRUCIATING' vertically from the 'X' in 'CURD' allows a high-scoring play with the 'X' on a triple-letter score. This creates multiple crosswords and maximizes points with the 'K' and 'X' bonuses. The 50-point bonus for using all tiles is also achieved.
Parallel Plays
The models often play their fully contained rack words parallel to other words, creating a bunch of small perpendicular words in the process. This is impressive at first glance, since it requires knowledge of many small 2-3 letter words. However, while the reasoning sometimes points to this driving their word choice, I think the solver workflow is another culprit. Using the solver, I find the optimal placement of the word that they suggest (benefit of the doubt). This means if they're just playing off their rack without looking at the board, they can still "get lucky" and have their word placed.

Frequent Exchanging
Most casual players would rather play tiny words than exchange and forfeit a round. In contrast, AI models tend to hold out for great racks and exchange frequently, to the detriment of the viewer. It's hard to say whether this is optimal or not; they don't always find/play the best words, but it's not uncommon to watch players exchange multiple times in a row. The more aggressive (playing more words) a model is in this sea of competition, the faster it can pull away.

Costs
Costs vary dramatically between these models. I found it pretty surprising just how expensive it was to use some of these thinking models. Gemini Pro was the worst offender here -- it would consume a ton of thinking tokens, only to output an invalid word and charge up to 20c for just a single try. A 4-player self-play game with Gemini costs ~$25. GPT-5-high came in at roughly half the price of Gemini. Both models would take several minutes to make an attempt.
On the other hand, Gemini Flash is truly impressive, getting great results while costing a few cents per game! It's fast enough that watching the Scrabble board ends up feeling satisfying. I occasionally use Flash for bulk tasks, and these results line up with my personal experience that the bang for the buck is insane. Perhaps with more games to address robustness, GPT-5-nano could be a competitor here on Scrabble.
Future Research
Well, where do we go from here? I'll describe a few extensions that I think could be interesting.
update 8/16/25: see my dataset, which attempts to address (1) and (2) below. It's also in the GitHub repo as dataset.json.
(1) Robustness Check
Every game seed could give a different ranking, but spending a ton of money benchmarking the models wasn't my goal here, even though that's probably the fastest way to get a more definitive answer! However, another approach is to have all models play the same game together. At every round, you ask every model to pick a word given the board and rack. You can then score these words against each other. This is way fairer than simulating separate real games, as each model is evaluated on its ability to solve the same problem.
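A sketch of what that shared-turn evaluation could look like; get_model_word and score_word are placeholders for a model call and a solver-backed scorer:

```python
# Every model answers the same board + rack, and each answer is scored against the full set
# of solver-enumerated plays for that turn.
def rank_models_on_turn(model_ids, board, rack, candidate_scores, get_model_word, score_word):
    """Return each model's percentile for its chosen word on one shared turn."""
    percentiles = {}
    for model_id in model_ids:
        word = get_model_word(model_id, board, rack)   # identical prompt for every model
        score = score_word(board, rack, word)          # 0 if the word can't legally be placed
        beaten = sum(1 for s in candidate_scores if s <= score)
        percentiles[model_id] = 100.0 * beaten / len(candidate_scores)
    return percentiles
```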
(2) Turn Dataset
By running this approach with a model like Gemini Flash, you can cheaply generate a whole dataset of Scrabble boards, racks, and possible scoring words. This frees you from having to play the full games out and lets you do the experiment above (or whatever RL tasks you want!).
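One plausible shape for such a turn record (not necessarily the actual schema of dataset.json in the repo) might look like this, with dummy values for illustration:

```python
# Illustrative turn record: the board, the rack, and the solver's scored candidate words.
# All values below are dummies, not data from the actual dataset.
import json

example_record = {
    "board": ["." * 15 for _ in range(15)],       # 15x15 grid, '.' for empty squares
    "rack": ["A", "E", "R", "T", "S", "N", "_"],  # '_' as a blank tile
    "candidates": [                                # solver output for this position
        {"word": "EXAMPLE", "score": 0},
    ],
}

print(json.dumps(example_record, indent=2))
```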
(3) Human Play
If this were more cost-effective and timely to run, I'd love to pit GPT-5 and Gemini 2.5 Pro against actual players. In its current state, it's difficult to get a realistic benchmark of how these compare to human play -- I suspect that reusing the single seeded game from this experiment would be insufficient (plus I've seen many of the great words while analyzing these games)!